DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Grid site availability evaluation and monitoring at CMS

Abstract

The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) uses distributed grid computing to store, process, and analyse the vast quantity of scientific data recorded every year. The computing resources are grouped into sites and organized in a tiered structure. Each site provides computing and storage to the CMS computing grid. Over a hundred sites worldwide contribute resources ranging from a hundred to well over ten thousand computing cores and storage from tens of TBytes to tens of PBytes. In such a large computing setup, scheduled and unscheduled outages occur continually and must not be allowed to significantly impact data handling, processing, and analysis. Unscheduled capacity and performance reductions need to be detected promptly and corrected. For Run 1 of the LHC, CMS developed a sophisticated site evaluation and monitoring system based on tools of the Worldwide LHC Computing Grid. For Run 2 of the LHC, the site evaluation and monitoring system is being overhauled to enable faster detection of and reaction to failures and a more dynamic handling of computing resources. Furthermore, enhancements are planned to better distinguish site issues from central service issues and to make evaluations more transparent and informative to site support staff.

Authors:
 Lyons, Gaston [1]; Maciulaitis, Rokas [2]; Bagliesi, Giuseppe [3]; Lammel, Stephan [1]; Sciaba, Andrea [4]
  1. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  2. Vilnius Univ., Vilnius (Lithuania)
  3. Univ. di Pisa & INFN, Pisa (Italy)
  4. European Organization for Nuclear Research (CERN), Geneva (Switzerland)
Publication Date:
2017-10-01
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP)
OSTI Identifier:
1415641
Report Number(s):
FERMILAB-CONF-16-752-CD
Journal ID: ISSN 1742-6588; 1638611; TRN: US1800845
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal Issue: 9; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
43 PARTICLE ACCELERATORS

Citation Formats

Lyons, Gaston, Maciulaitis, Rokas, Bagliesi, Giuseppe, Lammel, Stephan, and Sciaba, Andrea. Grid site availability evaluation and monitoring at CMS. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/9/092014.
Lyons, Gaston, Maciulaitis, Rokas, Bagliesi, Giuseppe, Lammel, Stephan, & Sciaba, Andrea. Grid site availability evaluation and monitoring at CMS. United States. https://doi.org/10.1088/1742-6596/898/9/092014
Lyons, Gaston, Maciulaitis, Rokas, Bagliesi, Giuseppe, Lammel, Stephan, and Sciaba, Andrea. 2017. "Grid site availability evaluation and monitoring at CMS". United States. https://doi.org/10.1088/1742-6596/898/9/092014. https://www.osti.gov/servlets/purl/1415641.
@article{osti_1415641,
title = {Grid site availability evaluation and monitoring at CMS},
author = {Lyons, Gaston and Maciulaitis, Rokas and Bagliesi, Giuseppe and Lammel, Stephan and Sciaba, Andrea},
abstractNote = {The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) uses distributed grid computing to store, process, and analyse the vast quantity of scientific data recorded every year. The computing resources are grouped into sites and organized in a tiered structure. Each site provides computing and storage to the CMS computing grid. Over a hundred sites worldwide contribute resources ranging from a hundred to well over ten thousand computing cores and storage from tens of TBytes to tens of PBytes. In such a large computing setup, scheduled and unscheduled outages occur continually and must not be allowed to significantly impact data handling, processing, and analysis. Unscheduled capacity and performance reductions need to be detected promptly and corrected. For Run 1 of the LHC, CMS developed a sophisticated site evaluation and monitoring system based on tools of the Worldwide LHC Computing Grid. For Run 2 of the LHC, the site evaluation and monitoring system is being overhauled to enable faster detection of and reaction to failures and a more dynamic handling of computing resources. Furthermore, enhancements are planned to better distinguish site issues from central service issues and to make evaluations more transparent and informative to site support staff.},
doi = {10.1088/1742-6596/898/9/092014},
journal = {Journal of Physics. Conference Series},
number = 9,
volume = 898,
place = {United States},
year = {2017},
month = {10}
}