U.S. Department of Energy
Office of Scientific and Technical Information

Monitoring techniques and alarm procedures for CMS services and sites in WLCG

Conference · J.Phys.Conf.Ser.
 [1];  [2];  [3];  [4];  [5];  [4];  [6];  [7];  [8];  [9];  [10];  [11];  [4];  [10];  [11]
  1. UC, San Diego
  2. Bologna U.
  3. Fermilab
  4. CERN
  5. Madrid, CIEMAT
  6. Andes U., Bogota
  7. INFN, Pisa
  8. MIT
  9. Rio de Janeiro State U.
  10. Vilnius U.
  11. Beijing, Inst. High Energy Phys.
The CMS offline computing system is composed of roughly 80 sites (including the most experienced T3s) and a number of central services to distribute, process and analyze data worldwide. A high level of stability and reliability is required from the underlying infrastructure and services. This requirement is partially covered by local or automated monitoring and alarming systems such as Lemon and SLS: the former collects metrics from sensors installed on computing nodes and triggers alarms when values are out of range, while the latter measures the quality of service and warns managers when a service is affected. CMS has established computing shift procedures with personnel operating worldwide from remote Computing Centers, under the supervision of the Computing Run Coordinator at CERN. These dedicated 24/7 computing shift personnel help to detect and react promptly to any unexpected error, and hence ensure that CMS workflows are carried out efficiently and in a sustained manner. Synergy among all the involved actors is exploited to ensure the 24/7 monitoring, alarming and troubleshooting of the CMS computing sites and services. We review the deployment of the monitoring and alarming procedures, and report on the experience gained throughout the first two years of LHC operation. We describe the efficiency of the communication tools employed, the coherent monitoring framework, the proactive alarming systems and the proficient troubleshooting procedures that helped the CMS Computing facilities and infrastructure to operate at high reliability levels.
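The out-of-range alarming that the abstract attributes to sensor-based monitoring can be illustrated with a minimal sketch. This is not Lemon's actual interface or configuration: the metric names, limits, node names and the `out_of_range` helper are all hypothetical, chosen only to show the general idea of comparing collected values against configured ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Allowed range for one metric (hypothetical limits, for illustration only)."""
    low: float
    high: float


def out_of_range(samples, thresholds):
    """Return (node, metric, value) for every sample outside its allowed range."""
    alarms = []
    for node, metric, value in samples:
        limit = thresholds.get(metric)
        if limit is None:
            continue  # no range configured for this metric, so nothing to check
        if not (limit.low <= value <= limit.high):
            alarms.append((node, metric, value))
    return alarms


# Hypothetical metrics and limits; a real deployment would define its own.
thresholds = {
    "cpu_load": Threshold(0.0, 16.0),
    "disk_free_gb": Threshold(50.0, float("inf")),
}

# Hypothetical sensor readings as (node, metric, value) tuples.
samples = [
    ("node01", "cpu_load", 3.2),
    ("node02", "cpu_load", 41.5),
    ("node02", "disk_free_gb", 12.0),
]

alarms = out_of_range(samples, thresholds)
for node, metric, value in alarms:
    print(f"ALARM {node}: {metric}={value} outside allowed range")
```

In this sketch a metric with no configured range is silently skipped; a production system might instead treat a missing range as a configuration error.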
Research Organization:
Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
DOE Contract Number:
AC02-07CH11359
OSTI ID:
1405151
Report Number(s):
FERMILAB-CONF-12-830-CD; 1211477
Conference Information:
Journal Name: J.Phys.Conf.Ser.; Journal Volume: 396
Country of Publication:
United States
Language:
English