OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits

Abstract

The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. The total resources at Tier-1 and Tier-2 grid sites pledged to CMS exceed 100,000 CPU cores, while another 50,000 to 100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of stably running at higher and higher scales while introducing new modes of operation such as multi-core pilots, as well as the chaotic nature of physics analysis workflows, places huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the CMS Global Pool has faced since the beginning of the LHC Run II and how they were overcome.

Authors:
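Balcas, J. [1]; Bockelman, B. [2]; Hufnagel, D. [3]; Hurtado Anampa, K. [4]; Aftab Khan, F. [5]; Larson, K. [3]; Letts, J. [6]; Marra da Silva, J. [7]; Mascheroni, M. [3]; Mason, D. [3]; Perez-Calero Yzquierdo, A. [8]; Tiradani, A. [3]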
  1. Caltech
  2. Nebraska U.
  3. Fermilab
  4. Notre Dame U.
  5. NCP, Islamabad
  6. UC, San Diego
  7. Sao Paulo, IFT
  8. Madrid, CIEMAT
Publication Date:
2017-11-22
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1420915
Report Number(s):
FERMILAB-CONF-16-754-CD
1638488
DOE Contract Number:  
AC02-07CH11359
Resource Type:
Conference
Resource Relation:
Journal Name: J.Phys.Conf.Ser.; Journal Volume: 898; Journal Issue: 5; Conference: 22nd International Conference on Computing in High Energy and Nuclear Physics, San Francisco, CA, 10/10-10/14/2016
Country of Publication:
United States
Language:
English

Citation Formats

Balcas, J., Bockelman, B., Hufnagel, D., Hurtado Anampa, K., Aftab Khan, F., Larson, K., Letts, J., Marra da Silva, J., Mascheroni, M., Mason, D., Perez-Calero Yzquierdo, A., and Tiradani, A. Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/5/052031.
Balcas, J., Bockelman, B., Hufnagel, D., Hurtado Anampa, K., Aftab Khan, F., Larson, K., Letts, J., Marra da Silva, J., Mascheroni, M., Mason, D., Perez-Calero Yzquierdo, A., & Tiradani, A. (2017). Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits. United States. doi:10.1088/1742-6596/898/5/052031.
Balcas, J., Bockelman, B., Hufnagel, D., Hurtado Anampa, K., Aftab Khan, F., Larson, K., Letts, J., Marra da Silva, J., Mascheroni, M., Mason, D., Perez-Calero Yzquierdo, A., and Tiradani, A. 2017. "Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits". United States. doi:10.1088/1742-6596/898/5/052031. https://www.osti.gov/servlets/purl/1420915.
@article{osti_1420915,
title = {Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits},
author = {Balcas, J. and Bockelman, B. and Hufnagel, D. and Hurtado Anampa, K. and Aftab Khan, F. and Larson, K. and Letts, J. and Marra da Silva, J. and Mascheroni, M. and Mason, D. and Perez-Calero Yzquierdo, A. and Tiradani, A.},
abstractNote = {The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. The total resources at Tier-1 and Tier-2 grid sites pledged to CMS exceed 100,000 CPU cores, while another 50,000 to 100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of stably running at higher and higher scales while introducing new modes of operation such as multi-core pilots, as well as the chaotic nature of physics analysis workflows, places huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the CMS Global Pool has faced since the beginning of the LHC Run II and how they were overcome.},
doi = {10.1088/1742-6596/898/5/052031},
journal = {J.Phys.Conf.Ser.},
number = 5,
volume = 898,
place = {United States},
year = {2017},
month = {11}
}

