DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms

Abstract

Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and condition data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of the production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job from tens of minutes down to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular on platforms different from the one on which the checkpoint image was created. We describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of Athena, the ATLAS data simulation, reconstruction and analysis framework) and the use of these images for running ATLAS Simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.
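
The effect of initialization time on CPU efficiency described in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: the function names, the 5-second restart time (standing in for "a few seconds"), and the slot lengths are all illustrative assumptions, and the model ignores multi-process effects by treating efficiency as the fraction of a resource slot spent processing events rather than initializing.

```python
def cpu_efficiency(init_seconds: float, processing_seconds: float) -> float:
    """Fraction of the job's wall time spent processing events, not initializing."""
    total = init_seconds + processing_seconds
    if total <= 0:
        raise ValueError("total job time must be positive")
    return processing_seconds / total


# Illustrative assumptions, not measurements from the paper.
COLD_INIT_S = 15 * 60   # upper end of the 10-15 minute range quoted in the abstract
RESTART_INIT_S = 5      # assumed value for "a few seconds" after restart from an image
SLOT_LENGTHS_S = [30 * 60, 60 * 60, 6 * 60 * 60]  # hypothetical resource slot lengths


def compare(slot_seconds: float) -> tuple[float, float]:
    """Return (cold-start efficiency, checkpoint-restart efficiency) for one slot."""
    cold = cpu_efficiency(COLD_INIT_S, slot_seconds - COLD_INIT_S)
    restart = cpu_efficiency(RESTART_INIT_S, slot_seconds - RESTART_INIT_S)
    return cold, restart


if __name__ == "__main__":
    for slot in SLOT_LENGTHS_S:
        cold, restart = compare(slot)
        print(f"slot {slot / 60:5.0f} min: cold start {cold:.1%}, restart {restart:.1%}")
```

Under these assumptions the gain is largest for short slots, which is consistent with the abstract's emphasis on short-lived opportunistic resources.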

Authors:
Cameron, D.; Elmsheuser, J.; Heinrich, L.; Lavrijsen, W.; Nilsson, P.; Tsulaia, V.; Vogel, M.
Publication Date:
2018-09-01
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Org.:
USDOE Office of Science (SC)
Contributing Org.:
ATLAS Collaboration
OSTI Identifier:
1544174
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 1085; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING

Citation Formats

Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., and Vogel, M. Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms. United States: N. p., 2018. Web. doi:10.1088/1742-6596/1085/3/032028.
Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., & Vogel, M. Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms. United States. https://doi.org/10.1088/1742-6596/1085/3/032028
Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., and Vogel, M. 2018. "Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms". United States. https://doi.org/10.1088/1742-6596/1085/3/032028. https://www.osti.gov/servlets/purl/1544174.
@article{osti_1544174,
title = {Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms},
author = {Cameron, D. and Elmsheuser, J. and Heinrich, L. and Lavrijsen, W. and Nilsson, P. and Tsulaia, V. and Vogel, M.},
abstractNote = {Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and condition data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of the production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job from tens of minutes down to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular on platforms different from the one on which the checkpoint image was created. We describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of Athena, the ATLAS data simulation, reconstruction and analysis framework) and the use of these images for running ATLAS Simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.},
doi = {10.1088/1742-6596/1085/3/032028},
journal = {Journal of Physics. Conference Series},
number = {3},
volume = 1085,
place = {United States},
year = {2018},
month = {9}
}

Works referenced in this record:

Geant4—a simulation toolkit
journal, July 2003

  • Agostinelli, S.; Allison, J.; Amako, K.
  • Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 506, Issue 3
  • DOI: 10.1016/S0168-9002(03)01368-8

Use of checkpoint-restart for complex HEP software on traditional architectures and Intel MIC
journal, June 2014


LHC Machine
journal, August 2008


The ATLAS Simulation Infrastructure
journal, September 2010


ATLAS@Home: Harnessing Volunteer Computing for HEP
journal, December 2015


Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)
journal, December 2015