Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms

Cameron, D.; Elmsheuser, J.; Heinrich, L.; Lavrijsen, W.; Nilsson, P.; Tsulaia, V.; Vogel, M.

doi:10.1088/1742-6596/1085/3/032028

Abstract

Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and condition data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of the production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job from tens of minutes down to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular on platforms different from the one on which the checkpoint image was created. We will describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of the ATLAS data simulation, reconstruction and analysis framework Athena) and the usage of these images for running ATLAS Simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.
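
As a rough illustration of why initialization time matters on short-lived resources, the following Python sketch compares the useful fraction of a job slot with and without a checkpoint image. The 15-minute and 5-second initialization times follow the ranges quoted in the abstract; the 30-minute slot length, the function name `useful_fraction`, and the simplified model (all time after initialization counts as useful event processing) are assumptions made for this example, not figures or definitions from the paper.

```python
def useful_fraction(init_seconds: float, slot_seconds: float) -> float:
    """Fraction of a job slot left for event processing after initialization.

    Simplified model (an assumption of this sketch): every second that is
    not spent initializing is spent on useful work.
    """
    if slot_seconds <= 0:
        raise ValueError("slot_seconds must be positive")
    if not 0 <= init_seconds <= slot_seconds:
        raise ValueError("init_seconds must lie within the slot")
    return (slot_seconds - init_seconds) / slot_seconds


# Hypothetical short-lived slot of 30 minutes on an opportunistic resource.
SLOT_SECONDS = 30 * 60

# Full initialization (upper end of the 10-15 minute range in the abstract).
cold_start = useful_fraction(init_seconds=15 * 60, slot_seconds=SLOT_SECONDS)

# Restart from a checkpoint image ("a few seconds"; 5 s assumed here).
checkpoint_restart = useful_fraction(init_seconds=5, slot_seconds=SLOT_SECONDS)

print(f"cold start:         {cold_start:.1%}")          # 50.0%
print(f"checkpoint restart: {checkpoint_restart:.1%}")  # 99.7%
```

Under these assumed numbers, the shorter the slot, the larger the share of it that a full initialization consumes, which is why the effect is most pronounced on opportunistic platforms.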

Authors: Cameron, D.; Elmsheuser, J.; Heinrich, L.; Lavrijsen, W.; Nilsson, P.; Tsulaia, V.; Vogel, M.

Publication Date: 2018-09-01

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)

Sponsoring Org.: USDOE Office of Science (SC)

Contributing Org.: ATLAS Collaboration

OSTI Identifier: 1544174

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: Journal of Physics. Conference Series

Additional Journal Information: Journal Volume: 1085; Journal ID: ISSN 1742-6588

Publisher: IOP Publishing

Country of Publication: United States

Language: English

Subject: 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING

Citation Formats


Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., and Vogel, M. Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms. United States: N. p., 2018. Web. doi:10.1088/1742-6596/1085/3/032028.


Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., & Vogel, M. (2018). Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms. United States. https://doi.org/10.1088/1742-6596/1085/3/032028


Cameron, D., Elmsheuser, J., Heinrich, L., Lavrijsen, W., Nilsson, P., Tsulaia, V., and Vogel, M. 2018. "Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms". United States. https://doi.org/10.1088/1742-6596/1085/3/032028. https://www.osti.gov/servlets/purl/1544174.


                    
@article{osti_1544174,
  title        = {Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms},
  author       = {Cameron, D. and Elmsheuser, J. and Heinrich, L. and Lavrijsen, W. and Nilsson, P. and Tsulaia, V. and Vogel, M.},
  abstractNote = {Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and condition data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of the production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job from tens of minutes down to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular on platforms different from the one on which the checkpoint image was created. We will describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of the ATLAS data simulation, reconstruction and analysis framework Athena) and the usage of these images for running ATLAS Simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.},
  doi          = {10.1088/1742-6596/1085/3/032028},
  journal      = {Journal of Physics. Conference Series},
  volume       = {1085},
  place        = {United States},
  year         = {2018},
  month        = sep
}

Free Publicly Available Full Text: Accepted Manuscript (DOE)

Publisher's Version of Record: https://doi.org/10.1088/1742-6596/1085/3/032028

Works referenced in this record:

Geant4—a simulation toolkit
journal, July 2003

Agostinelli, S.; Allison, J.; Amako, K.
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 506, Issue 3
DOI: 10.1016/S0168-9002(03)01368-8

Use of checkpoint-restart for complex HEP software on traditional architectures and Intel MIC
journal, June 2014

Arya, Kapil; Cooperman, Gene; Dotti, Andrea
Journal of Physics: Conference Series, Vol. 523
DOI: 10.1088/1742-6596/523/1/012015

LHC Machine
journal, August 2008

Evans, Lyndon; Bryant, Philip
Journal of Instrumentation, Vol. 3, Issue 08
DOI: 10.1088/1748-0221/3/08/S08001

The ATLAS Simulation Infrastructure
journal, September 2010

Aad, G.; Abbott, B.; Abdallah, J.
The European Physical Journal C, Vol. 70, Issue 3
DOI: 10.1140/epjc/s10052-010-1429-9

ATLAS@Home: Harnessing Volunteer Computing for HEP
journal, December 2015

Adam-Bourdarios, C.; Cameron, D.; Filipčič, A.
Journal of Physics: Conference Series, Vol. 664, Issue 2
DOI: 10.1088/1742-6596/664/2/022009

Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)
journal, December 2015

Calafiura, Paolo; Leggett, Charles; Seuster, Rolf
Journal of Physics: Conference Series, Vol. 664, Issue 7
DOI: 10.1088/1742-6596/664/7/072050

Similar Records in DOE PAGES and OSTI.GOV collections:

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report: Dai, Donglai

As the field of High-Performance Computing (HPC) heads towards exascale with modern processing, networking and storage technologies, it is increasingly important to provide support for fast I/O operations and scalable checkpoint-restart for users of these systems. Fast I/O support is critical for applications handling large-scale data and for visualizing the results. Checkpoint-restart enables users to tolerate failures in the underlying commodity components (processors, memory, interconnect, and storage) of HPC systems and run applications on a continuous basis without productivity loss. The Scalable Checkpoint-Restart (SCR) project, funded by DOE and developed by researchers from the Lawrence Livermore National Laboratory (LLNL), has ...

SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1.

Technical Report: Oldfield, Ron A.; Ulmer, Craig D.; Widener, Patrick; ...

Recent high-performance computing (HPC) platforms such as the Trinity Advanced Technology System (ATS-1) feature burst buffer resources that can have a dramatic impact on an application's I/O performance. While these non-volatile memory (NVM) resources provide a new tier in the storage hierarchy, developers must find the right way to incorporate the technology into their applications in order to reap the benefits. Similar to other laboratories, Sandia is actively investigating ways in which these resources can be incorporated into our existing libraries and workflows without burdening our application developers with excessive, platform-specific details. This FY18Q1 milestone summarizes our progress in adapting ...
https://doi.org/10.2172/1417577

The Scalable Checkpoint/Restart Library

Software: Moody, A.; USDOE

The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes.
https://doi.org/10.11578/dc.20171025.1160

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Conference: Mohror, K; Moody, A; de Supinski, B R

Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel ...
https://doi.org/10.1109/DSNW.2012.6264668

Berkeley lab checkpoint/restart (BLCR) for Linux clusters

Journal Article: Hargrove, Paul H.; Duell, Jason C. - Journal of Physics. Conference Series

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to fault precursors (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance ...
https://doi.org/10.1088/1742-6596/46/1/067
