U.S. Department of Energy
Office of Scientific and Technical Information

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

Conference
With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to improving checkpointing efficiency. However, multi-level checkpoint/restart can generate enormous I/O traffic on HPC systems, so using it efficiently requires optimizing the checkpoint/restart configuration. Current approaches, namely modeling and simulation, are either inaccurate or slow at determining the optimal configuration for a large-scale system. In this paper, we show that machine learning models can be combined with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve performance in optimizing checkpoint configurations.
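The idea of pairing a simulator with a learned model can be illustrated with a minimal Python sketch. It is not the paper's method: it assumes a single-level, blocking checkpoint scheme rather than asynchronous multi-level checkpointing, and it uses a small least-squares regression as a stand-in for the machine learning models the abstract mentions. The function names (`simulate`, `fit_surrogate`) and all parameter values (checkpoint cost, restart cost, mean time between failures, total work) are hypothetical.

```python
import random

def simulate(interval, ckpt_cost=5.0, restart_cost=10.0,
             mtbf=500.0, work=2000.0, seed=0):
    """Toy single-level checkpoint/restart run; returns total wall time.

    Failures arrive with exponentially distributed gaps (mean = mtbf).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    done = 0.0    # work already protected by a checkpoint
    clock = 0.0   # elapsed wall time
    next_failure = rng.expovariate(1.0 / mtbf)
    while done < work:
        segment = min(interval, work - done)
        cost = segment + ckpt_cost
        if clock + cost <= next_failure:
            # Segment and its checkpoint complete before the next failure.
            clock += cost
            done += segment
        else:
            # Failure mid-segment: unsaved progress is lost, pay the restart.
            clock = next_failure + restart_cost
            next_failure = clock + rng.expovariate(1.0 / mtbf)
    return clock

def fit_surrogate(xs, ys):
    """Least-squares fit of y ~ a + b*x + c/x via the 3x3 normal equations."""
    rows = [[1.0, x, 1.0 / x] for x in xs]
    # Augmented matrix [A^T A | A^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# A few expensive simulation points, each averaged over many random seeds.
train_intervals = [10, 20, 40, 80, 160, 320]
n_runs = 200
mean_times = [sum(simulate(t, seed=s) for s in range(n_runs)) / n_runs
              for t in train_intervals]

# The cheap surrogate then ranks every candidate interval.
a, b, c = fit_surrogate(train_intervals, mean_times)
candidates = range(10, 321)
best_interval = min(candidates, key=lambda t: a + b * t + c / t)
print("surrogate-selected checkpoint interval:", best_interval)
```

The sketch only shows the division of labor: the simulator supplies a handful of accurate but costly data points, and the fitted model evaluates the full candidate range cheaply. A multi-level setting would have one interval per storage level, which is where a more expressive model than this three-parameter fit becomes useful.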
Research Organization:
Argonne National Laboratory (ANL)
Sponsoring Organization:
U.S. Department of Energy (Office not specified); USDOE Office of Science; National Science Foundation (NSF)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1770373
Country of Publication:
United States
Language:
English

