Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

Di, Sheng; Robert, Yves; Vivien, Frederic; Cappello, Franck

doi:10.1109/TPDS.2016.2546248

Journal Article · March 29, 2016 · IEEE Transactions on Parallel and Distributed Systems

Di, Sheng [1]; Robert, Yves [2]; Vivien, Frederic [3]; Cappello, Franck [1]

[1] Argonne National Lab. (ANL), Chicago, IL (United States)
[2] Lab. LIP, CNRS, ENS Lyon, INRIA, and UCB Lyon, Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
[3] Lab. LIP, CNRS, ENS Lyon, INRIA, and UCB Lyon, Lyon (France)

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads, such as transient memory errors, while checkpoint level 2 deals with hardware crashes, such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution that does not require knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed built from a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3% over that of other state-of-the-art approaches. Lastly, a brute-force comparison with all possible patterns shows that our solution is always within 1% of the best pattern in the experiments.

Research Organization: Argonne National Laboratory (ANL)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

Grant/Contract Number: AC02-06CH11357

OSTI ID: 1346727

Journal Information: IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 1; ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Cited By (2)

On the modelling of optimal coordinated checkpoint period in supercomputers

Moríñigo, José A.; Rodríguez-Pascual, Manuel; Mayo-García, Rafael · The Journal of Supercomputing, Vol. 75, Issue 2 · https://doi.org/10.1007/s11227-018-2621-1 · Journal · September 2018

Compression Challenges in Large Scale Partial Differential Equation Solvers

Götschel, Sebastian; Weiser, Martin · Algorithms, Vol. 12, Issue 9 · https://doi.org/10.3390/a12090197 · Journal · September 2019

Similar Records

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

Conference · Tue Dec 31 23:00:00 EST 2019 · OSTI ID:1770373

Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms

Conference · Fri Nov 30 23:00:00 EST 2018 · OSTI ID:1493144

Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems

Conference · Tue Dec 31 23:00:00 EST 2013 · OSTI ID:1130431

Related Subjects

97 MATHEMATICS AND COMPUTING
Fault Tolerance
High-Performance Computing
Multilevel Checkpoint
Optimization
