DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

Abstract

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution without requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed based on a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve the performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3% over that of other state-of-the-art approaches. Lastly, a brute-force comparison with all possible patterns shows that our solution is always within 1% of the best pattern in the experiments.
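
To make the two-level idea in the abstract concrete, the following is a minimal Python sketch of a periodic checkpoint pattern. It is not the paper's algorithm or its optimization formulas. The function name `simulate_two_level` and all parameters (`w`, `k`, `c1`, `c2`, `r1`, `r2`, `lam1`, `lam2`) are hypothetical, and the sketch assumes exponentially distributed errors, that a level-2 checkpoint also protects against level-1 errors, and that no error strikes during recovery.

```python
import math
import random


def simulate_two_level(work, w, k, c1, c2, r1, r2, lam1, lam2, seed=0):
    """Return the simulated wall-clock time of a job needing `work` seconds.

    Assumed pattern (illustrative only): compute segments of length `w`,
    each followed by a level-1 checkpoint of cost `c1`, except that every
    k-th segment is followed by a level-2 checkpoint of cost `c2`.
    Level-1 errors (rate `lam1`) roll back to the latest checkpoint of
    either level and cost `r1`; level-2 failures (rate `lam2`) roll back
    to the latest level-2 checkpoint and cost `r2`.
    """
    rng = random.Random(seed)

    def next_event(rate):
        # A rate of zero means this kind of error never occurs.
        return rng.expovariate(rate) if rate > 0 else math.inf

    elapsed = 0.0
    saved_any = 0.0   # work protected by the latest checkpoint of any level
    saved_l2 = 0.0    # work protected by the latest level-2 checkpoint
    since_l2 = 0      # segments completed since the last level-2 checkpoint

    while saved_any < work:
        segment = min(w, work - saved_any)
        is_level2 = (since_l2 + 1 == k)
        cost = segment + (c2 if is_level2 else c1)

        # Exponential arrivals are memoryless, so resampling per attempt is valid.
        t1, t2 = next_event(lam1), next_event(lam2)

        if min(t1, t2) >= cost:
            # Segment and its checkpoint both complete without error.
            elapsed += cost
            saved_any += segment
            if is_level2:
                saved_l2 = saved_any
                since_l2 = 0
            else:
                since_l2 += 1
        elif t2 < t1:
            # Node failure: lose everything since the last level-2 checkpoint.
            elapsed += t2 + r2
            saved_any = saved_l2
            since_l2 = 0
        else:
            # Transient error: redo only the current segment.
            elapsed += t1 + r1

    return elapsed
```

With both error rates set to zero the result is simply the work plus the checkpoint costs, which gives a baseline for comparing patterns with different values of `w` and `k` under nonzero rates.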

Authors:
Di, Sheng [1]; Robert, Yves [2]; Vivien, Frederic [3]; Cappello, Franck [1]
  1. Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Lab. LIP, CNRS, ENS Lyon, INRIA, and UCB Lyon, Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
  3. Lab. LIP, CNRS, ENS Lyon, INRIA, and UCB Lyon, Lyon (France)
Publication Date:
2016-03-29
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1346727
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 28; Journal Issue: 1; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Optimization; Fault Tolerance; High-Performance Computing; Multilevel Checkpoint

Citation Formats

Di, Sheng, Robert, Yves, Vivien, Frederic, and Cappello, Franck. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. United States: N. p., 2016. Web. doi:10.1109/TPDS.2016.2546248.
Di, Sheng, Robert, Yves, Vivien, Frederic, & Cappello, Franck. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. United States. https://doi.org/10.1109/TPDS.2016.2546248
Di, Sheng, Robert, Yves, Vivien, Frederic, and Cappello, Franck. 2016. "Toward an optimal online checkpoint solution under a two-level HPC checkpoint model". United States. https://doi.org/10.1109/TPDS.2016.2546248. https://www.osti.gov/servlets/purl/1346727.
@article{osti_1346727,
title = {Toward an optimal online checkpoint solution under a two-level HPC checkpoint model},
author = {Di, Sheng and Robert, Yves and Vivien, Frederic and Cappello, Franck},
abstractNote = {The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution without requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed based on a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve the performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3% over that of other state-of-the-art approaches. Lastly, a brute-force comparison with all possible patterns shows that our solution is always within 1% of the best pattern in the experiments.},
doi = {10.1109/TPDS.2016.2546248},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 1,
volume = 28,
place = {United States},
year = {2016},
month = {3}
}

Citation Metrics:
Cited by: 30 works

Works referencing / citing this record:

On the modelling of optimal coordinated checkpoint period in supercomputers
journal, September 2018

  • Moríñigo, José A.; Rodríguez-Pascual, Manuel; Mayo-García, Rafael
  • The Journal of Supercomputing, Vol. 75, Issue 2
  • DOI: 10.1007/s11227-018-2621-1

Compression Challenges in Large Scale Partial Differential Equation Solvers
journal, September 2019

  • Götschel, Sebastian; Weiser, Martin
  • Algorithms, Vol. 12, Issue 9
  • DOI: 10.3390/a12090197