A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
Journal Article · Future Generation Computer Systems
- Barcelona Supercomputing Center (BSC) (Spain)
- École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Institut Universitaire de France (IUF) (France)
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Univ. of Tennessee, Knoxville, TN (United States)
- École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
- Univ. of Kansas, Lawrence, KS (United States)
The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. Here, we provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication.
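For orientation, the abstract does not state the formula itself. In its commonly cited first-order form, the Young/Daly period is W ≈ sqrt(2 μ C), where μ is the platform mean time between failures (MTBF) and C is the time to take one checkpoint. The sketch below is a minimal illustration of that form only, assuming C is much smaller than μ; the function name and the sample platform values are hypothetical and are not taken from the article.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Return the first-order Young/Daly period sqrt(2 * mu * C), in seconds.

    checkpoint_cost_s: time to write one checkpoint (C), in seconds.
    mtbf_s: platform mean time between failures (mu), in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * mtbf_s * checkpoint_cost_s)


# Hypothetical platform: 60 s checkpoints, 24 h platform MTBF.
period = young_daly_period(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {period:.0f} s")
```

With these assumed values the period comes out near 3,220 seconds, a little under an hour. The cases surveyed in the article where the optimal period is of a different order are exactly those where this first-order form should not be applied blindly.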
- Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Organization: National Science Foundation (NSF); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number: AC02-06CH11357
- OSTI ID: 2406527
- Journal Information: Future Generation Computer Systems, Vol. 161; ISSN 0167-739X
- Publisher: Elsevier
- Country of Publication: United States
- Language: English
Similar Records
Coping with silent and fail-stop errors at scale by combining replication and checkpointing
Journal Article · Fri Nov 30 23:00:00 EST 2018 · Journal of Parallel and Distributed Computing · OSTI ID: 1475194
Checkpointing Strategies for Shared High-Performance Computing Platforms
Journal Article · Mon Dec 31 19:00:00 EST 2018 · International Journal of Networking and Computing · OSTI ID: 1492861
Toward an optimal online checkpoint solution under a two-level HPC checkpoint model
Journal Article · Mon Mar 28 20:00:00 EDT 2016 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID: 1346727