U.S. Department of Energy
Office of Scientific and Technical Information

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

Journal Article · Future Generations Computer Systems
 [1];  [2];  [3];  [4];  [5];  [6]
  1. Barcelona Supercomputing Center (BSC) (Spain)
  2. École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Institut Universitaire de France (IUF) (France)
  3. Argonne National Laboratory (ANL), Argonne, IL (United States)
  4. Univ. of Tennessee, Knoxville, TN (United States)
  5. École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
  6. Univ. of Kansas, Lawrence, KS (United States)
The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. Here, we provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication.
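As a rough illustration of the formula the abstract refers to, the sketch below assumes its commonly quoted first-order form for fail-stop errors, W = sqrt(2 · μ · C), where μ is the platform mean time between failures and C is the time to write one checkpoint. The abstract itself does not state the formula; the function names, the first-order overhead model C/W + W/(2μ), and the numeric values are illustrative assumptions, not taken from the article.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Assumed first-order Young/Daly period: W = sqrt(2 * mu * C).

    mtbf            -- platform mean time between failures (mu), in seconds
    checkpoint_cost -- time to write one checkpoint (C), in seconds
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_overhead(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """Assumed first-order overhead model: C/W + W/(2 * mu).

    The first term is the fraction of time spent checkpointing; the second
    is the expected fraction of work lost to failures (half a period on average).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical values: one failure per day, 10-minute checkpoints.
mtbf = 86_400.0
checkpoint_cost = 600.0

period = young_daly_period(mtbf, checkpoint_cost)
overhead = first_order_overhead(period, mtbf, checkpoint_cost)

print(f"period between checkpoints: {period:.0f} s")
print(f"first-order overhead at that period: {overhead:.3f}")
```

Under this model, checkpointing more or less often than the computed period increases the overhead, which is why the formula is used as a default. The scenarios listed in the abstract (workflows, uncertainties, silent errors) are cases where these assumptions may not hold.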
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2406527
Journal Information:
Journal Name: Future Generations Computer Systems; Vol. 161; ISSN 0167-739X
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
