A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

Bautista-Gomez, Leonardo; Benoit, Anne; Di, Sheng; Herault, Thomas; Robert, Yves; Sun, Hongyang

doi:10.1016/j.future.2024.07.022

Journal Article · July 18, 2024 · Future Generation Computer Systems

DOI: https://doi.org/10.1016/j.future.2024.07.022 · OSTI ID: 2406527

Bautista-Gomez, Leonardo ^[1]; Benoit, Anne ^[2]; Di, Sheng ^[3]; Herault, Thomas ^[4]; Robert, Yves ^[5]; Sun, Hongyang ^[6]

[1] Barcelona Supercomputing Center (BSC) (Spain)
[2] École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Institut Universitaire de France (IUF) (France)
[3] Argonne National Laboratory (ANL), Argonne, IL (United States)
[4] Univ. of Tennessee, Knoxville, TN (United States)
[5] École Normale Supérieure de Lyon (ENS de Lyon) (France); National Institute for Research in Digital Science and Technology (Inria), Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
[6] Univ. of Kansas, Lawrence, KS (United States)

The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. Here, we provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication.
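For readers unfamiliar with the formula named in the abstract, the following is a minimal sketch of its usual first-order form, which this record itself does not spell out. It assumes the standard statement W = sqrt(2 · μ · C), where μ is the platform mean time between failures and C is the time to write one checkpoint; the function and parameter names are illustrative, not taken from the paper, and conventions differ on whether the period includes the checkpoint itself.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly period: sqrt(2 * mtbf * checkpoint_cost).

    Both arguments must use the same time unit (for example, seconds).
    Assumption: fail-stop errors and a checkpoint cost much smaller than the MTBF.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_overhead(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """First-order wasted-time fraction: C / W + W / (2 * mtbf).

    The first term is time spent checkpointing; the second is the expected
    re-execution after a failure (half a period lost on average).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    mtbf = 24 * 3600.0       # hypothetical platform MTBF: one day, in seconds
    checkpoint_cost = 600.0  # hypothetical checkpoint time: ten minutes
    period = young_daly_period(mtbf, checkpoint_cost)
    print(f"period: {period:.0f} s")
    print(f"overhead: {first_order_overhead(period, mtbf, checkpoint_cost):.3f}")
```

Under this first-order model, checkpointing more often than the computed period pays more in checkpoint time, and checkpointing less often pays more in lost work. The survey's point is that several scenarios call for a period of a different order than this one.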

Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States)

Sponsoring Organization: National Science Foundation (NSF); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC02-06CH11357

OSTI ID: 2406527

Journal Information: Future Generation Computer Systems, Vol. 161; ISSN 0167-739X

Publisher: Elsevier

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
Checkpointing
Optimal period
Resilience
Young/Daly formula
