Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Abstract
High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. But, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Moreover, lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent, and reduce the load on the parallel file system by a factor of two. In addition, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20 × on today's systems and still maintain high application efficiency.
- Authors:
- Mohror, Kathryn; Moody, Adam; Bronevetsky, Greg; de Supinski, Bronis R.
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Publication Date:
- 2014-09-01
- Research Org.:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1225695
- Report Number(s):
- LLNL-JRNL-564721; Journal ID: ISSN 1045-9219
- DOE Contract Number:
- AC52-07NA27344
- Resource Type:
- Journal Article
- Journal Name:
- IEEE Transactions on Parallel and Distributed Systems
- Additional Journal Information:
- Journal Volume: 25; Journal Issue: 9; Journal ID: ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; fault tolerance; measurement; evaluation; modeling; simulation of multiple-processor systems
Citation Formats
Mohror, Kathryn, Moody, Adam, Bronevetsky, Greg, and de Supinski, Bronis R. Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System. United States: N. p., 2014. Web. doi:10.1109/TPDS.2013.100.
Mohror, Kathryn, Moody, Adam, Bronevetsky, Greg, & de Supinski, Bronis R. Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System. United States. https://doi.org/10.1109/TPDS.2013.100
Mohror, Kathryn, Moody, Adam, Bronevetsky, Greg, and de Supinski, Bronis R. 2014. "Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System". United States. https://doi.org/10.1109/TPDS.2013.100. https://www.osti.gov/servlets/purl/1225695.
@article{osti_1225695,
title = {Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System},
author = {Mohror, Kathryn and Moody, Adam and Bronevetsky, Greg and de Supinski, Bronis R.},
abstractNote = {High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. But, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Moreover, lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent, and reduce the load on the parallel file system by a factor of two. In addition, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20 × on today's systems and still maintain high application efficiency.},
doi = {10.1109/TPDS.2013.100},
url = {https://www.osti.gov/biblio/1225695},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = 9,
volume = 25,
place = {United States},
year = {2014},
month = sep
}