Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System

Mohror, Kathryn; Moody, Adam; Bronevetsky, Greg; de Supinski, Bronis R.

doi:10.1109/TPDS.2013.100

Journal Article · September 1, 2014 · IEEE Transactions on Parallel and Distributed Systems

DOI: https://doi.org/10.1109/TPDS.2013.100 · OSTI ID: 1225695

Mohror, Kathryn [1]; Moody, Adam [1]; Bronevetsky, Greg [1]; de Supinski, Bronis R. [1]

[1] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. At scale, however, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run: lightweight checkpoints handle the most common failure modes, while more expensive checkpoints handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent and reduce the load on the parallel file system by a factor of two. In addition, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20× on today's systems and still maintain high application efficiency.
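
The multilevel idea described in the abstract can be illustrated with a small sketch. This is not the SCR API and not the paper's Markov model. It only shows how mixing a cheap checkpoint level with an expensive one changes the total time spent writing checkpoints. The `Level` class, the `pick_level` and `total_cost` helpers, the level names, and all costs and intervals are hypothetical values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str      # label for the storage tier (hypothetical)
    cost_s: float  # assumed seconds to write one checkpoint at this level
    every: int     # this level is due on every `every`-th checkpoint


def pick_level(index: int, levels: list[Level]) -> Level:
    """Return the most expensive level due at checkpoint `index` (1-based).

    `levels` is ordered from cheapest to most expensive. The cheapest
    level should use every=1 so that some level is always due.
    """
    due = [lvl for lvl in levels if index % lvl.every == 0]
    if not due:
        raise ValueError("no level is due; give the cheapest level every=1")
    return due[-1]


def total_cost(n_checkpoints: int, levels: list[Level]) -> float:
    """Total seconds spent writing `n_checkpoints` checkpoints."""
    return sum(pick_level(i, levels).cost_s for i in range(1, n_checkpoints + 1))


# Illustrative numbers only, not measurements from the paper.
multilevel = [
    Level("node-local", cost_s=1.0, every=1),
    Level("parallel-fs", cost_s=50.0, every=10),
]
single_level = [Level("parallel-fs", cost_s=50.0, every=1)]

if __name__ == "__main__":
    print(total_cost(100, multilevel))    # two levels: mostly cheap writes
    print(total_cost(100, single_level))  # every write goes to the parallel FS
```

Under these assumed numbers, only one checkpoint in ten reaches the parallel file system, so both the write time and the file system load fall. The sketch ignores failures and restart costs, which are what the paper's probabilistic models account for.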

Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC52-07NA27344

OSTI ID: 1225695

Report Number(s): LLNL-JRNL-564721

Journal Information: IEEE Transactions on Parallel and Distributed Systems, Vol. 25, Issue 9; ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Similar Records

Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

Technical Report · April 9, 2010 · OSTI ID: 1225695

Moody, A T; Bronevetsky, G; Mohror, K M; +1 more

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report · February 21, 2022 · OSTI ID: 1225695

Dai, Donglai

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Conference · March 20, 2012 · OSTI ID: 1225695

Mohror, K; Moody, A; de Supinski, B R

Related Subjects

97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
fault tolerance
measurement
evaluation
modeling
simulation of multiple-processor systems
