Title: Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

Abstract

High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. A potential solution to this problem is to use multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production system context. To this end we have designed a system, called the Scalable Checkpoint/Restart (SCR) library, that writes checkpoints to storage on the compute nodes utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.

Authors:
Moody, A T; Bronevetsky, G; Mohror, K M; de Supinski, B R
Publication Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
984082
Report Number(s):
LLNL-TR-440491
TRN: US201015%%1052
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS; DESIGN; EFFICIENCY; EVALUATION; PERFORMANCE; PRODUCTION; RELIABILITY; SIMULATION; STORAGE

Citation Formats

Moody, A T, Bronevetsky, G, Mohror, K M, and de Supinski, B R. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. United States: N. p., 2010. Web. doi:10.2172/984082.
Moody, A T, Bronevetsky, G, Mohror, K M, & de Supinski, B R. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. United States. https://doi.org/10.2172/984082
Moody, A T, Bronevetsky, G, Mohror, K M, and de Supinski, B R. 2010. "Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System". United States. https://doi.org/10.2172/984082. https://www.osti.gov/servlets/purl/984082.
@article{osti_984082,
title = {Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System},
author = {Moody, A T and Bronevetsky, G and Mohror, K M and de Supinski, B R},
abstractNote = {High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. A potential solution to this problem is to use multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production system context. To this end we have designed a system, called the Scalable Checkpoint/Restart (SCR) library, that writes checkpoints to storage on the compute nodes utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.},
doi = {10.2172/984082},
url = {https://www.osti.gov/biblio/984082},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = {4}
}