U.S. Department of Energy
Office of Scientific and Technical Information

The Scalable Checkpoint/Restart Library

Software · DOI: https://doi.org/10.11578/dc.20171025.1160 · OSTI ID: code-1155 · Code ID: 1155

The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. It thereby addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system: limited bandwidth and file system contention. On current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
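
The benchmark figures above can be turned into a rough per-node estimate. The following is a minimal sketch, not part of SCR: the function names are hypothetical, and the projection assumes the linear scaling described above continues to hold at other node counts.

```python
# Figures quoted in the abstract above.
AGGREGATE_GB_PER_S = 720.0  # benchmarked aggregate checkpoint bandwidth
NODES = 1094                # Atlas nodes used in that benchmark


def per_node_bandwidth(aggregate_gb_s: float, nodes: int) -> float:
    """Average bandwidth contributed by each node's local storage."""
    return aggregate_gb_s / nodes


def projected_aggregate(per_node_gb_s: float, nodes: int) -> float:
    """Aggregate bandwidth at another node count.

    Assumption: bandwidth scales linearly with the number of nodes,
    as the abstract states for current platforms.
    """
    return per_node_gb_s * nodes


per_node = per_node_bandwidth(AGGREGATE_GB_PER_S, NODES)
print(f"per-node bandwidth: {per_node:.3f} GB/s")

# "Nearly two orders of magnitude" means a speedup somewhat below 100x,
# so aggregate / 100 is only a lower bound on the parallel file system
# bandwidth implied by the abstract, not a measured value.
pfs_lower_bound = AGGREGATE_GB_PER_S / 100
print(f"implied parallel file system bandwidth: above {pfs_lower_bound:.1f} GB/s")
```

Each node contributes well under 1 GB/s, so the aggregate figure comes from the number of nodes writing in parallel to their own storage rather than from any single fast device.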

Short Name / Acronym:
SCR
Site Accession Number:
4349
Software Type:
Scientific
License(s):
Other (Commercial or Open-Source)
Research Organization:
Lawrence Livermore National Laboratory
Sponsoring Organization:
USDOE

Primary Award/Contract Number:
AC52-07NA27344
DOE Contract Number:
AC52-07NA27344
Code ID:
1155
OSTI ID:
code-1155
Country of Origin:
United States

Similar Records

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library
Conference · March 20, 2012 · OSTI ID:1047769

Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System
Technical Report · April 9, 2010 · OSTI ID:984082

Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Journal Article · September 1, 2014 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1225695

Related Subjects

ECP