The Scalable Checkpoint/Restart Library
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. It thereby addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely limited bandwidth and file system contention. On current platforms, SCR's checkpoint bandwidth scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
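The node-local caching idea described above can be illustrated with a minimal Python sketch. This is not SCR's interface (SCR is a library that applications link against). Every name here (`checkpoint_path`, `write_checkpoint`, `read_latest_checkpoint`, `per_node_bandwidth`, and the file naming scheme) is a hypothetical stand-in chosen for illustration. The sketch assumes each process writes its own checkpoint file into a directory on node-local storage and, on restart, reads back the newest one. The last helper shows the arithmetic behind linear scaling under the assumption that the aggregate bandwidth is spread evenly across nodes.

```python
import os
import pickle
import tempfile


def checkpoint_path(cache_dir, rank, step):
    """Hypothetical naming scheme: one file per process (rank) per step."""
    return os.path.join(cache_dir, f"ckpt_step{step}_rank{rank}.pkl")


def write_checkpoint(cache_dir, rank, step, state):
    """Write this rank's state to the node-local cache directory.

    The data goes to a temporary file first and is renamed into place,
    so a crash mid-write never leaves a truncated checkpoint behind.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final = checkpoint_path(cache_dir, rank, step)
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename on the same file system
    return final


def read_latest_checkpoint(cache_dir, rank):
    """Return (step, state) for this rank's newest checkpoint, or None."""
    if not os.path.isdir(cache_dir):
        return None
    prefix, suffix = "ckpt_step", f"_rank{rank}.pkl"
    steps = [
        int(name[len(prefix):-len(suffix)])
        for name in os.listdir(cache_dir)
        if name.startswith(prefix) and name.endswith(suffix)
    ]
    if not steps:
        return None
    step = max(steps)
    with open(checkpoint_path(cache_dir, rank, step), "rb") as f:
        return step, pickle.load(f)


def per_node_bandwidth(aggregate_gb_s, nodes):
    """Average bandwidth per node, assuming an even split across nodes."""
    return aggregate_gb_s / nodes


if __name__ == "__main__":
    # A temporary directory stands in for a RAM disk or local hard disk.
    with tempfile.TemporaryDirectory() as cache:
        write_checkpoint(cache, rank=0, step=10, state={"iteration": 10})
        print(read_latest_checkpoint(cache, rank=0))
    print(per_node_bandwidth(720.0, 1094))
```

Under the even-split assumption, the benchmark figure quoted above (720 GB/s on 1094 nodes) corresponds to roughly 0.66 GB/s per node. Because each node writes only to its own storage, adding nodes adds bandwidth instead of dividing a shared resource.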
- Short Name / Acronym:
- SCR
- Site Accession Number:
- 4349
- Software Type:
- Scientific
- License(s):
- Other (Commercial or Open-Source)
- Research Organization:
- Lawrence Livermore National Laboratory
- Sponsoring Organization:
- USDOE
- Primary Award/Contract Number:
- AC52-07NA27344
- DOE Contract Number:
- AC52-07NA27344
- Code ID:
- 1155
- OSTI ID:
- code-1155
- Country of Origin:
- United States
Similar Records
Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System
Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System