The Scalable Checkpoint/Restart Library


Abstract

The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
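Because this record does not include usage documentation, the following is a minimal sketch of how an MPI application might drive SCR's checkpoint interface, assuming the classic C API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize) and the SCR_MAX_FILENAME constant from scr.h; the checkpoint file name and the write_state() routine are hypothetical application-specific pieces.

```c
/* Minimal sketch: periodic application-level checkpointing with SCR.
 * Assumes the classic SCR C API; write_state() and the file naming
 * scheme are hypothetical stand-ins for application code. */
#include <stdio.h>
#include <mpi.h>
#include "scr.h"

/* Hypothetical routine that writes this rank's state to the given path. */
static int write_state(const char* path)
{
    FILE* fp = fopen(path, "w");
    if (fp == NULL) {
        return 1; /* failed to open checkpoint file */
    }
    fprintf(fp, "application state goes here\n");
    fclose(fp);
    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    SCR_Init(); /* SCR locates and fetches any cached checkpoint for restart */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... compute one timestep ... */

        int need_ckpt = 0;
        SCR_Need_checkpoint(&need_ckpt); /* ask SCR whether to checkpoint now */
        if (need_ckpt) {
            SCR_Start_checkpoint();

            /* Register the file name with SCR; SCR_Route_file returns the
             * path to write to, typically in node-local storage
             * (RAM disk or local disk) rather than the parallel file system. */
            char name[256];
            char path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "ckpt_rank%d_step%d.dat", rank, step);
            SCR_Route_file(name, path);

            int valid = (write_state(path) == 0);
            SCR_Complete_checkpoint(valid); /* SCR applies cross-node redundancy */
        }
    }

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}
```

The routing step is where the node-local caching described in the abstract comes in: each rank writes its checkpoint to dedicated local storage at aggregate bandwidth that scales with node count, and SCR can later flush a cached checkpoint to the shared parallel file system as configured.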
Developers:
Moody, A.
Release Date:
2009-02-23
Project Type:
Open Source, Publicly Available Repository
Software Type:
Scientific
Licenses:
Other (Commercial or Open-Source): https://github.com/LLNL/scr/blob/master/LICENSE.TXT
Sponsoring Org.:
USDOE
Code ID:
1155
Site Accession Number:
4349
Research Org.:
Lawrence Livermore National Laboratory
Country of Origin:
United States
Keywords:
ECP


Citation Formats

Moody, A. The Scalable Checkpoint/Restart Library. Computer Software. https://github.com/LLNL/scr. USDOE. 23 Feb. 2009. Web. doi:10.11578/dc.20171025.1160.
Moody, A. (2009, February 23). The Scalable Checkpoint/Restart Library. [Computer software]. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
Moody, A. "The Scalable Checkpoint/Restart Library." Computer software. February 23, 2009. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
@misc{ doecode_1155,
title = {The Scalable Checkpoint/Restart Library},
author = {Moody, A.},
abstractNote = {The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.},
doi = {10.11578/dc.20171025.1160},
url = {https://doi.org/10.11578/dc.20171025.1160},
howpublished = {[Computer Software] \url{https://doi.org/10.11578/dc.20171025.1160}},
year = {2009},
month = {feb}
}