Abstract
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
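The interface the abstract refers to is SCR's C API for MPI applications. The sketch below illustrates how a code might drive one checkpoint through that API; it assumes the classic calls documented in the project's user guide (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize), and the loop structure, file name, and payload are illustrative placeholders rather than anything specified in this record. Confirm exact signatures against the repository documentation.

/* Minimal sketch: each MPI rank writes its own checkpoint file through SCR.
 * SCR_Route_file returns the node-local cache path the rank should write to,
 * which is how SCR keeps checkpoint traffic off the parallel file system. */
#include <stdio.h>
#include <mpi.h>
#include "scr.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    SCR_Init();                       /* initialize SCR after MPI */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... application computation for this step ... */

        int need = 0;
        SCR_Need_checkpoint(&need);   /* ask SCR whether to checkpoint now */
        if (need) {
            SCR_Start_checkpoint();

            /* the rank picks a logical file name; SCR maps it to a path
             * in node-local storage (name and payload are placeholders) */
            char name[256], path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
            SCR_Route_file(name, path);

            int valid = 0;
            FILE* fp = fopen(path, "w");
            if (fp != NULL) {
                fprintf(fp, "step=%d\n", step);   /* placeholder checkpoint data */
                valid = (fclose(fp) == 0);
            }

            /* report whether this rank's file was written successfully */
            SCR_Complete_checkpoint(valid);
        }
    }

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}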
- Developers:
- Moody, A.
- Release Date:
- 2009-02-23
- Project Type:
- Open Source, Publicly Available Repository
- Software Type:
- Scientific
- Licenses:
- Other (Commercial or Open-Source): https://github.com/LLNL/scr/blob/master/LICENSE.TXT
- Sponsoring Org.:
- USDOE
- Primary Award/Contract Number:
- AC52-07NA27344
- Code ID:
- 1155
- Site Accession Number:
- 4349
- Research Org.:
- Lawrence Livermore National Laboratory
- Country of Origin:
- United States
- Keywords:
- ECP
Citation Formats
Moody, A. The Scalable Checkpoint/Restart Library. Computer Software. https://github.com/LLNL/scr. USDOE. 23 Feb. 2009. Web. doi:10.11578/dc.20171025.1160.
Moody, A. (2009, February 23). The Scalable Checkpoint/Restart Library. [Computer software]. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
Moody, A. "The Scalable Checkpoint/Restart Library." Computer software. February 23, 2009. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
@misc{doecode_1155,
title = {The Scalable Checkpoint/Restart Library},
author = {Moody, A.},
abstractNote = {The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.},
doi = {10.11578/dc.20171025.1160},
url = {https://doi.org/10.11578/dc.20171025.1160},
howpublished = {[Computer Software] \url{https://doi.org/10.11578/dc.20171025.1160}},
year = {2009},
month = {feb}
}