The Scalable Checkpoint/Restart Library


Abstract

The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
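Because this record does not include usage documentation, the following is a minimal sketch of how an MPI application might drive SCR's checkpoint interface, assuming the classic C API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize) and the SCR_MAX_FILENAME constant from scr.h; the checkpoint file name and the write_state() routine are hypothetical application-specific pieces.

```c
/* Minimal sketch: periodic application-level checkpointing with SCR.
 * Assumes the classic SCR C API; write_state() and the file naming
 * scheme are hypothetical stand-ins for application code. */
#include <stdio.h>
#include <mpi.h>
#include "scr.h"

/* Hypothetical routine that writes this rank's state to the given path. */
static int write_state(const char* path)
{
    FILE* fp = fopen(path, "w");
    if (fp == NULL) {
        return 1; /* failed to open checkpoint file */
    }
    fprintf(fp, "application state goes here\n");
    fclose(fp);
    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    SCR_Init(); /* SCR locates and fetches any cached checkpoint for restart */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... compute one timestep ... */

        int need_ckpt = 0;
        SCR_Need_checkpoint(&need_ckpt); /* ask SCR whether to checkpoint now */
        if (need_ckpt) {
            SCR_Start_checkpoint();

            /* Register the file name with SCR; SCR_Route_file returns the
             * path to write to, typically in node-local storage
             * (RAM disk or local disk) rather than the parallel file system. */
            char name[256];
            char path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "ckpt_rank%d_step%d.dat", rank, step);
            SCR_Route_file(name, path);

            int valid = (write_state(path) == 0);
            SCR_Complete_checkpoint(valid); /* SCR applies cross-node redundancy */
        }
    }

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}
```

The routing step is where the node-local caching described in the abstract comes in: each rank writes its checkpoint to dedicated local storage at aggregate bandwidth that scales with node count, and SCR can later flush a cached checkpoint to the shared parallel file system as configured.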
Developers:
Moody, A.
Release Date:
2009-02-23
Project Type:
Open Source, Publicly Available Repository
Software Type:
Scientific
Licenses:
Other (Commercial or Open-Source): https://github.com/LLNL/scr/blob/master/LICENSE.TXT
Sponsoring Org.:
USDOE
Code ID:
1155
Site Accession Number:
4349
Research Org.:
Lawrence Livermore National Laboratory
Country of Origin:
United States
Keywords:
ECP


Citation Formats

Moody, A. The Scalable Checkpoint/Restart Library. Computer Software. https://github.com/LLNL/scr. USDOE. 23 Feb. 2009. Web. doi:10.11578/dc.20171025.1160.
Moody, A. (2009, February 23). The Scalable Checkpoint/Restart Library. [Computer software]. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
Moody, A. "The Scalable Checkpoint/Restart Library." Computer software. February 23, 2009. https://github.com/LLNL/scr. https://doi.org/10.11578/dc.20171025.1160.
@misc{ doecode_1155,
title = {The Scalable Checkpoint/Restart Library},
author = {Moody, A.},
abstractNote = {The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.},
doi = {10.11578/dc.20171025.1160},
url = {https://doi.org/10.11578/dc.20171025.1160},
howpublished = {[Computer Software] \url{https://doi.org/10.11578/dc.20171025.1160}},
year = {2009},
month = {feb}
}