Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Mohror, K; Moody, A; de Supinski, B R

doi:10.1109/DSNW.2012.6264668

Conference · Mar 20, 2012

DOI: https://doi.org/10.1109/DSNW.2012.6264668 · OSTI ID: 1047769

Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15x. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
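The asynchronous idea in the abstract can be illustrated with a minimal sketch. It does not use the SCR or MRNet APIs: a background thread stands in for the tree-based overlay network, and two temporary directories stand in for node-local storage and the parallel file system. All function names (`write_checkpoint`, `flush_async`, `run_demo`) are hypothetical and chosen only for illustration.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def write_checkpoint(cache_dir: Path, step: int, state: bytes) -> Path:
    """Blocking part: save application state to fast node-local storage."""
    path = cache_dir / f"ckpt_{step}.bin"
    path.write_bytes(state)
    return path


def flush_async(local_path: Path, pfs_dir: Path) -> threading.Thread:
    """Non-blocking part: copy the checkpoint to stable storage in the background."""
    target = pfs_dir / local_path.name
    worker = threading.Thread(target=shutil.copy2, args=(local_path, target))
    worker.start()
    return worker


def run_demo() -> bytes:
    # Assumption: temporary directories stand in for the node-local cache
    # and the parallel file system.
    with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as pfs:
        cache_dir, pfs_dir = Path(cache), Path(pfs)
        ckpt = write_checkpoint(cache_dir, step=1, state=b"application state")
        worker = flush_async(ckpt, pfs_dir)
        # The application would continue computing here instead of waiting on I/O.
        worker.join()  # Joined only so the demo can verify the copy.
        return (pfs_dir / ckpt.name).read_bytes()


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the application blocks only for the local write; the slower copy to stable storage overlaps with computation, which is the source of the efficiency gain the abstract describes.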

Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: W-7405-ENG-48

OSTI ID: 1047769

Report Number(s): LLNL-PROC-540391; TRN: US201216%%484

Resource Relation: Conference: Presented at: Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012), Boston, MA, United States, Jun 25 - Jun 28, 2012

Country of Publication: United States

Language: English

Similar Records

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report · Feb 21, 2022 · OSTI ID:1047769

Dai, Donglai

The Scalable Checkpoint/Restart Library

Software · Feb 23, 2009 · OSTI ID:1047769

Moody, A.

Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

Technical Report · Apr 9, 2010 · OSTI ID:1047769

Moody, A T; Bronevetsky, G; Mohror, K M; +1 more

Related Subjects

97 MATHEMATICAL METHODS AND COMPUTING
EFFICIENCY
IMPLEMENTATION
STORAGE
SUPERCOMPUTERS