Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library
Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15x. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
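The benefit of asynchronous transfer described in the abstract can be illustrated with a minimal sketch. It is not the SCR or MRNet API: the class `AsyncFlusher`, the functions `write_checkpoint` and `run`, and the two temporary directories standing in for node-local storage and the parallel file system are all hypothetical, and a plain background thread is swapped in for the tree-based overlay network. The point is only that the application blocks for the fast local write and then resumes computing while the copy to slower storage proceeds.

```python
import os
import shutil
import tempfile
import threading


def write_checkpoint(path, state):
    """Blocking write of the checkpoint to fast node-local storage."""
    with open(path, "wb") as f:
        f.write(state)


class AsyncFlusher:
    """Hypothetical helper: copies a local checkpoint to a slower
    'parallel file system' directory in a background thread."""

    def __init__(self, pfs_dir):
        self.pfs_dir = pfs_dir
        self._thread = None

    def start(self, local_path):
        # Allow only one outstanding transfer: wait for the previous one.
        self.wait()
        dest = os.path.join(self.pfs_dir, os.path.basename(local_path))
        self._thread = threading.Thread(
            target=shutil.copyfile, args=(local_path, dest)
        )
        self._thread.start()
        return dest

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def run(steps=3):
    # Temporary directories stand in for node-local storage and the
    # parallel file system (an assumption made for this sketch).
    local_dir = tempfile.mkdtemp(prefix="local_")
    pfs_dir = tempfile.mkdtemp(prefix="pfs_")
    flusher = AsyncFlusher(pfs_dir)
    flushed = []
    for step in range(steps):
        state = f"state at step {step}".encode()
        local_path = os.path.join(local_dir, f"ckpt_{step}.bin")
        write_checkpoint(local_path, state)       # application blocks here only
        flushed.append(flusher.start(local_path)) # transfer runs in background
        # ... the application's computation would continue here ...
    flusher.wait()  # drain the final transfer before exiting
    return flushed


if __name__ == "__main__":
    for path in run():
        print(path)
```

In this sketch a new transfer waits for the previous one to finish, so a slow drain would eventually stall the application; a real system has to manage that back-pressure across many nodes at once.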
- Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA
- Sponsoring Organization: USDOE
- DOE Contract Number: W-7405-ENG-48
- OSTI ID: 1047769
- Report Number(s): LLNL-PROC-540391
- Country of Publication: United States
- Language: English
Similar Records
The Scalable Checkpoint/Restart Library
Software · Sun Feb 22 19:00:00 EST 2009 · OSTI ID:code-1155
Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Journal Article · Mon Sep 01 00:00:00 EDT 2014 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1225695
Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System
Technical Report · Fri Apr 09 00:00:00 EDT 2010 · OSTI ID:984082