The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints
- Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Summary: Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large‐scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next‐generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
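The summary's central idea, deferring each checkpoint until the next collective operation completes, can be illustrated with a toy model. The sketch below is not the authors' simulation framework or an actual MPI extension. The function names, the input format (a sorted list of collective completion times for one process), and the rule that a checkpoint becomes due one nominal interval after the previous checkpoint are all assumptions made for illustration.

```python
def checkpoint_times(collective_done, interval):
    """Return the times at which checkpoints are taken (hypothetical model).

    collective_done: sorted completion times of collective operations
    interval: nominal checkpoint interval, in the same time units

    Assumption: a checkpoint becomes due `interval` units after the
    previous checkpoint (or after time 0) and is taken at the first
    collective completion at or after that due time.
    """
    taken = []
    due = interval
    for t in collective_done:
        if t >= due:
            taken.append(t)       # checkpoint right after this collective
            due = t + interval    # schedule the next nominal due time
    return taken


def effective_intervals(taken):
    """Gaps between consecutive checkpoints, measured from time 0."""
    gaps = []
    prev = 0
    for t in taken:
        gaps.append(t - prev)
        prev = t
    return gaps


# Collectives complete at these times; the nominal interval is 10 units.
times = checkpoint_times([3, 7, 12, 18, 21, 30], 10)
print(times)                       # [12, 30]
print(effective_intervals(times))  # [12, 18]
```

In this model the effective interval exceeds the nominal one by the wait for the next collective, so workloads that issue collectives frequently stay close to the nominal interval, which is the effect the summary reports for several workloads.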
- Sponsoring Organization:
- USDOE
- Grant/Contract Number:
- NA0003525
- OSTI ID:
- 1469227
- Journal Information:
- Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 32; Journal Issue: 3; ISSN 1532-0626
- Publisher:
- Wiley Blackwell (John Wiley & Sons)
- Country of Publication:
- United Kingdom
- Language:
- English