Title: The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.4890 · OSTI ID: 1469227

Summary: Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large‐scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next‐generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
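The idea of deferring each checkpoint until a collective operation completes can be illustrated with a small timing model. The sketch below is not the paper's simulation framework or an MPI implementation; the function name `approx_coordinated_checkpoints`, its parameters, and the simplification that a checkpoint takes zero time are all assumptions made for illustration. It shows how the effective checkpoint interval can drift from the nominal one when checkpoints wait for the next collective.

```python
from bisect import bisect_left


def approx_coordinated_checkpoints(collective_end_times, nominal_interval):
    """Return the times at which a process would take checkpoints.

    Hypothetical model (an assumption, not taken from the paper): a
    checkpoint becomes due `nominal_interval` seconds after the previous
    one, but is deferred until the next collective operation completes.
    Checkpoint duration is ignored for simplicity.
    """
    if nominal_interval <= 0:
        raise ValueError("nominal_interval must be positive")

    times = sorted(collective_end_times)
    checkpoints = []
    last = 0  # time of the previous checkpoint (start of run initially)
    while True:
        due = last + nominal_interval
        # Index of the first collective that completes at or after `due`.
        i = bisect_left(times, due)
        if i == len(times):
            break  # no further collectives, so no further checkpoints
        last = times[i]
        checkpoints.append(last)
    return checkpoints


# Collectives complete every 3 s; the nominal checkpoint interval is 10 s.
print(approx_coordinated_checkpoints(range(3, 31, 3), 10))  # → [12, 24]
```

In this toy run the effective interval is 12 s rather than the nominal 10 s, because each checkpoint waits for the first collective that completes after the timer expires. When collectives are frequent relative to the interval, the gap between the two shrinks.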

Sponsoring Organization:
USDOE
Grant/Contract Number:
NA0003525
OSTI ID:
1469227
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 32, Issue 3; ISSN 1532-0626
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United Kingdom
Language:
English
Citation Metrics:
Cited by: 1 work (Web of Science)
