DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Failure recovery for bulk synchronous applications with MPI stages

Journal Article · Parallel Computing
 [1];  [2];  [2];  [3];  [3]
  1. Auburn Univ., AL (United States)
  2. Univ. of Tennessee at Chattanooga, TN (United States)
  3. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes, which incurs unnecessary overhead for the live processes. To avoid such overheads and the concomitant delays, we introduce the concept of “MPI Stages.” MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, MPI state and application state are each recovered from their last synchronous checkpoints, and execution continues without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. As a result, this approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart.
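The rollback behavior described in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: it does not use MPI or the MPI Stages interface, and the names `run`, `CHECKPOINT_INTERVAL`, `fail_at`, and the `mpi_state` dictionary are hypothetical stand-ins chosen for illustration. It models a bulk synchronous loop that checkpoints application state together with a placeholder for internal MPI state, and on a simulated failure restores both and resumes inside the loop rather than from the start of the program.

```python
import copy

CHECKPOINT_INTERVAL = 5  # assumption: take a synchronous checkpoint every 5 iterations


def run(total_iters, fail_at=None, num_ranks=4):
    """Toy bulk synchronous loop with in-loop rollback (not real MPI).

    Returns (per-rank application state, number of loop bodies executed).
    """
    app_state = [0] * num_ranks   # one value per simulated rank
    mpi_state = {"epoch": 0}      # hypothetical stand-in for internal MPI state

    # Checkpoint holds application state and "MPI state" from the same iteration.
    checkpoint = {"iter": 0,
                  "app": copy.deepcopy(app_state),
                  "mpi": copy.deepcopy(mpi_state)}

    executed = 0
    i = 0
    failure_pending = fail_at is not None
    while i < total_iters:
        if failure_pending and i == fail_at:
            failure_pending = False
            # Recovery: every rank (live and replacement) restores both kinds of
            # state from the last checkpoint and resumes inside the loop.
            app_state = copy.deepcopy(checkpoint["app"])
            mpi_state = copy.deepcopy(checkpoint["mpi"])
            i = checkpoint["iter"]
            continue

        # One superstep: each simulated rank does some work.
        for r in range(num_ranks):
            app_state[r] += r + 1
        mpi_state["epoch"] += 1
        executed += 1
        i += 1

        if i % CHECKPOINT_INTERVAL == 0:
            checkpoint = {"iter": i,
                          "app": copy.deepcopy(app_state),
                          "mpi": copy.deepcopy(mpi_state)}

    return app_state, executed


if __name__ == "__main__":
    print(run(12))             # failure-free run
    print(run(12, fail_at=8))  # failure at iteration 8, rollback to iteration 5
```

In this toy model a failure at iteration 8 costs three re-executed iterations (5 through 7), and the final state matches the failure-free run. The real cost of recovery in an MPI job also includes detecting the failure and reintegrating the replacement process, which this sketch does not model.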

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
National Science Foundation; USDOE; USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1784608
Report Number(s):
LLNL-JRNL--759751; 948325
Journal Information:
Parallel Computing, Vol. 84; ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

