OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications

Abstract

When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes---this incurs unnecessary overhead for live processes. To avoid such overheads and concomitant delays, we introduce the concept of "MPI Stages." MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, both MPI and application state are recovered, respectively, from their last synchronous checkpoints and continue without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. This approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart. In this paper, we identify requirements for production MPI implementations to support state checkpointing with MPI Stages, which include capturing and managing internal MPI state and serializing and deserializing user handles to MPI objects. We evaluate our fault tolerance approach with a proof-of-concept prototype MPI implementation that includes MPI Stages. We demonstrate its functionality and performance using LULESH and microbenchmarks. Our results show that MPI Stages reduces the recovery time by 13× for LULESH in comparison to checkpoint/restart.
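The recovery semantics described in the abstract---synchronous checkpoints of both application and MPI state, live processes rolling back only to the last checkpoint, and a replacement process deserializing the saved MPI state---can be illustrated with a minimal, single-process Python simulation. All names here (Checkpoint, run_with_stages, the stand-in mpi_state dict) are illustrative inventions for this sketch, not part of the MPI Stages prototype or any real MPI implementation:

```python
import copy

class Checkpoint:
    """A synchronized snapshot of application state together with the
    'MPI state' (here a stand-in dict of communicator metadata that a
    real implementation would serialize, per the paper's requirements)."""
    def __init__(self, iteration, app_state, mpi_state):
        self.iteration = iteration
        self.app_state = copy.deepcopy(app_state)
        self.mpi_state = copy.deepcopy(mpi_state)

def run_with_stages(total_iters, ckpt_interval, fail_at=None):
    app_state = {"value": 0}
    # Stand-in for internal MPI state (communicator size, rank membership).
    mpi_state = {"comm_size": 4, "ranks": [0, 1, 2, 3]}
    ckpt = Checkpoint(0, app_state, mpi_state)
    recoveries = 0

    it = 0
    while it < total_iters:
        if fail_at is not None and it == fail_at:
            # Simulated failure: instead of re-queuing the whole job,
            # roll back a few iterations to the last synchronous checkpoint.
            fail_at = None
            recoveries += 1
            it = ckpt.iteration
            app_state = copy.deepcopy(ckpt.app_state)
            # The replacement process would deserialize this MPI state
            # and reintegrate without a full restart.
            mpi_state = copy.deepcopy(ckpt.mpi_state)
            continue

        app_state["value"] += 1   # one bulk-synchronous iteration
        it += 1

        if it % ckpt_interval == 0:
            # Synchronous checkpoint of application + MPI state together.
            ckpt = Checkpoint(it, app_state, mpi_state)

    return app_state["value"], recoveries

final, recoveries = run_with_stages(total_iters=10, ckpt_interval=4, fail_at=6)
# A failure at iteration 6 rolls back to the checkpoint at iteration 4,
# re-executes the lost iterations, and the run still completes all 10.
```

The sketch captures why the approach suits bulk synchronous codes: because checkpoints of application and MPI state are taken at the same loop boundary, rollback needs no message replay, only re-execution of the iterations since the last checkpoint.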

Authors:
 Sultana, Nawrin [1];  Skjellum, Anthony [2];  Laguna, Ignacio [3];  Farmer, Matthew Shane [1];  Mohror, Kathryn [3];  Emani, Murali [3]
  1. Auburn University, Auburn, AL
  2. University of Tennessee at Chattanooga, Chattanooga, TN
  3. Lawrence Livermore National Laboratory, Livermore, CA
Publication Date:
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1544207
Resource Type:
Journal Article
Journal Name:
EuroMPI'18 Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23 - 26, 2018
Country of Publication:
United States
Language:
English

Citation Formats

Sultana, Nawrin, Skjellum, Anthony, Laguna, Ignacio, Farmer, Matthew Shane, Mohror, Kathryn, and Emani, Murali. MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications. United States: N. p., 2018. Web. doi:10.1145/3236367.3236385.
@article{osti_1544207,
title = {MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications},
author = {Sultana, Nawrin and Skjellum, Anthony and Laguna, Ignacio and Farmer, Matthew Shane and Mohror, Kathryn and Emani, Murali},
abstractNote = {When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes---this incurs unnecessary overhead for live processes. To avoid such overheads and concomitant delays, we introduce the concept of "MPI Stages." MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, both MPI and application state are recovered, respectively, from their last synchronous checkpoints and continue without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. This approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart. In this paper, we identify requirements for production MPI implementations to support state checkpointing with MPI Stages, which include capturing and managing internal MPI state and serializing and deserializing user handles to MPI objects. We evaluate our fault tolerance approach with a proof-of-concept prototype MPI implementation that includes MPI Stages. We demonstrate its functionality and performance using LULESH and microbenchmarks. Our results show that MPI Stages reduces the recovery time by 13× for LULESH in comparison to checkpoint/restart.},
doi = {10.1145/3236367.3236385},
journal = {EuroMPI'18 Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23 - 26, 2018},
place = {United States},
year = {2018},
month = {1}
}

Works referenced in this record:

Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities
journal, July 2009

  • Cappello, Franck
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 3
  • DOI: 10.1177/1094342009106189

Post-failure recovery of MPI communication capability: Design and rationale
journal, June 2013

  • Bland, Wesley; Bouteiller, Aurelien; Herault, Thomas
  • The International Journal of High Performance Computing Applications, Vol. 27, Issue 3
  • DOI: 10.1177/1094342013488238

Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

  • Laguna, Ignacio; Richards, David F.; Gamblin, Todd
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015623623

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.78

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
journal, November 2005

  • Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian
  • The International Journal of High Performance Computing Applications, Vol. 19, Issue 4
  • DOI: 10.1177/1094342005056139

Redesigning the message logging model for high performance
journal, June 2010

  • Bouteiller, Aurelien; Bosilca, George; Dongarra, Jack
  • Concurrency and Computation: Practice and Experience, Vol. 22, Issue 16
  • DOI: 10.1002/cpe.1589

Unified model for assessing checkpointing protocols at extreme-scale
journal, November 2013

  • Bosilca, George; Bouteiller, Aurélien; Brunet, Elisabeth
  • Concurrency and Computation: Practice and Experience, Vol. 26, Issue 17
  • DOI: 10.1002/cpe.3173