U.S. Department of Energy
Office of Scientific and Technical Information

MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications

Journal Article · EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23-26, 2018
  1. Auburn University, Auburn, AL
  2. University of Tennessee at Chattanooga, Chattanooga, TN
  3. Lawrence Livermore National Laboratory, Livermore, CA
When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes; this incurs unnecessary overhead for the live processes. To avoid such overheads and concomitant delays, we introduce the concept of "MPI Stages." MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, both MPI and application state are recovered from their last synchronous checkpoints, and the job continues without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. This approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart. In this paper, we identify requirements for production MPI implementations to support state checkpointing with MPI Stages, which include capturing and managing internal MPI state and serializing and deserializing user handles to MPI objects. We evaluate our fault tolerance approach with a proof-of-concept prototype MPI implementation that includes MPI Stages. We demonstrate its functionality and performance using LULESH and microbenchmarks. Our results show that MPI Stages reduces the recovery time by 13× for LULESH in comparison to checkpoint/restart.
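The recovery model described in the abstract can be illustrated with a small standalone simulation (plain Python, no MPI). This is a minimal sketch of the cost argument only; the function name, numbers, and the `startup_cost` parameter are illustrative assumptions, not the paper's API or measured values.

```python
# Toy model of recovery cost: classic checkpoint/restart (full job
# restart) vs. an MPI-Stages-style in-place rollback.
# All names and numbers here are illustrative, not from the paper.

def wasted_work(fail_iter, ckpt_interval, startup_cost, full_restart):
    """Work lost (in iteration-equivalents) when a failure hits at
    iteration `fail_iter` and execution resumes from the last
    synchronous checkpoint."""
    last_ckpt = (fail_iter // ckpt_interval) * ckpt_interval
    redo = fail_iter - last_ckpt  # iterations to re-execute
    # A full restart re-queues the job and re-initializes every
    # process (MPI setup, data reload). MPI-Stages-style recovery
    # restores MPI and application state from their checkpoints, so
    # live processes skip that startup overhead.
    overhead = startup_cost if full_restart else 0
    return redo + overhead

FAIL_AT = 950    # failure strikes at iteration 950 (assumed)
INTERVAL = 100   # synchronous checkpoint every 100 iterations (assumed)
STARTUP = 400    # restart overhead in iteration-equivalents (assumed)

classic = wasted_work(FAIL_AT, INTERVAL, STARTUP, full_restart=True)
stages = wasted_work(FAIL_AT, INTERVAL, STARTUP, full_restart=False)
print(f"classic C/R loses {classic} iteration-equivalents")   # 450
print(f"stages-style loses {stages} iteration-equivalents")   # 50
```

Under these assumed parameters both schemes redo the same 50 iterations since the last checkpoint, but only the full restart also pays the startup overhead, which is where a speedup of the kind the paper reports for LULESH would come from.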
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Organization:
USDOE Office of Science (SC)
OSTI ID:
1544207
Journal Information:
EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23-26, 2018
Country of Publication:
United States
Language:
English


Similar Records

Failure recovery for bulk synchronous applications with MPI stages
Journal Article · 2019 · Parallel Computing · OSTI ID: 1784608

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Conference · 2006 · OSTI ID: 931501

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
Conference · November 2014 · SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis · OSTI ID: 1567373
