
Title: EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.4863 · OSTI ID: 1708993

Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
Grant/Contract Number:
AC52-07NA27344; CCF-1565414; CNS-1419123; CNS-1513120; ACI-1450440; IIS-1447804
OSTI ID:
1708993
Alternate ID(s):
OSTI ID: 1464593
Report Number(s):
LLNL-JRNL-706037; 841204
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 32, Issue 3; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 13 works
Citation information provided by
Web of Science


Cited By (2)

Foreword to the Special Issue of the Workshop on Exascale MPI (ExaMPI 2017)
  • Skjellum, Anthony; Bangalore, Purushotham V.; Grant, Ryan E.
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 3 https://doi.org/10.1002/cpe.5459
journal July 2019
Application health monitoring for extreme‐scale resiliency using cooperative fault management journal July 2019
