DOE PAGES: U.S. Department of Energy, Office of Scientific and Technical Information

Title: EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Abstract

Scientists from many different fields develop bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the MPI profiling interface, which precludes the use of other tools that rely on that interface. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI alone. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

Authors:
 Chakraborty, Sourav [1]; Laguna, Ignacio [2]; Emani, Murali [2]; Mohror, Kathryn [2]; Panda, Dhabaleswar K. [1]; Schulz, Martin [3]; Subramoni, Hari [1]
  1. The Ohio State Univ., Columbus, OH (United States)
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  3. Technical Univ. of Munich (Germany)
Publication Date:
August 14, 2018
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
OSTI Identifier:
1708993
Alternate Identifier(s):
OSTI ID: 1464593
Report Number(s):
LLNL-JRNL-706037
Journal ID: ISSN 1532-0626; 841204
Grant/Contract Number:  
AC52-07NA27344; CCF-1565414; CNS-1419123; CNS-1513120; ACI-1450440; IIS-1447804
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation: Practice and Experience
Additional Journal Information:
Journal Volume: 32; Journal Issue: 3; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; fault tolerance; high-performance computing; MPI; resilience

Citation Formats

Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States: N. p., 2018. Web. doi:10.1002/cpe.4863.
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, & Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States. https://doi.org/10.1002/cpe.4863
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. Tue Aug 14, 2018. "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications". United States. https://doi.org/10.1002/cpe.4863. https://www.osti.gov/servlets/purl/1708993.
@article{osti_1708993,
title = {EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications},
author = {Chakraborty, Sourav and Laguna, Ignacio and Emani, Murali and Mohror, Kathryn and Panda, Dhabaleswar K. and Schulz, Martin and Subramoni, Hari},
abstractNote = {Scientists from many different fields develop bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the MPI profiling interface, which precludes the use of other tools that rely on that interface. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI alone. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.},
doi = {10.1002/cpe.4863},
journal = {Concurrency and Computation: Practice and Experience},
number = 3,
volume = 32,
place = {United States},
year = {2018},
month = {aug}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 13 works
Citation information provided by
Web of Science

Works referenced in this record:

An analysis of algorithm-based fault tolerance techniques
journal, April 1988


Toward resilient algorithms and applications
conference, January 2013

  • Heroux, Michael A.
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale - FTXS '13
  • DOI: 10.1145/2465813.2465814

The Open Run-Time Environment (OpenRTE): A transparent multicluster environment for high-performance computing
journal, February 2008


Local recovery and failure masking for stencil-based applications at extreme scales
conference, January 2015

  • Gamell, Marc; Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
  • DOI: 10.1145/2807591.2807672

A bridging model for parallel computation
journal, August 1990


Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
conference, January 2014

  • Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642774

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

An evaluation of User-Level Failure Mitigation support in MPI
journal, May 2013


Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

  • Laguna, Ignacio; Richards, David F.; Gamblin, Todd
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015623623

Evaluating the viability of process replication reliability for exascale systems
conference, January 2011

  • Ferreira, Kurt; Stearley, Jon; Laros, James H.
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
  • DOI: 10.1145/2063384.2063443

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
journal, November 2005

  • Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian
  • The International Journal of High Performance Computing Applications, Vol. 19, Issue 4
  • DOI: 10.1177/1094342005056139

SLURM: Simple Linux Utility for Resource Management
book, January 2003

  • Yoo, Andy B.; Jette, Morris A.; Grondona, Mark
  • Job Scheduling Strategies for Parallel Processing
  • DOI: 10.1007/10968987_3

Low-latency, concurrent checkpointing for parallel programs
journal, January 1994

  • Li, Kai; Naughton, J. F.; Plank, J. S.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 8
  • DOI: 10.1109/71.298215

Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
book, January 2004

  • Gabriel, Edgar; Fagg, Graham E.; Bosilca, George
  • Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • DOI: 10.1007/978-3-540-30218-6_19

PNMPI tools: a whole lot greater than the sum of their parts
conference, January 2007

  • Schulz, Martin; de Supinski, Bronis R.
  • Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
  • DOI: 10.1145/1362622.1362663

Recovery in distributed systems using asynchronous message logging and checkpointing
conference, January 1988

  • Johnson, David B.; Zwaenepoel, Willy
  • Proceedings of the seventh annual ACM Symposium on Principles of distributed computing - PODC '88
  • DOI: 10.1145/62546.62575

Interconnect agnostic checkpoint/restart in open MPI
conference, January 2009

  • Hursey, Joshua; Mattox, Timothy I.; Lumsdaine, Andrew
  • Proceedings of the 18th ACM international symposium on High performance distributed computing - HPDC '09
  • DOI: 10.1145/1551609.1551619

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
journal, February 2013

  • Egwutuoha, Ifeanyi P.; Levy, David; Selic, Bran
  • The Journal of Supercomputing, Vol. 65, Issue 3
  • DOI: 10.1007/s11227-013-0884-0

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
conference, January 2013

  • Li, Dong; Chen, Zizhong; Wu, Panruo
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
  • DOI: 10.1145/2503210.2503226

Distributed snapshots: determining global states of distributed systems
journal, February 1985

  • Chandy, K. Mani; Lamport, Leslie
  • ACM Transactions on Computer Systems, Vol. 3, Issue 1
  • DOI: 10.1145/214451.214456

PMI Extensions for Scalable MPI Startup
conference, January 2014

  • Chakraborty, S.; Subramoni, H.; Perkins, J.
  • Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642780

A survey of rollback-recovery protocols in message-passing systems
journal, September 2002

  • Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
  • ACM Computing Surveys, Vol. 34, Issue 3
  • DOI: 10.1145/568522.568525

FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
book, January 2000

  • Fagg, Graham E.; Dongarra, Jack J.
  • Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • DOI: 10.1007/3-540-45255-9_47

Algorithm-based fault tolerance on a hypercube multiprocessor
journal, January 1990

  • Banerjee, P.; Rahmeh, J. T.; Stunkel, C.
  • IEEE Transactions on Computers, Vol. 39, Issue 9
  • DOI: 10.1109/12.57055

Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
conference, June 2014

  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/DSN.2014.78

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.78

FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
conference, May 2014

  • Sato, Kento; Moody, Adam; Mohror, Kathryn
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.126

HARNESS and fault tolerant MPI
journal, October 2001


Simplifying the Recovery Model of User-Level Failure Mitigation
conference, November 2014

  • Bland, Wesley; Raffenetti, Kenneth; Balaji, Pavan
  • 2014 Workshop on Exascale MPI at Supercomputing Conference (ExaMPI)
  • DOI: 10.1109/ExaMPI.2014.4

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
conference, May 2013

  • Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2013.115

Failure Detection and Propagation in HPC systems
conference, November 2016

  • Bosilca, George; Bouteiller, Aurelien; Guermouche, Amina
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2016.26

Non-Blocking PMI Extensions for Fast MPI Startup
conference, May 2015

  • Chakraborty, Sourav; Subramoni, Hari; Moody, Adam
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.151

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2010.18

Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand
conference, September 2009

  • Gangadharappa, Tejus; Koop, Matthew; Panda, Dhabaleswar K.
  • 2009 International Conference on Parallel Processing Workshops (ICPPW)
  • DOI: 10.1109/ICPPW.2009.77

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
conference, March 2007

  • Hursey, Joshua; Squyres, Jeffrey M.; Mattox, Timothy I.
  • 2007 IEEE International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2007.370605

Replication-Based Fault Tolerance for MPI Applications
journal, July 2009

  • Walters, J. P.; Chaudhary, V.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 7
  • DOI: 10.1109/TPDS.2008.172

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
conference, May 2015

  • Chandrasekar, Raghunath Raja; Venkatesh, Akshay; Hamidouche, Khaled
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.169

Coordinated checkpoint versus message log for fault tolerant MPI
journal, January 2004

  • Lemarinier, Pierre; Bouteiller, Aurelien; Krawezik, Geraud
  • International Journal of High Performance Computing and Networking, Vol. 2, Issue 2/3/4
  • DOI: 10.1504/IJHPCN.2004.008899

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
conference, November 2006

  • Coti, Camille; Herault, Thomas; Lemarinier, Pierre
  • ACM/IEEE SC 2006 Conference (SC'06)
  • DOI: 10.1109/SC.2006.15

Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984

  • Huang, Kuang-Hua; Abraham, Jacob A.
  • IEEE Transactions on Computers, Vol. C-33, Issue 6
  • DOI: 10.1109/TC.1984.1676475

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
conference, January 2002

  • Bosilca, G.; Bouteiller, A.; Cappello, F.
  • ACM/IEEE SC 2002 Conference (SC'02)
  • DOI: 10.1109/SC.2002.10048

An Analysis Of Algorithm-Based Fault Tolerance Techniques
conference, April 1986

  • Luk, Franklin T.; Park, Haesun
  • 30th Annual Technical Symposium, SPIE Proceedings
  • DOI: 10.1117/12.936896

Toward Resilient Algorithms and Applications
preprint, January 2014


Works referencing / citing this record:

Foreword to the Special Issue of the Workshop on Exascale MPI (ExaMPI 2017)
journal, July 2019

  • Skjellum, Anthony; Bangalore, Purushotham V.; Grant, Ryan E.
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 3
  • DOI: 10.1002/cpe.5459

Application health monitoring for extreme‐scale resiliency using cooperative fault management
journal, July 2019

  • Agarwal, Pratul K.; Naughton, Thomas; Park, Byung H.
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 2
  • DOI: 10.1002/cpe.5449