EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
Abstract
Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms such as failure detection, notification, and recovery between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
- Authors:
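- Chakraborty, Sourav; Laguna, Ignacio; Emani, Murali; Mohror, Kathryn; Panda, Dhabaleswar K.; Schulz, Martin; Subramoni, Hari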
- The Ohio State Univ., Columbus, OH (United States)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Technical Univ. of Munich (Germany)
- Publication Date:
- 2018-08-14
- Research Org.:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
- OSTI Identifier:
- 1708993
- Alternate Identifier(s):
- OSTI ID: 1464593
- Report Number(s):
- LLNL-JRNL-706037; Journal ID: ISSN 1532-0626; 841204
- Grant/Contract Number:
- AC52-07NA27344; CCF-1565414; CNS-1419123; CNS-1513120; ACI-1450440; IIS-1447804
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Concurrency and Computation. Practice and Experience
- Additional Journal Information:
- Journal Volume: 32; Journal Issue: 3; Journal ID: ISSN 1532-0626
- Publisher:
- Wiley
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; fault tolerance; high-performance computing; MPI; resilience
Citation Formats
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States: N. p., 2018. Web. doi:10.1002/cpe.4863.
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, & Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States. https://doi.org/10.1002/cpe.4863
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. 2018. "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications". United States. https://doi.org/10.1002/cpe.4863. https://www.osti.gov/servlets/purl/1708993.
@article{osti_1708993,
title = {EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications},
author = {Chakraborty, Sourav and Laguna, Ignacio and Emani, Murali and Mohror, Kathryn and Panda, Dhabaleswar K. and Schulz, Martin and Subramoni, Hari},
abstractNote = {Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms such as failure detection, notification, and recovery between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.},
doi = {10.1002/cpe.4863},
journal = {Concurrency and Computation. Practice and Experience},
number = 3,
volume = 32,
place = {United States},
year = {2018},
month = {8}
}