EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Chakraborty, Sourav; Laguna, Ignacio; Emani, Murali; Mohror, Kathryn; Panda, Dhabaleswar K.; Schulz, Martin; Subramoni, Hari

doi:10.1002/cpe.4863

Journal Article · August 14, 2018 · Concurrency and Computation. Practice and Experience

DOI: https://doi.org/10.1002/cpe.4863 · OSTI ID: 1708993

Chakraborty, Sourav ^[1]; Laguna, Ignacio ^[2]; Emani, Murali ^[2]; Mohror, Kathryn ^[2]; Panda, Dhabaleswar K. ^[1]; Schulz, Martin ^[3]; Subramoni, Hari ^[1]

[1] The Ohio State Univ., Columbus, OH (United States)
[2] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
[3] Technical Univ. of Munich (Germany)

Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)

Grant/Contract Number: AC52-07NA27344; CCF-1565414; CNS-1419123; CNS-1513120; ACI-1450440; IIS-1447804

OSTI ID: 1708993

Alternate ID(s): OSTI ID: 1464593

Report Number(s): LLNL-JRNL-706037; 841204

Journal Information: Concurrency and Computation. Practice and Experience, Vol. 32, Issue 3; ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 13 works

Citation information provided by Web of Science

Related Subjects

97 MATHEMATICS AND COMPUTING
fault tolerance
high-performance computing
MPI
resilience