Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Ashraf, Rizwan; Hukerikar, Saurabh; Engelmann, Christian

doi:10.1109/PDP2018.2018.00032

Conference · Wed Feb 28 23:00:00 EST 2018

ORNL

Efficient use of today's high-performance computing (HPC) systems, with their complex software and hardware components, requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems require the applications themselves to handle process failures gracefully. In this paper, we explore the use of user-level failure mitigation (ULFM), a set of fault tolerance extensions to the Message Passing Interface (MPI), for handling process failures without discarding the progress made by the application. We explore two alternative recovery strategies, both of which use ULFM along with application-driven in-memory checkpointing. In the first (shrink), the application recovers with only the surviving processes; in the second (substitute), spares replace the failed processes so that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
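The difference between the two recovery strategies can be illustrated with a small, MPI-free sketch. It is not the authors' implementation: it only models how the set of ranks and a block-partitioned workload might look after each kind of recovery. The function names (`block_partition`, `shrink`, `substitute`, `recover`), the contiguous block decomposition, and the policy of preferring substitution whenever enough spares exist are all assumptions made for illustration. A real ULFM application would rebuild its communicator with calls such as `MPIX_Comm_revoke` and `MPIX_Comm_shrink` and would restore lost data from its in-memory checkpoints, neither of which is modeled here.

```python
from typing import Dict, List, Optional, Set, Tuple


def block_partition(n_items: int, n_ranks: int) -> List[range]:
    """Split n_items into n_ranks contiguous blocks, as evenly as possible."""
    if n_ranks <= 0:
        raise ValueError("at least one rank is required")
    base, extra = divmod(n_items, n_ranks)
    blocks, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)  # first `extra` ranks get one more item
        blocks.append(range(start, start + size))
        start += size
    return blocks


def shrink(ranks: List[int], failed: Set[int]) -> List[int]:
    """Shrink recovery: continue with the surviving ranks only."""
    return [r for r in ranks if r not in failed]


def substitute(ranks: List[int], failed: Set[int],
               spares: List[int]) -> Optional[List[int]]:
    """Substitute recovery: put a spare in the position of each failed rank.

    Returns None when there are not enough spares (hypothetical convention).
    """
    n_failed = sum(1 for r in ranks if r in failed)
    if len(spares) < n_failed:
        return None
    pool = iter(spares)
    return [next(pool) if r in failed else r for r in ranks]


def recover(ranks: List[int], failed: Set[int], spares: List[int],
            n_items: int) -> Tuple[str, Dict[int, range]]:
    """Assumed policy: substitute if spares suffice, otherwise shrink."""
    new_ranks = substitute(ranks, failed, spares)
    strategy = "substitute"
    if new_ranks is None:
        new_ranks = shrink(ranks, failed)
        strategy = "shrink"
    # Work is redistributed over whichever ranks take part after recovery.
    layout = dict(zip(new_ranks, block_partition(n_items, len(new_ranks))))
    return strategy, layout


if __name__ == "__main__":
    # Rank 2 of 4 fails while 12 work items are distributed.
    print(recover([0, 1, 2, 3], {2}, [], 12))   # no spares available
    print(recover([0, 1, 2, 3], {2}, [4], 12))  # one spare available
```

In the shrink case every survivor takes on a larger block, which is the graceful degradation the abstract refers to; in the substitute case the block sizes are unchanged because the original process count is restored.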

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1454399

Country of Publication: United States

Language: English
