
Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Conference

Efficient utilization of today's high-performance computing (HPC) systems, with their complex software and hardware components, requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems require that the applications themselves be able to handle process failures gracefully. In this paper, we explore the use of user-level failure mitigation (ULFM), a set of fault tolerance extensions to the Message Passing Interface (MPI), for handling process failures without discarding the progress made by the application. We explore two alternative recovery strategies, both of which use ULFM along with application-driven in-memory checkpointing. In the first (shrink), the application is recovered with only the surviving processes; in the second (substitute), spares replace the failed processes so that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation, that is, continuing with only the surviving processes, is a viable alternative for recovery in environments where spares may not be available.
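
The difference between the two recovery strategies can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not the paper's implementation and not MPI or ULFM code. It assumes a hypothetical layout in which each rank owns one block of a global array and the in-memory checkpoint keeps a second copy of every block on the next rank (a "buddy" scheme); the function names `checkpoint`, `restore_block`, `recover_shrink`, and `recover_substitute` are illustrative only.

```python
from typing import Dict, List, Set

Block = List[float]
Checkpoint = Dict[int, Dict[int, Block]]  # holder rank -> {owner rank: copy of block}


def checkpoint(blocks: Dict[int, Block]) -> Checkpoint:
    """Assumed buddy scheme: rank r keeps its own block and its left neighbour's block."""
    size = len(blocks)
    return {
        r: {r: list(blocks[r]), (r - 1) % size: list(blocks[(r - 1) % size])}
        for r in range(size)
    }


def restore_block(owner: int, ckpt: Checkpoint, failed: Set[int]) -> Block:
    """Fetch the owner's block from the owner itself or, if it failed, from its buddy."""
    size = len(ckpt)
    holder = owner if owner not in failed else (owner + 1) % size
    if holder in failed:
        raise RuntimeError(f"block of rank {owner} is lost: owner and buddy both failed")
    return list(ckpt[holder][owner])


def recover_shrink(ckpt: Checkpoint, failed: Set[int]) -> Dict[int, Block]:
    """Shrink: continue with survivors only, renumbered 0..n-1, data repartitioned evenly."""
    size = len(ckpt)
    n = size - len(failed)
    if n <= 0:
        raise RuntimeError("no surviving ranks")
    # Rebuild the global array in original rank order, using buddy copies where needed.
    flat = [x for owner in range(size) for x in restore_block(owner, ckpt, failed)]
    base, extra = divmod(len(flat), n)
    result: Dict[int, Block] = {}
    start = 0
    for new_rank in range(n):
        count = base + (1 if new_rank < extra else 0)
        result[new_rank] = flat[start:start + count]
        start += count
    return result


def recover_substitute(ckpt: Checkpoint, failed: Set[int], spares: int) -> Dict[int, Block]:
    """Substitute: spares take over the failed ranks, so the original layout is restored."""
    if spares < len(failed):
        raise ValueError("not enough spares to replace all failed ranks")
    return {r: restore_block(r, ckpt, failed) for r in range(len(ckpt))}
```

With four ranks holding three values each and rank 2 failing, `recover_shrink` returns three blocks of four values, so each survivor carries more work, while `recover_substitute` returns the original four blocks of three values but requires a spare. This mirrors the trade-off described in the abstract: shrinking degrades gracefully without extra resources, whereas substitution preserves the original configuration.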

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1454399
Country of Publication:
United States
Language:
English

