Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery
- ORNL
Efficient utilization of today's high-performance computing (HPC) systems, with their complex software and hardware components, requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems require applications that can gracefully handle process failures themselves. In this paper, we explore the use of user-level failure mitigation (ULFM), a set of fault tolerance extensions to the Message Passing Interface (MPI), for handling process failures without discarding the progress made by the application. We explore two alternative recovery strategies, both of which use ULFM along with application-driven in-memory checkpointing. In the first strategy (shrink), the application is recovered with only the surviving processes; in the second (substitute), spares replace the failed processes so that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
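The difference between the two recovery strategies can be illustrated with a toy model of how work is laid out across processes after a failure. The sketch below is not ULFM code and is not taken from the paper: the block distribution, the function names (`block_partition`, `recover_shrink`, `recover_substitute`), and the use of plain integer process ids are all assumptions made for illustration. It shows only the bookkeeping: shrink redistributes all items over the survivors, while substitute hands the failed process's block to a spare and leaves every other block untouched.

```python
def block_partition(n_items, procs):
    """Split item indices 0..n_items-1 into contiguous blocks, one per process."""
    base, extra = divmod(n_items, len(procs))
    layout, start = {}, 0
    for i, p in enumerate(procs):
        size = base + (1 if i < extra else 0)  # first `extra` processes get one more item
        layout[p] = list(range(start, start + size))
        start += size
    return layout


def recover_shrink(n_items, procs, failed):
    """Shrink: continue with survivors only, re-partitioning all items among them."""
    survivors = [p for p in procs if p not in failed]
    if not survivors:
        raise RuntimeError("no surviving processes")
    return block_partition(n_items, survivors)


def recover_substitute(layout, failed, spares):
    """Substitute: each failed process's block is taken over by a spare (hypothetical ids)."""
    failed = sorted(failed)
    if len(spares) < len(failed):
        raise RuntimeError("not enough spares; shrink is the remaining option")
    new_layout = {p: items for p, items in layout.items() if p not in failed}
    for f, s in zip(failed, spares):
        new_layout[s] = layout[f]  # same block, new owner; other blocks are unchanged
    return new_layout


if __name__ == "__main__":
    procs = [0, 1, 2, 3]
    layout = block_partition(10, procs)
    print("original:  ", layout)
    print("shrink:    ", recover_shrink(10, procs, failed={2}))
    print("substitute:", recover_substitute(layout, failed={2}, spares=[4]))
```

In this model, shrink changes the block owned by most survivors (so checkpointed data would have to be redistributed), whereas substitute changes only the owner of the lost block but depends on a spare being available.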
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1454399
- Country of Publication:
- United States
- Language:
- English