Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms
- ORNL
In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate both soft errors and process failures while maintaining forward progress. We address this challenge by proposing performance models that help design performance-efficient, resilient iterative applications. The models account for the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application and two distinct kinds of soft error detectors: one with high overhead and high accuracy, the other with low overhead and low accuracy. We show how both can be leveraged to verify the integrity of the checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of employing the low-overhead detector at high frequency within the checkpoint interval, so that soft error recovery takes place promptly and less work is re-computed.
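The closing claim of the abstract, that frequent low-accuracy checks inside a checkpoint interval reduce re-computed work, can be illustrated with a toy calculation. The sketch below is not the performance model proposed in the paper; it is a minimal illustration under stated assumptions. The function name `expected_rework` and its parameters `interval`, `check_every`, and `recall` are hypothetical. It assumes a single soft error at a uniformly random iteration, rollback to the checkpoint at the start of the interval, and a perfect verification when the next checkpoint is taken. Detector overhead is not modeled.

```python
def expected_rework(interval, check_every, recall):
    """Expected iterations re-computed after one soft error in a checkpoint interval.

    Toy assumptions (not from the paper): the error corrupts state at an
    iteration drawn uniformly from 1..interval; a lightweight detector runs
    every `check_every` iterations and catches an existing error with
    probability `recall`; the verification at the checkpoint (iteration
    `interval`) always catches it; recovery rolls back to the start of the
    interval, so every iteration run since then is re-computed.
    """
    total = 0.0
    for error_iter in range(1, interval + 1):
        p_undetected = 1.0
        # First lightweight check at or after the error (ceiling to a multiple).
        first_check = -(-error_iter // check_every) * check_every
        for check_iter in range(first_check, interval, check_every):
            total += p_undetected * recall * check_iter
            p_undetected *= 1.0 - recall
        # Anything still undetected is caught when the checkpoint is verified.
        total += p_undetected * interval
    return total / interval


if __name__ == "__main__":
    # Compare check frequencies for a 100-iteration interval and a 60% recall detector.
    for every in (100, 50, 10, 1):
        print(every, round(expected_rework(100, every, 0.6), 2))
```

Under these assumptions, checking only at the checkpoint wastes the whole interval whenever an error occurs, while more frequent checks shorten the expected rollback distance even when each individual check is unreliable. A fuller comparison would also charge each check its overhead, which this sketch omits.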
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1493144
- Resource Relation:
- Journal Volume: 11339; Conference: 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Turin, Italy, 8/27/2018-8/31/2018
- Country of Publication:
- United States
- Language:
- English