BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors
- BATTELLE (PACIFIC NW LAB)
- University of British Columbia
Rising soft error rates in memory systems are an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) are likely to become more frequent and to affect HPC applications. Today, an HPC application crashes when it encounters a DUE, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique that continues application execution past a DUE by repairing the corrupted memory data, leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2, on average. We also demonstrate that BonVoision improves the efficiency of existing checkpoint/restart schemes by increasing the optimal checkpoint interval by approximately 23%.
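The repair idea in the abstract, replacing a corrupted element with a value interpolated from its neighbors, can be illustrated with a minimal sketch. This is not BonVoision's implementation, which the abstract describes as working at the binary level on intercepted DUE events. The function name `repair_by_neighbors`, the 2D list-of-lists layout, and the choice of a plain four-neighbor average are all assumptions made for illustration.

```python
def repair_by_neighbors(grid, row, col):
    """Return a copy of `grid` in which the element at (row, col) is
    replaced by the mean of its axis-aligned neighbors.

    Hypothetical helper: assumes a rectangular list-of-lists and that only
    the element at (row, col) is corrupted.
    """
    rows, cols = len(grid), len(grid[0])
    # Collect the up/down/left/right neighbors that lie inside the grid.
    neighbors = [
        grid[r][c]
        for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))
        if 0 <= r < rows and 0 <= c < cols
    ]
    if not neighbors:
        raise ValueError("element has no neighbors to interpolate from")
    repaired = [list(line) for line in grid]  # leave the input untouched
    repaired[row][col] = sum(neighbors) / len(neighbors)
    return repaired


# A spatially smooth field f(r, c) = r + c with one corrupted element.
field = [[float(r + c) for c in range(5)] for r in range(5)]
field[2][2] = 1e30  # stand-in for data hit by an uncorrectable error
fixed = repair_by_neighbors(field, 2, 2)
print(fixed[2][2])  # prints 4.0, the original value of this linear field
```

For a linear field the neighbor average recovers the lost value exactly. For data that is smooth but not linear, the repaired value is only an approximation, which is why this kind of recovery depends on the spatial smoothness the abstract refers to.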
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1574892
- Report Number(s): PNNL-SA-143140
- Resource Relation: Conference: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), June 26-28, 2019, Phoenix, AZ
- Country of Publication: United States
- Language: English