OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors

Conference

Rising soft error rates in memory systems are an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) are likely to become more frequent and to affect HPC applications. Today, an HPC application crashes when it encounters a DUE, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique that continues application execution past a DUE by repairing the corrupted memory data, leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2, on average. We also demonstrate that BonVoision improves the efficiency of existing checkpoint/restart schemes by increasing the optimal checkpoint interval by approximately 23%.
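
The repair-by-interpolation idea in the abstract can be illustrated with a minimal sketch. This is not BonVoision's implementation: the abstract says the system intercepts DUE events and analyzes the binary to locate neighboring elements, whereas the sketch below assumes the corrupted element's position in a 2-D grid is already known and simply averages its four nearest neighbors. The function name `repair_element` and the sample `field` are hypothetical and chosen for illustration only.

```python
def repair_element(grid, row, col):
    """Overwrite grid[row][col] with the mean of its in-bounds 4-neighbors.

    Assumption (not from the paper): the data is a 2-D list of floats and
    the corrupted index (row, col) is already known.
    """
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        # Skip neighbors that fall outside the grid (edges and corners).
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbors.append(grid[r][c])
    if not neighbors:
        raise ValueError("no neighbors available for interpolation")
    grid[row][col] = sum(neighbors) / len(neighbors)
    return grid[row][col]


# A spatially smooth field whose center element has been corrupted.
field = [
    [1.0, 2.0, 3.0],
    [2.0, 1e30, 4.0],
    [3.0, 4.0, 5.0],
]
repaired = repair_element(field, 1, 1)
print(repaired)
```

The sketch only works well when the data really is smooth around the corrupted element; for data with sharp discontinuities, a neighbor average could be far from the lost value.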

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1574892
Report Number(s):
PNNL-SA-143140
Resource Relation:
Conference: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), June 26-28, 2019, Phoenix, AZ
Country of Publication:
United States
Language:
English
