Title: BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors

Abstract

The increasing soft error rates in memory systems raise an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) become potentially more frequent and affect HPC applications. Today, upon encountering a DUE, HPC applications crash, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique to continue application execution past a DUE through the repair of the corrupted memory data by leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2×, on average. We demonstrate that BonVoision also improves the efficiency of existing checkpointing/restart schemes by increasing the optimal checkpoint interval by approximately 23%.
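
The repair step described in the abstract, interpolating a corrupted element from the values in its neighborhood, can be illustrated with a minimal sketch. This is not BonVoision's implementation: the function name `repair_element`, the 2D list layout, and the choice of a plain 4-neighbor average are all assumptions made for illustration, and the actual system works on the application binary and memory at run time rather than on a Python list.

```python
from statistics import mean


def repair_element(grid, row, col):
    """Overwrite grid[row][col] with the mean of its in-bounds 4-neighbors.

    Hypothetical helper: assumes the data is spatially smooth, so the
    neighbors are a reasonable estimate of the lost value.
    """
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        # Skip positions that fall outside the grid (edges and corners).
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbors.append(grid[r][c])
    if not neighbors:
        raise ValueError("no neighbors available to interpolate from")
    grid[row][col] = mean(neighbors)
    return grid[row][col]


# A smooth 3x3 field whose center value has been corrupted.
field = [
    [1.0, 2.0, 3.0],
    [2.0, 1e30, 4.0],
    [3.0, 4.0, 5.0],
]
repaired = repair_element(field, 1, 1)
print(repaired)
```

On this linear field the average of the four neighbors (2.0, 4.0, 2.0, 4.0) recovers the original center value exactly; on real data the repaired value is only an approximation whose quality depends on how smooth the data is.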

Authors:
 Fang, Bo [1]; Halawa, Hassan [2]; Pattabiraman, Karthik [2]; Ripeanu, Matei [2]; Krishnamoorthy, Sriram [1]
  1. BATTELLE (PACIFIC NW LAB)
  2. University of British Columbia
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1574892
Report Number(s):
PNNL-SA-143140
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), June 26-28, 2019, Phoenix, AZ
Country of Publication:
United States
Language:
English

Citation Formats

Fang, Bo, Halawa, Hassan, Pattabiraman, Karthik, Ripeanu, Matei, and Krishnamoorthy, Sriram. BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors. United States: N. p., 2019. Web. doi:10.1145/3330345.3330388.
Fang, Bo, Halawa, Hassan, Pattabiraman, Karthik, Ripeanu, Matei, & Krishnamoorthy, Sriram. BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors. United States. https://doi.org/10.1145/3330345.3330388
Fang, Bo, Halawa, Hassan, Pattabiraman, Karthik, Ripeanu, Matei, and Krishnamoorthy, Sriram. 2019. "BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors". United States. https://doi.org/10.1145/3330345.3330388.
@article{osti_1574892,
title = {BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors},
author = {Fang, Bo and Halawa, Hassan and Pattabiraman, Karthik and Ripeanu, Matei and Krishnamoorthy, Sriram},
abstractNote = {The increasing soft error rates in memory systems raise an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) become potentially more frequent and affect HPC applications. Today, upon encountering a DUE, HPC applications crash, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique to continue application execution past a DUE through the repair of the corrupted memory data by leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2×, on average. We demonstrate that BonVoision also improves the efficiency of existing checkpointing/restart schemes by increasing the optimal checkpoint interval by approximately 23%.},
doi = {10.1145/3330345.3330388},
url = {https://www.osti.gov/biblio/1574892},
place = {United States},
year = {2019},
month = {6}
}
