OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors

Abstract

The increasing soft error rates in memory systems raise an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) become potentially more frequent and affect HPC applications. Today, upon encountering a DUE, HPC applications crash, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique to continue application execution past a DUE through the repair of the corrupted memory data by leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2 on average. We demonstrate that BonVoision also improves the efficiency of existing checkpointing/restart schemes by increasing the optimal checkpoint interval by approximately 23%.
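
The repair step described in the abstract, interpolating a corrupted element from the values around it, can be illustrated with a minimal sketch. This is not BonVoision's implementation: the abstract says the system works on the binary at run time and intercepts DUE events, none of which is modeled here. The function name `repair_element`, the 2D list layout, and the choice of a plain 4-neighbor average are all assumptions made for illustration.

```python
def repair_element(grid, row, col):
    """Overwrite grid[row][col] with the mean of its in-bounds 4-neighbors.

    Hypothetical helper: a plain average is assumed here; the paper's
    actual interpolation scheme may differ.
    """
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        # Skip positions that fall outside the grid (edges and corners).
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbors.append(grid[r][c])
    if not neighbors:
        raise ValueError("no neighbors available for interpolation")
    grid[row][col] = sum(neighbors) / len(neighbors)
    return grid[row][col]


# A spatially smooth field: value = row + col.
field = [[float(r + c) for c in range(4)] for r in range(4)]
field[1][2] = float("nan")  # stand-in for a corrupted data element
repaired = repair_element(field, 1, 2)
```

On this linear field the average of the four neighbors (2, 4, 2, 4) recovers the original value 3.0 exactly. On data that is smooth but not linear, the repaired value would only approximate the original, which is why the approach depends on spatial smoothness.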

Authors:
 Fang, Bo [1]; Halawa, Hassan [2]; Pattabiram, Karthik [2]; Ripeanu, Matei [2]; Krishnamoorthy, Sriram [1]
  1. BATTELLE (PACIFIC NW LAB)
  2. University of British Columbia
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1574892
Report Number(s):
PNNL-SA-143140
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), June 26-28, 2019, Phoenix, AZ
Country of Publication:
United States
Language:
English

Citation Formats

Fang, Bo, Halawa, Hassan, Pattabiram, Karthik, Ripeanu, Matei, and Krishnamoorthy, Sriram. BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors. United States: N. p., 2019. Web. doi:10.1145/3330345.3330388.
Fang, Bo, Halawa, Hassan, Pattabiram, Karthik, Ripeanu, Matei, & Krishnamoorthy, Sriram. BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors. United States. doi:10.1145/3330345.3330388.
Fang, Bo, Halawa, Hassan, Pattabiram, Karthik, Ripeanu, Matei, and Krishnamoorthy, Sriram. 2019. "BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors". United States. doi:10.1145/3330345.3330388.
@article{osti_1574892,
title = {BonVoision: Leveraging Spatial Data Smoothness For Recovery From Memory Soft Errors},
author = {Fang, Bo and Halawa, Hassan and Pattabiram, Karthik and Ripeanu, Matei and Krishnamoorthy, Sriram},
abstractNote = {The increasing soft error rates in memory systems raise an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) become potentially more frequent and affect HPC applications. Today, upon encountering a DUE, HPC applications crash, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique to continue application execution past a DUE through the repair of the corrupted memory data by leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2 on average. We demonstrate that BonVoision also improves the efficiency of existing checkpointing/restart schemes by increasing the optimal checkpoint interval by approximately 23%.},
doi = {10.1145/3330345.3330388},
place = {United States},
year = {2019},
month = {6}
}
