OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR

Abstract

This article presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft-error-tolerant code based on idempotent processing without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy) that provides region-level error containment. Once a soft error is detected by either the sensors or the tail-DMR, Clover handles the error in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. The experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique. In addition, this article evaluates an alternative technique called tail-wait, comparing it to Clover. According to the evaluation with different processor configurations and various error detection latencies, Clover turns out to be the superior technique, achieving a 1.06× to 3.49× speedup over tail-wait.
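
The recovery idea in the abstract (restart an idempotent region from its entry, and duplicate only the tail of the region that the sensors cannot cover in time) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' compiler or hardware: the names SoftError, FlakyAdder, tail_dmr, make_region, and run_region are hypothetical, the acoustic sensor is not modeled, and duplication is done at the level of a function call rather than individual instructions.

```python
class SoftError(Exception):
    """Signals a detected soft error (hypothetical name for this sketch)."""


class FlakyAdder:
    """Toy ALU that flips one result bit on chosen calls to mimic a particle strike."""

    def __init__(self, faulty_calls):
        self.calls = 0
        self.faulty_calls = set(faulty_calls)

    def add(self, a, b):
        self.calls += 1
        result = a + b
        if self.calls in self.faulty_calls:
            result ^= 1 << 3  # simulated single-bit flip
        return result


def tail_dmr(compute, *args):
    """Run the tail computation twice and compare (dual modular redundancy)."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise SoftError("tail-DMR mismatch")
    return first


def make_region(alu):
    def region(inputs):
        # Head of the region: assumed to be covered by the sensor, which is
        # not modeled here, so this part runs without duplication.
        head = sum(inputs[:-1])
        # Tail of the region: duplicated and compared before the region ends.
        return tail_dmr(alu.add, head, inputs[-1])

    return region


def run_region(region, inputs, max_retries=3):
    """Re-execute the region from its entry whenever an error is detected.

    Restarting is safe only because the region never overwrites its inputs
    (they are passed as an immutable tuple), which is the idempotence assumption.
    """
    for _ in range(max_retries + 1):
        try:
            return region(inputs)
        except SoftError:
            continue
    raise RuntimeError("region did not complete within the retry budget")


alu = FlakyAdder(faulty_calls={1})  # corrupt only the first addition
print(run_region(make_region(alu), (1, 2, 3, 4)))  # prints 10
```

In this run the first attempt fails the tail comparison, and the second attempt completes cleanly from the region entry with no checkpoint restored.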

Authors:
Liu, Qingrui [1]; Jung, Changhee [1]; Lee, Dongyoon [1]; Tiwari, Devesh [2]
  1. Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
December 2016
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1565622
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article
Journal Name:
ACM Transactions on Embedded Computing Systems
Additional Journal Information:
Journal Volume: 16; Journal Issue: 2; Journal ID: ISSN 1539-9087
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
Subject:
Computer Science

Citation Formats

Liu, Qingrui, Jung, Changhee, Lee, Dongyoon, and Tiwari, Devesh. Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR. United States: N. p., 2016. Web. doi:10.1145/2930667.
Liu, Qingrui, Jung, Changhee, Lee, Dongyoon, & Tiwari, Devesh. Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR. United States. doi:10.1145/2930667.
Liu, Qingrui, Jung, Changhee, Lee, Dongyoon, and Tiwari, Devesh. 2016. "Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR". United States. doi:10.1145/2930667.
@article{osti_1565622,
title = {Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR},
author = {Liu, Qingrui and Jung, Changhee and Lee, Dongyoon and Tiwari, Devesh},
abstractNote = {This article presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft-error-tolerant code based on idempotent processing without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy) that provides a region-level error containment. Once a soft error is detected by either the sensors or the tail-DMR, Clover takes care of the error as in the case of exception handling. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. The experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique. In addition, this article evaluates an alternative technique called tail-wait, comparing it to Clover. According to the evaluation with the different processor configurations and the various error detection latencies, Clover turns out to be a superior technique, achieving 1.06 to 3.49× speedup over the tail-wait.},
doi = {10.1145/2930667},
journal = {ACM Transactions on Embedded Computing Systems},
issn = {1539-9087},
number = 2,
volume = 16,
place = {United States},
year = {2016},
month = {12}
}