Title: Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR

Journal Article · ACM Transactions on Embedded Computing Systems
DOI: https://doi.org/10.1145/2930667 · OSTI ID: 1565622
 [1];  [1];  [1];  [2]
  1. Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

This article presents Clover, a compiler-directed scheme for lightweight soft error detection and recovery. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave produced by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover uses a novel selective instruction duplication technique called tail-DMR (dual modular redundancy), which provides region-level error containment. Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception: to recover, it redirects program control to the beginning of the code region in which the error was detected. Experimental results show an average runtime overhead of only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique. The article also evaluates an alternative technique called tail-wait and compares it to Clover. Across different processor configurations and error detection latencies, Clover proves superior, achieving a 1.06× to 3.49× speedup over tail-wait.
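
The recovery idea in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it assumes that a region's "tail" is the last DETECT_LATENCY instructions (those whose sensor signal could arrive only after the region has ended), that only tail instructions are duplicated and compared, and that a fault flips one bit of a single result. All names here (DETECT_LATENCY, SoftErrorDetected, execute_once, run_region, OPS) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

DETECT_LATENCY = 2  # assumed worst-case sensor latency, counted in instructions


class SoftErrorDetected(Exception):
    """Raised when the (simulated) sensor or the tail-DMR check fires."""


def execute_once(ops: List[Callable[[int], int]], x: int,
                 fault_at: Optional[int]) -> int:
    """Run one attempt of a region; the input x is only read, never overwritten."""
    tail_start = max(0, len(ops) - DETECT_LATENCY)
    value = x
    sensor_fires_at = None  # instruction index at which the sensor reports
    for i, op in enumerate(ops):
        # Head faults: the delayed sensor signal still arrives inside the region.
        if sensor_fires_at is not None and i >= sensor_fires_at:
            raise SoftErrorDetected("sensor")
        result = op(value)
        if i == fault_at:
            result ^= 1  # simulated particle strike: flip one bit
            sensor_fires_at = i + DETECT_LATENCY
        # Tail faults: the sensor would be too late, so duplicate and compare.
        if i >= tail_start and result != op(value):
            raise SoftErrorDetected("tail-DMR")
        value = result
    return value


def run_region(ops: List[Callable[[int], int]], x: int,
               fault_at: Optional[int] = None) -> Tuple[int, List[str]]:
    """Re-execute the region from its start until an attempt finishes cleanly."""
    detections: List[str] = []
    while True:
        # The fault is transient, so it is injected on the first attempt only.
        inject = fault_at if not detections else None
        try:
            return execute_once(ops, x, inject), detections
        except SoftErrorDetected as err:
            detections.append(str(err))  # recovery: jump back to region start


OPS = [lambda v: v + 3, lambda v: v * 2, lambda v: v - 1, lambda v: v + 10]

if __name__ == "__main__":
    for fault in (None, 0, 3):
        print(fault, run_region(OPS, 5, fault_at=fault))
```

Because the region never overwrites its input, restarting from the top is enough to recover, and no checkpoint is needed. In this sketch a fault in the head is caught by the delayed sensor signal, while a fault in the tail is caught by the duplicate-and-compare check before control leaves the region; in both cases the second attempt produces the fault-free result.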

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)
Sponsoring Organization: USDOE Office of Science (SC)
DOE Contract Number: AC05-00OR22725
OSTI ID: 1565622
Journal Information: ACM Transactions on Embedded Computing Systems, Vol. 16, Issue 2; ISSN 1539-9087
Publisher: Association for Computing Machinery (ACM)
Country of Publication: United States
Language: English

