Title: Clover: Compiler directed lightweight soft error resilience

Journal Article · SIGPLAN
 [1];  [1];  [1];  [2]
  1. Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or tail-DMR, Clover handles it in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region in which the error was detected. Experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
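
The recovery idea in the abstract can be illustrated with a small software analogy. The sketch below is not Clover's implementation, which works at the compiler and hardware level; it only mimics two ideas in plain Python: an idempotent region can be restarted from its entry because its inputs are never overwritten, and a duplicated "tail" computation is compared against itself to catch an error that the sensors would report too late. All names here (`SoftErrorDetected`, `run_region`, `tail_dmr`, `make_faulty_scale`, `make_region`) are hypothetical, and the injected bit flip is an assumption made purely for demonstration.

```python
class SoftErrorDetected(Exception):
    """Hypothetical signal standing in for a sensor or tail-DMR detection."""


def run_region(region, live_in, max_retries=3):
    """Run an idempotent region; on a detected error, restart from its entry.

    The region never overwrites its live-in values, so re-execution needs
    no checkpoint: the inputs are still intact when the region restarts.
    """
    for _ in range(max_retries + 1):
        try:
            return region(live_in)
        except SoftErrorDetected:
            continue  # redirect control to the beginning of the region
    raise RuntimeError("region failed on every attempt")


def tail_dmr(compute, *args):
    """Execute the tail computation twice and compare the two results."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise SoftErrorDetected("tail-DMR mismatch")
    return first


def make_faulty_scale(faults):
    """Return a doubling function whose first `faults` calls are corrupted."""
    state = {"remaining": faults}

    def scale(x):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return (x * 2) ^ 1  # flip the lowest bit to mimic a particle strike
        return x * 2

    return scale


def make_region(scale):
    """Build a region whose final step (the tail) is protected by duplication."""

    def region(live_in):
        total = sum(live_in)            # head: sensor-covered in the analogy
        return tail_dmr(scale, total)   # tail: duplicated and compared

    return region


data = (1, 2, 3)  # immutable live-in values, never modified by the region
result = run_region(make_region(make_faulty_scale(1)), data)
print(result)
```

In this run the first attempt sees one corrupted tail result, the comparison fails, and the region is simply re-executed from its start with the untouched inputs. Note that plain duplication only detects an error when the two executions disagree; two identically corrupted results would pass the comparison.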

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1261518
Journal Information:
SIGPLAN, Vol. 50, Issue 5; Conference: 16. ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES 2015), Portland, OR (United States), 18-19 Jun 2015; ISSN 0362-1340
Publisher:
ACM
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 18 works
Citation information provided by
Web of Science

Cited By (1)

Compiler Directed Speculative Intermittent Computation preprint January 2020
