Title: Clover: Compiler directed lightweight soft error resilience

Abstract

This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error was detected. Experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
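
The recovery idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: Clover works at the compiler and hardware level, whereas this Python model only mimics the control flow. All names here (SoftError, dmr, run_idempotent, make_faulty_region) are hypothetical, the "particle strike" is simulated by raising an exception, and the duplicate-and-compare helper is a generic stand-in for tail-DMR rather than the paper's selective duplication policy. The point it shows is that, when a region never overwrites its own inputs, a detected error can be handled by re-entering the region from its start, with no checkpoint to restore.

```python
class SoftError(Exception):
    """Models a detection event (sensor alarm or duplicate-compare mismatch)."""


def dmr(compute, *args):
    """Duplicate-and-compare: run the computation twice, flag any mismatch."""
    first, second = compute(*args), compute(*args)
    if first != second:
        raise SoftError("duplicate results differ")
    return first


def run_idempotent(region, inputs, max_retries=3):
    """Re-enter the region from its start whenever an error is detected.

    Correct only if the region is idempotent: it never overwrites `inputs`,
    so every re-execution starts from the same state without a checkpoint.
    """
    for _ in range(max_retries + 1):
        try:
            return region(inputs)
        except SoftError:
            continue  # recovery = jump back to the region entry
    raise RuntimeError("error persisted after retries")


def make_faulty_region(fail_times):
    """Build a region that reports a detected error on its first `fail_times` runs."""
    state = {"runs": 0}

    def region(values):
        state["runs"] += 1
        total = sum(v * v for v in values)   # reads inputs, never writes them
        if state["runs"] <= fail_times:
            raise SoftError("simulated particle strike")
        return dmr(lambda t: t + 1, total)   # duplicated final step

    return region, state


region, state = make_faulty_region(fail_times=2)
data = (1, 2, 3)
result = run_idempotent(region, data)
print(result, state["runs"])  # 15 3
```

In this sketch the region is executed three times: twice interrupted by a simulated error and once to completion. Because the inputs are never modified, the final result is the same as in an error-free run.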

Authors:
 Liu, Qingrui [1]; Lee, Dongyoon [1]; Jung, Changhee [1]; Tiwari, Devesh [2]
  1. Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
2015-05-01
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1261518
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
ACM SIGPLAN Notices
Additional Journal Information:
Journal Volume: 50; Journal Issue: 5; Conference: 16. ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES 2015), Portland, OR (United States), 18-19 Jun 2015; Journal ID: ISSN 0362-1340
Publisher:
ACM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; soft error resilience; compilers; tail-DMR frontier; idempotent processing; acoustic wave detectors

Citation Formats

Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, and Tiwari, Devesh. Clover: Compiler directed lightweight soft error resilience. United States: N. p., 2015. Web. doi:10.1145/2670529.2754959.
Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, & Tiwari, Devesh. Clover: Compiler directed lightweight soft error resilience. United States. https://doi.org/10.1145/2670529.2754959
Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, and Tiwari, Devesh. 2015. "Clover: Compiler directed lightweight soft error resilience". United States. https://doi.org/10.1145/2670529.2754959. https://www.osti.gov/servlets/purl/1261518.
@article{osti_1261518,
title = {Clover: Compiler directed lightweight soft error resilience},
author = {Liu, Qingrui and Lee, Dongyoon and Jung, Changhee and Tiwari, Devesh},
doi = {10.1145/2670529.2754959},
journal = {ACM SIGPLAN Notices},
number = 5,
volume = 50,
place = {United States},
year = {2015},
month = {May}
}

Citation Metrics:
Cited by: 18 works
Citation information provided by
Web of Science
