Clover: Compiler directed lightweight soft error resilience
Abstract
This paper presents Clover, a compiler directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft error tolerant code based on idempotent processing without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUE (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or the tail-DMR, Clover handles the error in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. Lastly, the experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
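The recovery idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's compiler or hardware: `SoftErrorDetected`, `run_region`, `tail_dmr`, and `make_flaky_double` are hypothetical names, the "sensor" is modeled as an exception, and tail-DMR is modeled as running a computation twice and comparing the results. The region never overwrites its inputs, which is what makes restarting it from the beginning safe.

```python
class SoftErrorDetected(Exception):
    """Hypothetical signal standing in for a sensor or tail-DMR detection."""


def run_region(region, live_in, max_retries=3):
    """Re-execute an idempotent region from its start until it finishes cleanly.

    Returns (result, number_of_re_executions).
    """
    for attempt in range(max_retries + 1):
        try:
            return region(live_in), attempt
        except SoftErrorDetected:
            # live_in was never overwritten, so restarting needs no checkpoint.
            continue
    raise RuntimeError("region failed on every attempt")


def tail_dmr(compute, *args):
    """Model of duplicated execution: run twice and compare the results."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise SoftErrorDetected("duplicate results disagree")
    return first


def make_flaky_double(flips):
    """Build a doubling function whose first `flips` calls suffer a bit flip."""
    state = {"remaining": flips}

    def double(x):
        result = 2 * x
        if state["remaining"] > 0:
            state["remaining"] -= 1
            result ^= 1  # injected single-bit error in the lowest bit
        return result

    return double


def make_region(compute):
    """Wrap a computation as a region that only reads its live-in values."""

    def region(live_in):
        return tail_dmr(compute, live_in[0])

    return region


if __name__ == "__main__":
    region = make_region(make_flaky_double(flips=1))
    result, retries = run_region(region, (21,))
    print(result, retries)
```

In this model the first execution is corrupted, the comparison disagrees, and the region is simply restarted with the same untouched inputs. Duplicating the whole computation is a simplification; the abstract describes the duplication as selective.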
- Authors:
- Liu, Qingrui; Lee, Dongyoon; Jung, Changhee; Tiwari, Devesh
- Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Publication Date:
- May 1, 2015
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1261518
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- SIGPLAN
- Additional Journal Information:
- Journal Volume: 50; Journal Issue: 5; Conference: 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES 2015), Portland, OR (United States), 18-19 Jun 2015; Journal ID: ISSN 0362-1340
- Publisher:
- ACM
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; soft error resilience; compilers; tail-DMR frontier; idempotent processing; acoustic wave detectors
Citation Formats
Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, and Tiwari, Devesh. Clover: Compiler directed lightweight soft error resilience. United States: N. p., 2015. Web. doi:10.1145/2670529.2754959.
Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, & Tiwari, Devesh. Clover: Compiler directed lightweight soft error resilience. United States. https://doi.org/10.1145/2670529.2754959
Liu, Qingrui, Lee, Dongyoon, Jung, Changhee, and Tiwari, Devesh. 2015. "Clover: Compiler directed lightweight soft error resilience". United States. https://doi.org/10.1145/2670529.2754959. https://www.osti.gov/servlets/purl/1261518.
@article{osti_1261518,
title = {Clover: Compiler directed lightweight soft error resilience},
author = {Liu, Qingrui and Lee, Dongyoon and Jung, Changhee and Tiwari, Devesh},
abstractNote = {This paper presents Clover, a compiler directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler carefully generates soft error tolerant code based on idempotent processing without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave made by a particle strike. To cope with DUE (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or the tail-DMR, Clover handles the error in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region where the error is detected. Lastly, the experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique.},
doi = {10.1145/2670529.2754959},
journal = {SIGPLAN},
number = 5,
volume = 50,
place = {United States},
year = {2015},
month = {5}
}
Works referenced in this record:
Static analysis and compiler design for idempotent processing
conference, January 2012
- de Kruijf, Marc A.; Sankaralingam, Karthikeyan; Jha, Somesh
- Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation - PLDI '12
SWIFT: Software Implemented Fault Tolerance
conference, January 2005
- Reis, G. A.; Chang, J.; Vachharajani, N.
- International Symposium on Code Generation and Optimization
Is dark silicon useful?: harnessing the four horsemen of the coming dark silicon apocalypse
conference, January 2012
- Taylor, Michael B.
- Proceedings of the 49th Annual Design Automation Conference on - DAC '12
The Use of Triple-Modular Redundancy to Improve Computer Reliability
journal, April 1962
- Lyons, R. E.; Vanderkulk, W.
- IBM Journal of Research and Development, Vol. 6, Issue 2
Reliable on-chip systems in the nano-era: lessons learnt and future trends
conference, January 2013
- Henkel, Jörg; Bauer, Lars; Dutt, Nikil
- Proceedings of the 50th Annual Design Automation Conference on - DAC '13
Trends and challenges in VLSI circuit reliability
journal, July 2003
- Constantinescu, C.
- IEEE Micro, Vol. 23, Issue 4
Perturbation-based Fault Screening
conference, February 2007
- Racunas, Paul; Constantinides, Kypros; Manne, Srilatha
- 2007 IEEE 13th International Symposium on High Performance Computer Architecture
An Experimental Study of Soft Errors in Microprocessors
journal, November 2005
- Saggese, G. P.; Wang, N. J.; Kalbarczyk, Z. T.
- IEEE Micro, Vol. 25, Issue 6
Encore: low-cost, fine-grained transient fault recovery
conference, January 2011
- Feng, Shuguang; Gupta, Shantanu; Ansari, Amin
- Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11
Automatic Instruction-Level Software-Only Recovery
journal, January 2007
- Reis, George A.; Chang, Jonathan; August, David I.
- IEEE Micro, Vol. 27, Issue 1
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
conference, December 2007
- Meixner, Albert; Bauer, Michael E.; Sorin, Daniel
- 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)
Shoestring: probabilistic soft error reliability on the cheap
journal, March 2010
- Feng, Shuguang; Gupta, Shantanu; Ansari, Amin
- ACM SIGARCH Computer Architecture News, Vol. 38, Issue 1
UnSync-CMP: Multicore CMP Architecture for Energy-Efficient Soft-Error Reliability
journal, January 2014
- Jeyapaul, Reiley; Rhisheekesan, Abhishek
- IEEE Transactions on Parallel and Distributed Systems, Vol. 25, Issue 1
The gem5 simulator
journal, August 2011
- Binkert, Nathan; Sardashti, Somayeh; Sen, Rathijit
- ACM SIGARCH Computer Architecture News, Vol. 39, Issue 2
End-to-end register data-flow continuous self-test
conference, January 2009
- Carretero, Javier; Chaparro, Pedro; Vera, Xavier
- Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09
Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery
journal, October 2014
- Upasani, Gaurang; Vera, Xavier; González, Antonio
- ACM SIGARCH Computer Architecture News, Vol. 42, Issue 3
Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing
conference, January 2013
- Chen, Hao; Yang, Chengmo
- Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems - LCTES '13
Assuring application-level correctness against soft errors
conference, November 2011
- Cong, Jason; Gururaj, Karthik
- 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
Implications of the Power Wall: Dim Cores and Reconfigurable Logic
journal, September 2013
- Wang, Liang; Skadron, Kevin
- IEEE Micro, Vol. 33, Issue 5
Idempotent code generation: Implementation, analysis, and evaluation
conference, February 2013
- de Kruijf, M.; Sankaralingam, K.
- Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
conference, May 2010
- Haque, Imran S.; Pande, Vijay S.
- 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Harnessing Soft Computations for Low-Budget Fault Tolerance
conference, December 2014
- Khudia, Daya Shanker; Mahlke, Scott
- 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Framework for economical error recovery in embedded cores
conference, July 2014
- Upasani, Gaurang; Vera, Xavier; Gonzalez, Antonio
- 2014 IEEE 20th International On-Line Testing Symposium (IOLTS)
Design and Evaluation of Hybrid Fault-Detection Systems
journal, May 2005
- Reis, George A.; Chang, Jonathan; Vachharajani, Neil
- ACM SIGARCH Computer Architecture News, Vol. 33, Issue 2
Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory
conference, June 2014
- Luo, Yixin; Govindan, Sriram; Sharma, Bikash
- 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
The EDA Challenges in the Dark Silicon Era: Temperature, Reliability, and Variability Perspectives
conference, January 2014
- Shafique, Muhammad; Garg, Siddharth; Henkel, Jörg
- Proceedings of the 51st Annual Design Automation Conference - DAC '14
Low cost control flow protection using abstract control signatures
conference, January 2013
- Khudia, Daya Shanker; Mahlke, Scott
- Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems - LCTES '13
Near-threshold voltage (NTV) design: opportunities and challenges
conference, January 2012
- Kaul, Himanshu; Anders, Mark; Hsu, Steven
- Proceedings of the 49th Annual Design Automation Conference on - DAC '12
The Soft Error Problem: An Architectural Perspective
conference, January 2005
- Mukherjee, S. S.; Emer, J.; Reinhardt, S. K.
- 11th International Symposium on High-Performance Computer Architecture
Design and Evaluation of Hybrid Fault-Detection Systems
conference, June 2005
- Reis, G. A.; Chang, J.; Vachharajani, N.
- 32nd International Symposium on Computer Architecture (ISCA'05)
ReStore: Symptom-Based Soft Error Detection in Microprocessors
journal, July 2006
- Wang, N. J.; Patel, S. J.
- IEEE Transactions on Dependable and Secure Computing, Vol. 3, Issue 3
Works referencing / citing this record:
Compiler Directed Speculative Intermittent Computation
preprint, January 2020
- Choi, Jongouk; Liu, Qingrui; Jung, Changhee
- arXiv