Clover: Compiler directed lightweight soft error resilience
- Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
This paper presents Clover, a compiler directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft error tolerant code based on idempotent processing, without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave produced by a particle strike. To cope with DUE (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or tail-DMR, Clover handles it in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region in which the error was detected. Experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
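The recovery idea in the abstract can be illustrated with a minimal sketch. It is not Clover's implementation: Clover works at the compiler and hardware level, whereas this is a plain Python simulation. The names `SoftErrorDetected`, `tail_dmr`, `region`, and `run_with_recovery` are hypothetical, the bit flip is injected by hand, and the reading of tail-DMR as "duplicate the last computation of a region and compare the two copies" is an assumption drawn from the abstract's wording. The key property shown is that a region which never overwrites its own inputs can be restarted from its entry without a checkpoint.

```python
from typing import Callable, Dict, Tuple


class SoftErrorDetected(Exception):
    """Hypothetical stand-in for the detection signal (sensor or tail-DMR)."""


def tail_dmr(compute: Callable[[], int], flip_bit_in_first_copy: bool) -> int:
    """Run the tail of a region twice and compare the two results."""
    first = compute()
    if flip_bit_in_first_copy:
        first ^= 1 << 3  # simulated particle strike corrupting one copy
    second = compute()
    if first != second:
        raise SoftErrorDetected("tail-DMR mismatch")
    return first


def region(live_in: Dict[str, int], inject: bool) -> Dict[str, int]:
    """Idempotent region: reads live_in, never overwrites it, returns new values."""
    scaled = live_in["x"] * live_in["k"]  # region body
    result = tail_dmr(lambda: scaled + live_in["b"], inject)  # duplicated tail
    return {"y": result}


def run_with_recovery(
    live_in: Dict[str, int], faults_to_inject: int
) -> Tuple[Dict[str, int], int]:
    """Re-execute the region from its entry until it completes without an error."""
    restarts = 0
    while True:
        try:
            out = region(live_in, inject=restarts < faults_to_inject)
            return out, restarts
        except SoftErrorDetected:
            # Recovery: the inputs are intact, so jump back to the region entry.
            restarts += 1


if __name__ == "__main__":
    outputs, restarts = run_with_recovery({"x": 3, "k": 4, "b": 5}, faults_to_inject=2)
    print(outputs, restarts)
```

Because `region` leaves `live_in` untouched, each restart sees the same inputs as the first attempt, which is why no saved state is needed in this sketch.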
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1261518
- Journal Information:
- ACM SIGPLAN Notices, Vol. 50, Issue 5; Conference: 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES 2015), Portland, OR (United States), 18-19 Jun 2015; ISSN 0362-1340
- Publisher:
- ACM
- Country of Publication:
- United States
- Language:
- English