
Title: FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation

Journal Article · ACM Transactions on Architecture and Code Optimization
DOI: https://doi.org/10.1145/3369381 · OSTI ID:1599993
 [1];  [1];  [1];  [2];  [3];  [1];  [1]
  1. Univ. of Utah, Salt Lake City, UT (United States)
  2. Microsoft Corporation, Redmond, WA (United States)
  3. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver and present a thorough evaluation using many HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.
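
The relativization idea in the abstract can be illustrated with a small toy model. The sketch below is not the FailAmp LLVM transformation; it is a minimal Python illustration under stated assumptions: the function names, the single bit-flip fault model, and the end-of-loop check are all hypothetical and chosen only to show why an offset that builds on the previous one keeps a fault visible, while independently computed offsets can hide it.

```python
def absolute_offsets(base, stride, n, fault_at=None, flip_bit=0):
    """Compute each offset independently as base + i * stride (hypothetical baseline)."""
    offsets = []
    for i in range(n):
        off = base + i * stride
        if i == fault_at:
            off ^= 1 << flip_bit  # simulated soft error in this one calculation
        offsets.append(off)
    return offsets


def relativized_offsets(base, stride, n, fault_at=None, flip_bit=0):
    """Compute each offset from the previous one, so a fault carries forward."""
    offsets = []
    off = base
    for i in range(n):
        if i > 0:
            off = off + stride  # builds on the previous (possibly faulty) offset
        if i == fault_at:
            off ^= 1 << flip_bit  # simulated soft error
        offsets.append(off)
    return offsets


def final_offset_ok(offsets, base, stride):
    """Assumed end-of-loop check: compare the last offset with its closed form."""
    return offsets[-1] == base + (len(offsets) - 1) * stride


def count_corrupted(offsets, base, stride):
    """Count offsets that differ from the fault-free value."""
    return sum(1 for i, off in enumerate(offsets) if off != base + i * stride)
```

In this toy model, a fault injected in the middle of the absolute sequence corrupts exactly one offset and the final check still passes, so the error goes unnoticed. In the relativized sequence, the same fault shifts every later offset by the same amount, so a single check on the last offset is enough to flag it.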

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC05-76RL01830
OSTI ID:
1599993
Report Number(s):
PNNL-SA--148816
Journal Information:
ACM Transactions on Architecture and Code Optimization, Vol. 16, Issue 4; ISSN 1544-3566
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
