FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation
Abstract
We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver, and present a thorough evaluation using many HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.
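The relativization idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's LLVM transformation; it only assumes, for illustration, a loop that walks an array with a fixed stride and a single-bit flip injected into one offset. The function names (`absolute_offsets`, `relativized_offsets`, `final_check`) and the end-of-loop check are hypothetical. When each offset is computed independently, the flipped offset leaves later offsets untouched, so a check on the final offset passes. When each offset is derived from the previous one, the error carries forward and the same check fails.

```python
def absolute_offsets(n, stride, fault_at=None, flip_bit=0):
    """Compute each offset independently as i * stride."""
    offsets = []
    for i in range(n):
        off = i * stride
        if i == fault_at:
            off ^= 1 << flip_bit  # hypothetical single-bit soft error
        offsets.append(off)
    return offsets


def relativized_offsets(n, stride, fault_at=None, flip_bit=0):
    """Derive each offset from the previous one, so a fault persists."""
    offsets = []
    off = 0
    for i in range(n):
        if i > 0:
            off += stride  # build on the previous (possibly faulty) offset
        if i == fault_at:
            off ^= 1 << flip_bit  # same hypothetical single-bit soft error
        offsets.append(off)
    return offsets


def final_check(offsets, n, stride):
    """Single end-of-loop check: does the last offset have its expected value?"""
    return offsets[-1] == (n - 1) * stride


n, stride = 8, 4
# Fault in iteration 3: invisible to the final check with absolute offsets.
print(final_check(absolute_offsets(n, stride, fault_at=3, flip_bit=1), n, stride))
# Same fault with relativized offsets: the error propagates and is caught.
print(final_check(relativized_offsets(n, stride, fault_at=3, flip_bit=1), n, stride))
```

In this toy setting the first call reports that the check passes despite the corrupted offset, while the second reports a failure, which is the amplification effect the abstract describes.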
- Authors:
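- Briggs, Ian; Das, Arnab; Baranowski, Mark; Sharma, Vishal; Krishnamoorthy, Sriram; Rakamaric, Zvonimir; Gopalakrishnan, Ganesh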
- Univ. of Utah, Salt Lake City, UT (United States)
- Microsoft Corporation, Redmond, WA (United States)
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Publication Date:
- 2019-12-18
- Research Org.:
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1599993
- Report Number(s):
- PNNL-SA-148816; Journal ID: ISSN 1544-3566
- Grant/Contract Number:
- AC05-76RL01830
- Resource Type:
- Accepted Manuscript
- Journal Name:
- ACM Transactions on Architecture and Code Optimization
- Additional Journal Information:
- Journal Volume: 16; Journal Issue: 4; Journal ID: ISSN 1544-3566
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Soft error detection; failure amplification; structured address generation; LLVM transformation
Citation Formats
Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, and Gopalakrishnan, Ganesh. FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation. United States: N. p., 2019. Web. doi:10.1145/3369381.
Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, & Gopalakrishnan, Ganesh. FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation. United States. https://doi.org/10.1145/3369381
Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, and Gopalakrishnan, Ganesh. 2019. "FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation". United States. https://doi.org/10.1145/3369381. https://www.osti.gov/servlets/purl/1599993.
@article{osti_1599993,
title = {FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation},
author = {Briggs, Ian and Das, Arnab and Baranowski, Mark and Sharma, Vishal and Krishnamoorthy, Sriram and Rakamaric, Zvonimir and Gopalakrishnan, Ganesh},
abstractNote = {We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver, and present a thorough evaluation using many HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.},
doi = {10.1145/3369381},
journal = {ACM Transactions on Architecture and Code Optimization},
number = 4,
volume = 16,
place = {United States},
year = {2019},
month = {12}
}
Cited by: 1 work