DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation

Abstract

We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver, and present a thorough evaluation using many HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.
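
The relativization idea in the abstract can be illustrated with a minimal sketch. This is not the FailAmp transformation itself, which operates on LLVM IR; it is a toy Python model under assumed names (`absolute_offsets`, `relativized_offsets`, `end_check`) and an assumed single-bit-flip fault model. It shows why an offset derived from its predecessor carries a fault forward, so that one check at loop exit can catch it.

```python
def absolute_offsets(base, stride, n, fault_at=None, flip_bit=0):
    """Compute each offset independently from the base (hypothetical baseline)."""
    offsets = []
    for i in range(n):
        off = base + i * stride
        if i == fault_at:
            off ^= 1 << flip_bit  # simulated soft error: one flipped bit
        offsets.append(off)
    return offsets


def relativized_offsets(base, stride, n, fault_at=None, flip_bit=0):
    """Derive each offset from the previous one, so a fault persists."""
    offsets = []
    off = base
    for i in range(n):
        if i > 0:
            off = off + stride  # builds on the previous (possibly faulty) value
        if i == fault_at:
            off ^= 1 << flip_bit  # simulated soft error: one flipped bit
        offsets.append(off)
    return offsets


def end_check(offsets, base, stride):
    """Single check at loop exit: compare the last offset to its closed form."""
    return offsets[-1] == base + (len(offsets) - 1) * stride


if __name__ == "__main__":
    faulty_abs = absolute_offsets(0, 4, 8, fault_at=2, flip_bit=3)
    faulty_rel = relativized_offsets(0, 4, 8, fault_at=2, flip_bit=3)
    print("absolute, fault detected:   ", not end_check(faulty_abs, 0, 4))
    print("relativized, fault detected:", not end_check(faulty_rel, 0, 4))
```

In the absolute version, the corrupted offset at index 2 affects only that one access, and the final offset still matches its expected value, so the exit check passes. In the relativized version, every later offset inherits the corruption, and the exit check fails.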

Authors:
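Briggs, Ian [1]; Das, Arnab [1]; Baranowski, Mark [1]; Sharma, Vishal [2]; Krishnamoorthy, Sriram [3]; Rakamaric, Zvonimir [1]; Gopalakrishnan, Ganesh [1]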
  1. Univ. of Utah, Salt Lake City, UT (United States)
  2. Microsoft Corporation, Redmond, WA (United States)
  3. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
2019-12-18
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1599993
Report Number(s):
PNNL-SA-148816
Journal ID: ISSN 1544-3566
Grant/Contract Number:  
AC05-76RL01830
Resource Type:
Accepted Manuscript
Journal Name:
ACM Transactions on Architecture and Code Optimization
Additional Journal Information:
Journal Volume: 16; Journal Issue: 4; Journal ID: ISSN 1544-3566
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Soft error detection; failure amplification; structured address generation; LLVM transformation

Citation Formats

Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, and Gopalakrishnan, Ganesh. FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation. United States: N. p., 2019. Web. doi:10.1145/3369381.
Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, & Gopalakrishnan, Ganesh. FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation. United States. https://doi.org/10.1145/3369381
Briggs, Ian, Das, Arnab, Baranowski, Mark, Sharma, Vishal, Krishnamoorthy, Sriram, Rakamaric, Zvonimir, and Gopalakrishnan, Ganesh. 2019. "FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation". United States. https://doi.org/10.1145/3369381. https://www.osti.gov/servlets/purl/1599993.
@article{osti_1599993,
title = {FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation},
author = {Briggs, Ian and Das, Arnab and Baranowski, Mark and Sharma, Vishal and Krishnamoorthy, Sriram and Rakamaric, Zvonimir and Gopalakrishnan, Ganesh},
abstractNote = {We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver, and present a thorough evaluation using many HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.},
doi = {10.1145/3369381},
journal = {ACM Transactions on Architecture and Code Optimization},
number = 4,
volume = 16,
place = {United States},
year = {2019},
month = {12}
}
