
Title: FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Journal Article · ACM Transactions on Architecture and Code Optimization
DOI: https://doi.org/10.1145/3402451 · OSTI ID:1673584

We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
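The detection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's analysis or implementation: the 3-point stencil, the function names, and the threshold are all assumptions made for illustration. In particular, the threshold is a hand-picked multiple of machine epsilon, whereas the abstract describes a bound derived by a rigorous offline analysis. The sketch evaluates the multi-step stencil directly from the initial data at a few selected points and flags any point where the iteratively computed value deviates by more than the threshold.

```python
import math
import sys

WEIGHTS = (0.25, 0.5, 0.25)  # hypothetical 3-point averaging stencil


def step(u):
    """Apply the stencil once, with periodic boundaries."""
    n = len(u)
    return [
        WEIGHTS[0] * u[(i - 1) % n] + WEIGHTS[1] * u[i] + WEIGHTS[2] * u[(i + 1) % n]
        for i in range(n)
    ]


def composite_coeffs(steps):
    """Coefficients of the stencil applied `steps` times (repeated convolution)."""
    coeffs = [1.0]
    for _ in range(steps):
        nxt = [0.0] * (len(coeffs) + 2)
        for k, c in enumerate(coeffs):
            for j, w in enumerate(WEIGHTS):
                nxt[k + j] += c * w
        coeffs = nxt
    return coeffs


def direct_eval(u0, i, steps):
    """Evaluate point i after `steps` steps directly from the initial data."""
    n = len(u0)
    coeffs = composite_coeffs(steps)
    # Coefficient k corresponds to spatial offset k - steps.
    return sum(c * u0[(i - steps + k) % n] for k, c in enumerate(coeffs))


def detect(u0, u_final, steps, check_points):
    """Return the checked indices whose value deviates beyond the threshold."""
    # Assumed threshold: a generous multiple of machine epsilon, NOT a derived bound.
    tol = 8 * (steps + 1) * sys.float_info.epsilon * max(abs(x) for x in u0)
    return [
        i for i in check_points
        if abs(u_final[i] - direct_eval(u0, i, steps)) > tol
    ]


# Usage: run 10 steps, then check every 8th point (coarse in space).
STEPS = 10
u0 = [math.sin(0.1 * i) for i in range(64)]
u_final = u0
for _ in range(STEPS):
    u_final = step(u_final)

clean = detect(u0, u_final, STEPS, range(0, 64, 8))    # no error injected
u_final[16] += 1e-6                                    # simulated soft error
flagged = detect(u0, u_final, STEPS, range(0, 64, 8))  # point 16 is now off
```

Checking only every 8th point after 10 steps mirrors, in a simplified way, the coarse deployment in space and time that the abstract mentions. The trade-off is coverage: a corruption confined to an unchecked point would go unnoticed in this sketch.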

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
Grant/Contract Number:
AC05-76RL01830; 66905; 1704715; 1817073; 1918497
OSTI ID:
1673584
Report Number(s):
PNNL-SA-153397
Journal Information:
ACM Transactions on Architecture and Code Optimization, Vol. 17, Issue 3; ISSN 1544-3566
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English

