U.S. Department of Energy
Office of Scientific and Technical Information

Exploiting data representation for fault tolerance

Journal Article · Journal of Computational Science
 [1];  [2];  [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. North Carolina State Univ., Raleigh, NC (United States)
Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
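The dot-product case study described in the abstract can be illustrated with a small experiment. The following is a minimal sketch, not the authors' analytic model: it flips each of the 64 bits of one operand of a two-element dot product, once with raw inputs and once with both inputs normalized to unit 2-norm, and collects the absolute errors that are at least one. The helper names (`flip_bit`, `dot_errors_after_flip`, `errors_at_least_one`), the example vectors, and the choice of which element to corrupt are all assumptions made for illustration.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of x's IEEE 754 binary64 encoding.

    Bit 63 is the sign, bits 52-62 the exponent, bits 0-51 the fraction.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    (encoded,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", encoded ^ (1 << bit)))
    return flipped


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit 2-norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def dot_errors_after_flip(u, v, index=0):
    """Absolute error in dot(u, v) for each single-bit flip of u[index]."""
    exact = dot(u, v)
    errors = []
    for bit in range(64):
        faulty = list(u)
        faulty[index] = flip_bit(u[index], bit)
        err = abs(dot(faulty, v) - exact)
        # Treat overflow to inf or NaN as an unboundedly large error.
        errors.append(err if math.isfinite(err) else math.inf)
    return errors


def errors_at_least_one(errors):
    """Sorted list of the errors that are >= 1 (illustrative threshold)."""
    return sorted(e for e in errors if e >= 1.0)


# Hypothetical example vectors, chosen so that the norms are exact.
u, v = [3000.0, 4000.0], [4000.0, 3000.0]
raw = errors_at_least_one(dot_errors_after_flip(u, v))
scaled = errors_at_least_one(dot_errors_after_flip(normalize(u), normalize(v)))
print("raw inputs: flips with error >= 1:", len(raw))
print("normalized inputs: flips with error >= 1:", len(scaled))
```

In this particular example the raw inputs produce errors of at least one for many bit positions, spread over a wide range of magnitudes. With normalized inputs, only the flip of the most significant exponent bit does, and that error is so large that a simple magnitude check would catch it. This matches the qualitative behavior the abstract describes, but it is a single hand-picked case rather than a general proof.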
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1240102
Alternate ID(s):
OSTI ID: 1328456
Report Number(s):
SAND--2016-0354J; 619163
Journal Information:
Journal Name: Journal of Computational Science; ISSN 1877-7503
Publisher:
Elsevier
Country of Publication:
United States
Language:
English


