Title: Exploiting data representation for fault tolerance

Journal Article · Journal of Computational Science
 [1];  [2];  [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. North Carolina State Univ., Raleigh, NC (United States)

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
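The bit-flip fault model described above can be illustrated with a minimal sketch. It is not the authors' code: the helper names `flip_bit` and `flip_errors` are hypothetical, and the sketch assumes IEEE 754 binary64 (Python's `float`) with bit 0 as the least significant mantissa bit and bit 63 as the sign bit. It flips each bit of a single value in turn and records the absolute error, which is enough to see why values of magnitude at most one behave differently from larger values.

```python
import struct
from typing import List


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52..62 are the
    exponent field, and bit 63 is the sign (an assumed numbering).
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    # Flip the chosen bit, then reinterpret the bytes as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def flip_errors(x: float) -> List[float]:
    """Absolute error |flip_bit(x, b) - x| for every bit position b."""
    return [abs(flip_bit(x, b) - x) for b in range(64)]


if __name__ == "__main__":
    for value in (0.5, 3.0):
        errors = flip_errors(value)
        small = sum(1 for e in errors if e <= 1.0)
        huge = sum(1 for e in errors if e > 1e300)
        middle = len(errors) - small - huge
        print(f"x = {value}: {small} flips with error <= 1, "
              f"{middle} in between, {huge} with error > 1e300")
```

For `0.5`, every single-bit flip in this sketch yields an error that is either at most one or astronomically large (flipping the top exponent bit), whereas `3.0` also produces intermediate errors such as `9.0`, which are harder to distinguish from legitimate values. This only probes one scalar; the dot product and GMRES results in the abstract are analytic and are not reproduced here.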

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1240102
Alternate ID(s):
OSTI ID: 1328456
Report Number(s):
SAND-2016-0354J; 619163
Journal Information:
Journal Name: Journal of Computational Science; ISSN 1877-7503
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 12 works (citation information provided by Web of Science)
