DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploiting data representation for fault tolerance

Abstract

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
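The bit-flip model described in the abstract can be explored with a small experiment. The following is a minimal sketch, not the authors' code: the helper names `flip_bit` and `bit_flip_errors` are hypothetical, and the sample values 0.5 and 1e6 are chosen only for illustration. It flips each of the 64 bits of an IEEE 754 double in turn and records the absolute error. For a value of magnitude below one, the errors should fall into two well-separated groups (small or enormous), whereas a large value should produce errors of many intermediate sizes.

```python
import math
import struct


def flip_bit(x: float, k: int) -> float:
    """Return x with bit k of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return y


def bit_flip_errors(x: float) -> list[float]:
    """Absolute error |flip_bit(x, k) - x| for k = 0..63.

    A flip that yields Inf or NaN is reported as an infinite error.
    """
    errors = []
    for k in range(64):
        y = flip_bit(x, k)
        errors.append(abs(y - x) if math.isfinite(y) else math.inf)
    return errors


def count_mid_sized(errors: list[float]) -> int:
    """Count errors that are neither small (<= 1) nor enormous (>= 1e300)."""
    return sum(1 for e in errors if 1.0 < e < 1e300)


if __name__ == "__main__":
    for x in (0.5, 1e6):
        errs = bit_flip_errors(x)
        print(f"x = {x}: {count_mid_sized(errs)} of 64 flips give a mid-sized error")
```

The threshold of 1e300 used to separate "enormous" errors is an arbitrary choice for this sketch. The experiment covers only a single stored value; the dot-product and GMRES results in the paper concern errors propagated through arithmetic, which this sketch does not attempt to reproduce.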

Authors:
 Hoemmen, Mark Frederick [1]; Elliott, J. [2]; Sandia National Lab. [3]; Mueller, F. [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. North Carolina State Univ., Raleigh, NC (United States)
  3. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1240102
Alternate Identifier(s):
OSTI ID: 1328456
Report Number(s):
SAND-2016-0354J
Journal ID: ISSN 1877-7503; 619163
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Computational Science
Additional Journal Information:
Journal Name: Journal of Computational Science; Journal ID: ISSN 1877-7503
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; algorithm-based fault tolerance; resilient algorithms; numerical methods

Citation Formats

Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. Exploiting data representation for fault tolerance. United States: N. p., 2015. Web. doi:10.1016/j.jocs.2015.12.002.
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., & Mueller, F. Exploiting data representation for fault tolerance. United States. https://doi.org/10.1016/j.jocs.2015.12.002
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. 2015. "Exploiting data representation for fault tolerance". United States. https://doi.org/10.1016/j.jocs.2015.12.002. https://www.osti.gov/servlets/purl/1240102.
@article{osti_1240102,
title = {Exploiting data representation for fault tolerance},
author = {Hoemmen, Mark Frederick and Elliott, J. and Sandia National Lab. and Mueller, F.},
abstractNote = {Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.},
doi = {10.1016/j.jocs.2015.12.002},
journal = {Journal of Computational Science},
place = {United States},
year = {2015},
month = {1}
}

Citation Metrics:
Cited by: 12 works
Citation information provided by
Web of Science

Works referenced in this record:

The principle of minimized iterations in the solution of the matrix eigenvalue problem
journal, January 1951

  • Arnoldi, W. E.
  • Quarterly of Applied Mathematics, Vol. 9, Issue 1
  • DOI: 10.1090/qam/42792

Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
journal, June 2012

  • Michalak, Sarah E.; DuBois, Andrew J.; Storlie, Curtis B.
  • IEEE Transactions on Device and Materials Reliability, Vol. 12, Issue 2
  • DOI: 10.1109/TDMR.2012.2192736

Modified Gram-Schmidt (MGS), Least Squares, and Backward Stability of MGS-GMRES
journal, January 2006

  • Paige, Christopher C.; Rozložník, Miroslav; Strakoš, Zdeněk
  • SIAM Journal on Matrix Analysis and Applications, Vol. 28, Issue 1
  • DOI: 10.1137/050630416

Floating point fault tolerance with backward error assertions
journal, January 1995

  • Boley, D.; Golub, G. H.; Makar, S.
  • IEEE Transactions on Computers, Vol. 44, Issue 2
  • DOI: 10.1109/12.364541

A detailed analysis of communication load balance on BlueGene supercomputer
journal, August 2009


GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems
journal, July 1986

  • Saad, Youcef; Schultz, Martin H.
  • SIAM Journal on Scientific and Statistical Computing, Vol. 7, Issue 3
  • DOI: 10.1137/0907058

The University of Florida sparse matrix collection
journal, November 2011

  • Davis, Timothy A.; Hu, Yifan
  • ACM Transactions on Mathematical Software, Vol. 38, Issue 1
  • DOI: 10.1145/2049662.2049663

Works referencing / citing this record:

Multi-Objective Optimization for Size and Resilience of Spiking Neural Networks
preprint, January 2020


Multiscale Computing in the Exascale Era
preprint, January 2016


Resilience in Numerical Methods: A Position on Fault Models and Methodologies
preprint, January 2014


Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021

  • Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco
  • The International Journal of High Performance Computing Applications, Vol. 35, Issue 4
  • DOI: 10.1177/1094342021990433