OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Exploiting data representation for fault tolerance

Abstract

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
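
The bit-flip premise and the normalization argument in the abstract can be tried out with a short Python sketch. It is only an illustration under stated assumptions, not the paper's analytic model: the helper names (`flip_bit`, `count_large_errors`, `looks_faulty`), the threshold of 2.0 for a "large" error, and the sample vectors are all hypothetical choices made here. The sketch flips each of the 64 bits of a binary64 value and counts how many flips produce a large absolute error, then uses the Cauchy-Schwarz bound |u . v| <= 1 for unit vectors as a simple check on a dot product result.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def count_large_errors(x: float, threshold: float = 2.0) -> int:
    """Count single-bit flips of x whose absolute error exceeds threshold.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    count = 0
    for bit in range(64):
        err = abs(flip_bit(x, bit) - x)
        # A flip may yield inf or NaN; treat both as a large error.
        if math.isnan(err) or err > threshold:
            count += 1
    return count


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def looks_faulty(value: float, slack: float = 1e-12) -> bool:
    """Flag a dot product of unit vectors that violates |u . v| <= 1."""
    return math.isnan(value) or abs(value) > 1.0 + slack


# A value below one in magnitude versus a value above two.
print(count_large_errors(0.75), count_large_errors(3.0))

# Dot product of two normalized vectors, then a flip of the top exponent bit.
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])
clean = dot(u, v)
corrupted = flip_bit(clean, 62)
print(looks_faulty(clean), looks_faulty(corrupted))
```

In this sketch a value with magnitude below one has fewer bit positions whose flip gives a large error than a value above two, and the one flip that does (the most significant exponent bit) pushes the result far outside the range a dot product of unit vectors can take, so a simple magnitude check catches it.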

Authors:
Hoemmen, Mark Frederick [1]; Elliott, J. [2]; Sandia National Lab. [3]; Mueller, F. [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. North Carolina State Univ., Raleigh, NC (United States)
  3. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
2015-01-06
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1240102
Alternate Identifier(s):
OSTI ID: 1328456
Report Number(s):
SAND-2016-0354J
Journal ID: ISSN 1877-7503; 619163
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Computational Science
Additional Journal Information:
Journal Name: Journal of Computational Science; Journal ID: ISSN 1877-7503
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; algorithm-based fault tolerance; resilient algorithms; numerical methods

Citation Formats

Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. Exploiting data representation for fault tolerance. United States: N. p., 2015. Web. doi:10.1016/j.jocs.2015.12.002.
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., & Mueller, F. (2015). Exploiting data representation for fault tolerance. United States. doi:10.1016/j.jocs.2015.12.002.
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. 2015. "Exploiting data representation for fault tolerance". United States. doi:10.1016/j.jocs.2015.12.002. https://www.osti.gov/servlets/purl/1240102.
@article{osti_1240102,
title = {Exploiting data representation for fault tolerance},
author = {Hoemmen, Mark Frederick and Elliott, J. and Sandia National Lab. and Mueller, F.},
abstractNote = {Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.},
doi = {10.1016/j.jocs.2015.12.002},
journal = {Journal of Computational Science},
place = {United States},
year = {2015},
month = {1}
}


Citation Metrics:
Cited by: 4 works
Citation information provided by Web of Science
