Exploiting data representation for fault tolerance
Abstract
Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
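The dichotomy the abstract describes, where an error is either small or very large, can be seen on a single value. The sketch below is an illustration only, not the authors' analytic model: it flips each of the 64 bits of an IEEE 754 double in turn and records the absolute error. The helper names `flip_bit` and `abs_errors` and the sample values 0.5 and 3.0 are assumptions chosen for the example.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant fraction bit, bits 52-62 hold the
    biased exponent, and bit 63 is the sign.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def abs_errors(x):
    """Absolute error |flip_bit(x, b) - x| for each of the 64 bit positions."""
    return [abs(flip_bit(x, b) - x) for b in range(64)]


# Compare a value of magnitude below one with a value of magnitude above one.
for x in (0.5, 3.0):
    errs = abs_errors(x)
    over_one = sum(e > 1.0 for e in errs)
    print(f"x = {x}: {over_one} of 64 flips give an absolute error above 1; "
          f"largest error = {max(errs):.3e}")
```

For 0.5, only a flip of the most significant exponent bit yields an error above 1, and that error is enormous, so it is easy to detect. For 3.0, more flips exceed 1, and their sizes range from about 3 upward, so there is no clean gap between small and large errors.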
- Authors:
- Hoemmen, Mark Frederick; Elliott, J.; Mueller, F.
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- North Carolina State Univ., Raleigh, NC (United States)
- (SNL-NM), Albuquerque, NM (United States)
- Publication Date:
- 2015-01-06
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1240102
- Alternate Identifier(s):
- OSTI ID: 1328456
- Report Number(s):
- SAND-2016-0354J; Journal ID: ISSN 1877-7503; 619163
- Grant/Contract Number:
- AC04-94AL85000
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Journal of Computational Science
- Additional Journal Information:
- Journal Name: Journal of Computational Science; Journal ID: ISSN 1877-7503
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; algorithm-based fault tolerance; resilient algorithms; numerical methods
Citation Formats
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. Exploiting data representation for fault tolerance. United States: N. p., 2015. Web. doi:10.1016/j.jocs.2015.12.002.
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., & Mueller, F. Exploiting data representation for fault tolerance. United States. https://doi.org/10.1016/j.jocs.2015.12.002
Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. 2015. "Exploiting data representation for fault tolerance". United States. https://doi.org/10.1016/j.jocs.2015.12.002. https://www.osti.gov/servlets/purl/1240102.
@article{osti_1240102,
title = {Exploiting data representation for fault tolerance},
author = {Hoemmen, Mark Frederick and Elliott, J. and Sandia National Lab. and Mueller, F.},
abstractNote = {Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.},
doi = {10.1016/j.jocs.2015.12.002},
journal = {Journal of Computational Science},
place = {United States},
year = {2015},
month = {jan}
}
Works referenced in this record:
The principle of minimized iterations in the solution of the matrix eigenvalue problem
journal, January 1951
- Arnoldi, W. E.
- Quarterly of Applied Mathematics, Vol. 9, Issue 1
Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
journal, June 2012
- Michalak, Sarah E.; DuBois, Andrew J.; Storlie, Curtis B.
- IEEE Transactions on Device and Materials Reliability, Vol. 12, Issue 2
Modified Gram-Schmidt (MGS), Least Squares, and Backward Stability of MGS-GMRES
journal, January 2006
- Paige, Christopher C.; Rozložník, Miroslav; Strakoš, Zdeněk
- SIAM Journal on Matrix Analysis and Applications, Vol. 28, Issue 1
Floating point fault tolerance with backward error assertions
journal, January 1995
- Boley, D.; Golub, G. H.; Makar, S.
- IEEE Transactions on Computers, Vol. 44, Issue 2
A detailed analysis of communication load balance on BlueGene supercomputer
journal, August 2009
- Chen, Yongzhi; Deng, Yuefan
- Computer Physics Communications, Vol. 180, Issue 8
GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems
journal, July 1986
- Saad, Youcef; Schultz, Martin H.
- SIAM Journal on Scientific and Statistical Computing, Vol. 7, Issue 3
The University of Florida sparse matrix collection
journal, November 2011
- Davis, Timothy A.; Hu, Yifan
- ACM Transactions on Mathematical Software, Vol. 38, Issue 1
Works referencing / citing this record:
Multi-Objective Optimization for Size and Resilience of Spiking Neural Networks
preprint, January 2020
- Dimovska, Mihaela; Johnston, Travis; Schuman, Catherine D.
- arXiv
Multiscale Computing in the Exascale Era
preprint, January 2016
- Alowayyed, Saad; Groen, Derek; Coveney, Peter V.
- arXiv
Resilience in Numerical Methods: A Position on Fault Models and Methodologies
preprint, January 2014
- Elliott, James; Hoemmen, Mark; Mueller, Frank
- arXiv
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021
- Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco
- The International Journal of High Performance Computing Applications, Vol. 35, Issue 4