Exploiting data representation for fault tolerance

Hoemmen, Mark Frederick; Elliott, J.; Mueller, F.

doi:10.1016/j.jocs.2015.12.002


Journal Article · Tue Jan 06 2015 · Journal of Computational Science

DOI: https://doi.org/10.1016/j.jocs.2015.12.002 · OSTI ID: 1240102

Hoemmen, Mark Frederick [1]; Elliott, J. [2]; Mueller, F. [2]

[1] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
[2] North Carolina State Univ., Raleigh, NC (United States)

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
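The bit-flip model in the abstract can be explored with a short sketch. The Python below is not the authors' code: the helper names (`flip_bit`, `normalize`, `dot_flip_errors`) and the two-element test vectors are assumptions chosen for illustration. It flips each of the 64 bits of one IEEE 754 binary64 input to a dot product and records the absolute error in the result, once for raw inputs and once for inputs scaled to unit 2-norm.

```python
import math
import struct


def flip_bit(x, bit):
    """Return x with one bit of its binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit 2-norm (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot_flip_errors(u, v, index=0):
    """Absolute error in dot(u, v) for each single-bit flip of u[index]."""
    exact = dot(u, v)
    errors = []
    for bit in range(64):
        corrupted = list(u)
        corrupted[index] = flip_bit(u[index], bit)
        errors.append(abs(dot(corrupted, v) - exact))
    return errors


if __name__ == "__main__":
    raw_u, raw_v = [3.0, 4.0], [4.0, 3.0]  # illustrative inputs only
    for label, (u, v) in {
        "raw": (raw_u, raw_v),
        "normalized": (normalize(raw_u), normalize(raw_v)),
    }.items():
        errors = dot_flip_errors(u, v)
        large = sum(1 for e in errors if e >= 1.0)
        print(f"{label}: {large} of 64 flips give absolute error >= 1")
```

For these particular inputs, normalizing leaves only the flip of the most significant exponent bit able to produce an error of one or more, and that error is many orders of magnitude larger than any legitimate value, which is what makes it detectable. This is a single hand-picked case, not a verification of the paper's general bounds.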


Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC04-94AL85000

OSTI ID: 1240102

Alternate ID(s): OSTI ID: 1328456

Report Number(s): SAND-2016-0354J; 619163

Journal Information: Journal of Computational Science; ISSN 1877-7503

Publisher: Elsevier

Country of Publication: United States

Language: English

Citation Metrics: Cited by 12 works (citation information provided by Web of Science)

References (7)

The principle of minimized iterations in the solution of the matrix eigenvalue problem. Arnoldi, W. E. Quarterly of Applied Mathematics, Vol. 9, Issue 1. https://doi.org/10.1090/qam/42792 · journal · January 1951
Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer. Michalak, Sarah E.; DuBois, Andrew J.; Storlie, Curtis B. IEEE Transactions on Device and Materials Reliability, Vol. 12, Issue 2. https://doi.org/10.1109/TDMR.2012.2192736 · journal · June 2012
Modified Gram-Schmidt (MGS), Least Squares, and Backward Stability of MGS-GMRES. Paige, Christopher C.; Rozložník, Miroslav; Strakoš, Zdeněk. SIAM Journal on Matrix Analysis and Applications, Vol. 28, Issue 1. https://doi.org/10.1137/050630416 · journal · January 2006
Floating point fault tolerance with backward error assertions. Boley, D.; Golub, G. H.; Makar, S. IEEE Transactions on Computers, Vol. 44, Issue 2. https://doi.org/10.1109/12.364541 · journal · January 1995
A detailed analysis of communication load balance on BlueGene supercomputer. Chen, Yongzhi; Deng, Yuefan. Computer Physics Communications, Vol. 180, Issue 8. https://doi.org/10.1016/j.cpc.2009.02.006 · journal · August 2009
GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. Saad, Youcef; Schultz, Martin H. SIAM Journal on Scientific and Statistical Computing, Vol. 7, Issue 3. https://doi.org/10.1137/0907058 · journal · July 1986
The University of Florida sparse matrix collection. Davis, Timothy A.; Hu, Yifan. ACM Transactions on Mathematical Software, Vol. 38, Issue 1. https://doi.org/10.1145/2049662.2049663 · journal · November 2011

Cited By (4)

Multi-Objective Optimization for Size and Resilience of Spiking Neural Networks. Dimovska, Mihaela; Johnston, Travis; Schuman, Catherine D. arXiv. https://doi.org/10.48550/arxiv.2002.01406 · preprint · January 2020
Multiscale Computing in the Exascale Era. Alowayyed, Saad; Groen, Derek; Coveney, Peter V. arXiv. https://doi.org/10.48550/arxiv.1612.02467 · preprint · January 2016
Resilience in Numerical Methods: A Position on Fault Models and Methodologies. Elliott, James; Hoemmen, Mark; Mueller, Frank. arXiv. https://doi.org/10.48550/arxiv.1401.3013 · preprint · January 2014
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction. Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco. The International Journal of High Performance Computing Applications, Vol. 35, Issue 4. https://doi.org/10.1177/1094342021990433 · journal · February 2021

Similar Records

Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic

Technical Report · Thu Aug 01 00:00:00 EDT 2013 · OSTI ID:1240102

Elliott, James J; Mueller, Frank; Stoyanov, Miroslav K; +1 more

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Journal Article · Sat Apr 18 00:00:00 EDT 2015 · Lecture Notes in Computer Science · OSTI ID:1240102

Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita

Double-Precision Floating-Point Cores V1.9

Software · Sat Oct 15 00:00:00 EDT 2005 · OSTI ID:1240102

Govindu, Gokul; Scrofano, Ronald

Related Subjects

97 MATHEMATICS AND COMPUTING
algorithm-based fault tolerance
resilient algorithms
numerical methods
