Exploiting data representation for fault tolerance

Hoemmen, Mark Frederick; Elliott, J.; Mueller, F.

doi:10.1016/j.jocs.2015.12.002

Abstract

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
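The bit-flip behavior described in the abstract can be illustrated with a minimal sketch. This is not the authors' analytic model or their GMRES code. The helper names `flip_bit`, `normalize`, and `dot`, the sample value 0.5, and the sample vectors are all assumptions chosen for illustration. The first part enumerates every single-bit flip in one IEEE 754 double of magnitude below one and splits the resulting absolute errors into "at most one" and "larger than one". The second part relies on the Cauchy-Schwarz inequality: the dot product of two unit-norm vectors has magnitude at most one, so a result outside that range signals a fault.

```python
import math
import struct


def flip_bit(x, k):
    """Return the double x with bit k flipped (0 = lowest mantissa bit, 63 = sign)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return flipped


def normalize(v):
    """Scale v to unit 2-norm, so every entry has magnitude at most one."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Part 1: absolute error caused by each possible single-bit flip in 0.5.
x = 0.5
errors = [abs(flip_bit(x, k) - x) for k in range(64)]
small = [e for e in errors if e <= 1.0]
large = [e for e in errors if e > 1.0]
print(len(small), "flips with error <= 1;", len(large), "flip with error", large[0])

# Part 2: unit-norm inputs bound the fault-free dot product by one in magnitude,
# so a corrupted result above that bound is detectable with a simple check.
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])
clean = dot(u, v)
corrupted = dot([flip_bit(u[0], 62), u[1]], v)  # flip the top exponent bit
fault_detected = abs(corrupted) > 1.0
print(clean, corrupted, fault_detected)
```

For the value 0.5, only the flip of the most significant exponent bit yields a large error; every other flip changes the value by at most one. A check of this kind does not catch the small errors, for example a flip in a low mantissa bit, which is why the abstract treats those as bounded rather than detected.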

Authors:

Hoemmen, Mark Frederick [1]; Elliott, J. [2]; Sandia National Lab. [3]; Mueller, F. [2]

[1] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
[2] North Carolina State Univ., Raleigh, NC (United States)
[3] (SNL-NM), Albuquerque, NM (United States)

Publication Date: 2015-01-06

Research Org.: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1240102

Alternate Identifier(s): OSTI ID: 1328456

Report Number(s): SAND-2016-0354J; Journal ID: ISSN 1877-7503; 619163

Grant/Contract Number: AC04-94AL85000

Resource Type: Accepted Manuscript

Journal Name: Journal of Computational Science

Additional Journal Information: Journal Name: Journal of Computational Science; Journal ID: ISSN 1877-7503

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; algorithm-based fault tolerance; resilient algorithms; numerical methods

Citation Formats


Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. Exploiting data representation for fault tolerance. United States: N. p., 2015. Web. doi:10.1016/j.jocs.2015.12.002.

Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., & Mueller, F. Exploiting data representation for fault tolerance. United States. https://doi.org/10.1016/j.jocs.2015.12.002

Hoemmen, Mark Frederick, Elliott, J., Sandia National Lab., and Mueller, F. 2015. "Exploiting data representation for fault tolerance". United States. https://doi.org/10.1016/j.jocs.2015.12.002. https://www.osti.gov/servlets/purl/1240102.

@article{osti_1240102,

  title        = {Exploiting data representation for fault tolerance},

  author       = {Hoemmen, Mark Frederick and Elliott, J. and Sandia National Lab. and Mueller, F.},

  abstractNote = {Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.},

  doi          = {10.1016/j.jocs.2015.12.002},

  journal      = {Journal of Computational Science},

  place        = {United States},

  year         = {2015},

  month        = {Jan}

}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (Publisher)

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1016/j.jocs.2015.12.002

Citation Metrics:

Cited by: 12 works (citation information provided by Web of Science)

Works referenced in this record:

The principle of minimized iterations in the solution of the matrix eigenvalue problem
journal, January 1951

Arnoldi, W. E.
Quarterly of Applied Mathematics, Vol. 9, Issue 1
DOI: 10.1090/qam/42792

Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer
journal, June 2012

Michalak, Sarah E.; DuBois, Andrew J.; Storlie, Curtis B.
IEEE Transactions on Device and Materials Reliability, Vol. 12, Issue 2
DOI: 10.1109/TDMR.2012.2192736

Modified Gram-Schmidt (MGS), Least Squares, and Backward Stability of MGS-GMRES
journal, January 2006

Paige, Christopher C.; Rozložník, Miroslav; Strakoš, Zdeněk
SIAM Journal on Matrix Analysis and Applications, Vol. 28, Issue 1
DOI: 10.1137/050630416

Floating point fault tolerance with backward error assertions
journal, January 1995

Boley, D.; Golub, G. H.; Makar, S.
IEEE Transactions on Computers, Vol. 44, Issue 2
DOI: 10.1109/12.364541

A detailed analysis of communication load balance on BlueGene supercomputer
journal, August 2009

Chen, Yongzhi; Deng, Yuefan
Computer Physics Communications, Vol. 180, Issue 8
DOI: 10.1016/j.cpc.2009.02.006

GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems
journal, July 1986

Saad, Youcef; Schultz, Martin H.
SIAM Journal on Scientific and Statistical Computing, Vol. 7, Issue 3
DOI: 10.1137/0907058

The University of Florida sparse matrix collection
journal, November 2011

Davis, Timothy A.; Hu, Yifan
ACM Transactions on Mathematical Software, Vol. 38, Issue 1
DOI: 10.1145/2049662.2049663

Works referencing / citing this record:

Multi-Objective Optimization for Size and Resilience of Spiking Neural Networks
preprint, January 2020

Dimovska, Mihaela; Johnston, Travis; Schuman, Catherine D.
arXiv
DOI: 10.48550/arxiv.2002.01406

Multiscale Computing in the Exascale Era
preprint, January 2016

Alowayyed, Saad; Groen, Derek; Coveney, Peter V.
arXiv
DOI: 10.48550/arxiv.1612.02467

Resilience in Numerical Methods: A Position on Fault Models and Methodologies
preprint, January 2014

Elliott, James; Hoemmen, Mark; Mueller, Frank
arXiv
DOI: 10.48550/arxiv.1401.3013

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco
The International Journal of High Performance Computing Applications, Vol. 35, Issue 4
DOI: 10.1177/1094342021990433

Similar Records in DOE PAGES and OSTI.GOV collections:

Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic

Technical Report: Elliott, James J; Mueller, Frank; Stoyanov, Miroslav K; ...

In high-end computing, the collective surface area, smaller fabrication sizes, and increasing density of components have led to an increase in the number of observed bit flips. If mechanisms are not in place to detect them, such flips produce silent errors, i.e. the code returns a result that deviates from the desired solution by more than the allowed tolerance and the discrepancy cannot be distinguished from the standard numerical error associated with the algorithm. These phenomena are believed to occur more frequently in DRAM, but logic gates, arithmetic units, and other circuits are also susceptible to bit flips. Previous work ...
https://doi.org/10.2172/1089338

Full Text Available

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Journal Article: Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita - Lecture Notes in Computer Science

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream ...
Cited by 3
https://doi.org/10.1007/978-3-319-17353-5_11

Full Text Available

Double-Precision Floating-Point Cores V1.9

Software: Govindu, Gokul; Scrofano, Ronald

In studying the acceleration of scientific computing applications with reconfigurable hardware, such as field programmable gate arrays, one finds that many scientific applications require high-precision, floating-point arithmetic that is not innately supported in reconfigurable hardware. Consequently, we have written VHDL code that describes hardware for performing double-precision (64-bit) floating-point arithmetic. From this code, it is possible for users to implement double-precision floating-point operations on FPGAs or any other hardware device to which VHDL code can be synthesized. Specifically, we have written code for four floating-point cores. Each core performs one operation: one performs addition/subtraction, one performs multiplication, one performs division, ...

Benefits of IEEE-754 features in modern symmetric tridiagonal eigensolvers

Journal Article: Marques, Osni; Riedy, Jason E; Vomel, Christof - SIAM Journal on Scientific Computing (SISC)

Bisection is one of the most common methods used to compute the eigenvalues of symmetric tridiagonal matrices. Bisection relies on the Sturm count: For a given shift $\sigma$, the number of negative pivots in the factorization $T - \sigma I = LDL^T$ equals the number of eigenvalues of $T$ that are smaller than $\sigma$. In IEEE-754 arithmetic, the value $\infty$ permits the computation to continue past a zero pivot, producing a correct Sturm count when $T$ is unreduced. Demmel and Li showed that using $\infty$ rather than testing for zero pivots within the loop could significantly improve performance on certain ...
https://doi.org/10.1137/050641624

Full Text Available

Algorithms for Efficient Reproducible Floating Point Summation

Journal Article: Ahrens, Peter; Demmel, James; Nguyen, Hong Diep - ACM Transactions on Mathematical Software

We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociative floating point addition, makes reproducibility challenging even for summation, or operations like the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE Floating Point Standard 754-2008 and bitwise operations ...
https://doi.org/10.1145/3389360
