skip to main content

Title: Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic

In high-end computing, the collective surface area, smaller fabrication sizes, and increasing density of components have led to an increase in the number of observed bit flips. If mechanisms are not in place to detect them, such flips produce silent errors, i.e. the code returns a result that deviates from the desired solution by more than the allowed tolerance and the discrepancy cannot be distinguished from the standard numerical error associated with the algorithm. These phenomena are believed to occur more frequently in DRAM, but logic gates, arithmetic units, and other circuits are also susceptible to bit flips. Previous work has focused on algorithmic techniques for detecting and correcting bit flips in specific data structures, however, they suffer from lack of generality and often times cannot be implemented in heterogeneous computing environment. Our work takes a novel approach to this problem. We focus on quantifying the impact of a single bit flip on specific floating-point operations. We analyze the error induced by flipping specific bits in the most widely used IEEE floating-point representation in an architecture-agnostic manner, i.e., without requiring proprietary information such as bit flip rates and the vendor-specific circuit designs. We initially study dot products of vectors andmore » demonstrate that not all bit flips create a large error and, more importantly, expected value of the relative magnitude of the error is very sensitive on the bit pattern of the binary representation of the exponent, which strongly depends on scaling. Our results are derived analytically and then verified experimentally with Monte Carlo sampling of random vectors. Furthermore, we consider the natural resilience properties of solvers based on the fixed point iteration and we demonstrate how the resilience of the Jacobi method for linear equations can be significantly improved by rescaling the associated matrix.« less
 [1] ;  [2] ;  [1] ;  [1]
  1. ORNL
  2. North Carolina State University
Publication Date:
OSTI Identifier:
Report Number(s):
KJ0402000; ERKJU16
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
Country of Publication:
United States