U.S. Department of Energy
Office of Scientific and Technical Information

Real-number codes for fault-tolerant matrix operations on processor arrays

Journal Article · IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA)
DOI: https://doi.org/10.1109/12.54836 · OSTI ID: 6845297
 [1];  [2]
  1. Center for Reliable and High Performance Computing, Univ. of Illinois, Urbana, IL (US)
  2. Computer Engineering Research Center, Univ. of Texas at Austin, Austin, TX (US)
Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Their use is limited by the inflexibility of the encoding schemes and by potential numerical problems. In addition, numerical errors may be misconstrued as errors due to physical faults in the system. In this paper, the authors develop a generalization of the existing schemes as a possible solution to these shortcomings. The authors prove that linearity is a necessary and sufficient condition for codes used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. They also prove that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error-detecting and error-correcting capabilities. Encoding schemes are given for some example codes that fall within the general set of real-number codes. Based on experiments, the authors derive a rule of thumb for selecting a code for a given application. Simulation experiments show that the performance overhead of fault tolerance schemes using the generalized encoding schemes is very low. Since the overall error in the code also depends on how the coding scheme is implemented, the authors suggest specific algorithms and special hardware realizations for computing the check elements.
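To make the idea of a linear real-number code concrete, the following is a minimal sketch of the classic row/column checksum scheme for matrix multiplication, which is one simple instance of such a code and not the generalized encoding developed in the paper. All function names are hypothetical, and the relative tolerance `tol` is an assumed value chosen only to illustrate how roundoff is kept from being misread as a physical fault.

```python
def column_checksum(a):
    """Encode the left operand: append a row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Encode the right operand: append to each row that row's sum."""
    return [row[:] + [sum(row)] for row in b]

def matmul(a, b):
    """Plain matrix product on lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def find_inconsistencies(c, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows and columns whose stored
    checksum differs from the recomputed sum by more than a relative
    tolerance. The tolerance value is an assumption of this sketch."""
    n_rows, n_cols = len(c) - 1, len(c[0]) - 1

    def differs(expected, actual):
        return abs(expected - actual) > tol * max(1.0, abs(expected), abs(actual))

    bad_rows = [i for i in range(n_rows)
                if differs(c[i][n_cols], sum(c[i][:n_cols]))]
    bad_cols = [j for j in range(n_cols)
                if differs(c[n_rows][j], sum(c[i][j] for i in range(n_rows)))]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [2.0, 0.25]]

# Because the encoding is linear, the product of the encoded operands
# is itself a matrix with valid row and column checksums.
c = matmul(column_checksum(a), row_checksum(b))
clean = find_inconsistencies(c)

# Simulate a faulty processor corrupting one element of the result.
c[1][0] += 1e-3
faulty = find_inconsistencies(c)
```

In this sketch the intersection of the inconsistent row and column locates the corrupted element, and the size of the checksum discrepancy could be used to correct it. Choosing `tol` is the delicate part in practice: too small a value flags ordinary roundoff as a fault, too large a value hides real faults.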
Journal Information:
IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA), Vol. 39:4; ISSN ITCOB; ISSN 0018-9340
Country of Publication:
United States
Language:
English