Real-number codes for fault-tolerant matrix operations on processor arrays
Journal Article
· IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA)
- Center for Reliable and High Performance Computing, Univ. of Illinois, Urbana, IL (US)
- Computer Engineering Research Center, Univ. of Texas at Austin, Austin, TX (US)
Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited by the inflexibility of the encoding schemes and by potential numerical problems. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this paper, the authors develop a generalization of the existing schemes as a possible solution to these shortcomings. The authors prove that linearity is a necessary and sufficient condition for codes used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. They also prove that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error detecting and correcting capabilities. Encoding schemes are given for some of the example codes that fall under the general set of real-number codes. Based on experiments, the authors derive a rule of thumb for the selection of a particular code for a given application. Simulation experiments show that the performance overhead of fault tolerance schemes using the generalized encoding schemes is very low. Since the overall error in the code will also depend on the method of implementation of the coding scheme, the authors also suggest the use of specific algorithms and special hardware realizations for the check element computation.
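As a rough illustration of the kind of checksum code the abstract refers to, the sketch below uses the classic single-checksum encoding (a row of column sums appended to one operand, a column of row sums appended to the other). It is not the generalized real-number code developed in the paper; the function names and the tolerance `tol` are assumptions made for this example only. Because the checksum is a linear function of the data, the product of the encoded operands is itself a consistently encoded matrix, and a tolerance is used when comparing sums so that rounding error is not mistaken for a physical fault.

```python
# Illustrative sketch only: classic single-checksum encoding for matrix
# multiplication. All names and the tolerance value are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row an element holding that row's sum."""
    return [row + [sum(row)] for row in b]

def _mismatch(total, check, tol):
    # Relative comparison, so that rounding error is not flagged as a fault.
    return abs(total - check) > tol * max(1.0, abs(total), abs(check))

def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) for a matrix carrying both checksums."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if _mismatch(sum(c_full[i][:m]), c_full[i][m], tol)]
    bad_cols = [j for j in range(m)
                if _mismatch(sum(c_full[i][j] for i in range(n)),
                             c_full[n][j], tol)]
    return bad_rows, bad_cols

def correct_single_error(c_full, tol=1e-9):
    """Repair one faulty data element using its row checksum."""
    rows, cols = find_inconsistencies(c_full, tol)
    fixed = [row[:] for row in c_full]
    if len(rows) == 1 and len(cols) == 1:
        i, j = rows[0], cols[0]
        m = len(fixed[0]) - 1
        others = sum(fixed[i][k] for k in range(m) if k != j)
        fixed[i][j] = fixed[i][m] - others
    return fixed

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [2.0, 0.25]]

# Linearity: encoded operands multiply to an encoded result.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))

# Simulate a fault in one element of the result.
faulty = [row[:] for row in c_full]
faulty[1][0] += 3.0
```

Here `find_inconsistencies(c_full)` reports no bad rows or columns, while the intersection of the single bad row and bad column in `faulty` locates the corrupted element, which `correct_single_error` then restores. Choosing `tol` is exactly the numerical issue the abstract raises: too small and rounding error looks like a fault, too large and small faults go undetected.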
- OSTI ID: 6845297
- Journal Information: IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA), Vol. 39:4; ISSN ITCOB; ISSN 0018-9340
- Country of Publication: United States
- Language: English
Similar Records
Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures
Journal Article
Thu May 01 00:00:00 EDT 1986
· Proc. IEEE; (United States)
OSTI ID: 5653088
Design and analysis of fault-tolerant processor arrays for numerical applications
Thesis/Dissertation
Wed Dec 31 23:00:00 EST 1986
OSTI ID: 6963943
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
Conference
Tue May 31 00:00:00 EDT 2016
OSTI ID: 1322529