Bounds on algorithm-based fault tolerance in multiple processor systems
Journal Article · IEEE Trans. Comput.; (United States)
An important consideration in the design of high-performance multiple processor systems is ensuring the correctness of the results they compute, since such complex systems are extremely prone to transient and intermittent failures. Faults and errors can be detected and located concurrently with normal system operation by applying appropriate on-line checks to the results of the computations. This is the domain of algorithm-based fault tolerance: low-cost, system-level techniques that produce reliable computations in multiple processor systems by tailoring the fault tolerance to specific algorithms. This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed to achieve concurrent fault detection and location. The objective is to estimate the overhead, in time and in number of processors, that such a scheme requires. Faults in processors, errors in the data, and checks on the data to detect and locate errors are represented as a tripartite graph. Bounds on the time and processor overhead are obtained by considering a series of subproblems. First, using some crude concepts for t-fault detection and t-fault location, bounds are obtained on the maximum size of the error patterns that can arise from such fault patterns. Using these results, bounds are derived on the number of checks required for error detection and location. Some numerical results are derived from a linear programming formulation. Finally, using simple fan-in arguments, bounds are estimated on the time and the number of processors required to compute the on-line checks.
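As a rough illustration of the tripartite model described in the abstract, the following Python sketch links processors to the data they produce and checks to the data they cover, then tests by brute force whether every pattern of at most t faulty processors trips at least one check. It is not the paper's formulation. The names `proc_to_data`, `check_to_data`, and `is_t_fault_detecting`, the parameter `h`, and the check semantics (a check fires only when it covers between 1 and `h` erroneous data elements) are all assumptions made for this example.

```python
from itertools import chain, combinations


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a tuple."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )


def is_t_fault_detecting(proc_to_data, check_to_data, t, h=1):
    """Brute-force test of t-fault detection on a processor/data/check graph.

    Assumed semantics (not taken from the paper):
      * a faulty processor may corrupt any subset of the data it produces;
      * a check fires only if it covers between 1 and h erroneous elements.
    Returns True if every nonempty error pattern caused by at most t faulty
    processors makes at least one check fire.
    """
    processors = list(proc_to_data)
    for k in range(1, t + 1):
        for faulty in combinations(processors, k):
            # Data elements that this fault pattern could corrupt.
            reachable = set().union(*(proc_to_data[p] for p in faulty))
            for errors in nonempty_subsets(sorted(reachable)):
                erroneous = set(errors)
                fired = any(
                    1 <= len(erroneous & set(covered)) <= h
                    for covered in check_to_data.values()
                )
                if not fired:
                    return False
    return True


# Hypothetical example: three processors, each producing one data element.
proc_to_data = {"P1": {"d1"}, "P2": {"d2"}, "P3": {"d3"}}
one_check = {"c1": {"d1", "d2", "d3"}}
two_checks = {"c1": {"d1", "d2"}, "c2": {"d2", "d3"}}

print(is_t_fault_detecting(proc_to_data, one_check, t=1))
print(is_t_fault_detecting(proc_to_data, one_check, t=2))
print(is_t_fault_detecting(proc_to_data, two_checks, t=2))
```

The search is exponential in the number of data elements, so it only suits toy graphs; the bounds in the paper address exactly the question of how many checks are needed without such enumeration.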
- Research Organization: Dept. of Electrical and Computer Eng. and the Coordinated Science Lab., Univ. of Illinois, Urbana, IL 61801
- OSTI ID: 5760583
- Journal Information: IEEE Trans. Comput.; (United States), Vol. C-35:4; ISSN ITCOB
- Country of Publication: United States
- Language: English
Similar Records
- Theory of algorithm-based fault tolerance in array processor systems · Thesis/Dissertation · Mon Dec 31 23:00:00 EST 1984 · OSTI ID: 5954280
- Design and analysis of fault-tolerant processor arrays for numerical applications · Thesis/Dissertation · Wed Dec 31 23:00:00 EST 1986 · OSTI ID: 6963943
- Real-number codes for fault-tolerant matrix operations on processor arrays · Journal Article · Sat Mar 31 23:00:00 EST 1990 · IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA) · OSTI ID: 6845297
Related Subjects
99 GENERAL AND MISCELLANEOUS
990200* -- Mathematics & Computers
ALGORITHMS
ARRAY PROCESSORS
COMPUTERS
CONTROL SYSTEMS
DESIGN
DIGITAL COMPUTERS
ERRORS
FAILURES
FAULT TOLERANT COMPUTERS
GRAPHS
LINEAR PROGRAMMING
MATHEMATICAL LOGIC
ON-LINE CONTROL SYSTEMS
ON-LINE SYSTEMS
PROGRAMMING
RELIABILITY
TIME DEPENDENCE
TRANSIENTS