Fault tolerance and reliability analysis of large-scale multicomputer systems

Najjar, W A

Thesis/Dissertation · Thu Dec 31 23:00:00 EST 1987

OSTI ID:6045974

Fault tolerance is becoming an integral part of the architectural design of large-scale systems, and reliability an important measure in evaluating their performance. This work addresses the effects of increased processor failure rates in large-scale, gracefully degradable distributed computing systems. A probabilistic model of network disconnection is developed and used to evaluate the effects of node failures on the network topology. The results show that although the probability of network disconnection decreases with increasing system size, the resilience of a given topology to network disconnection decreases when the connectivity is kept constant. Combined measures of performance and reliability are used to evaluate the trade-off between increased computational power and increased failure rates as the number of processors grows. For a given recovery mechanism, there exists an optimal number of processors at which the amount of reliable computational work the system can deliver is maximized. Finally, a simple distributed iterative algorithm for fault tolerance is presented and evaluated. Based on a functional execution model of tasks, this algorithm allows the implementation of run-time fault detection, checkpointing, and recovery.
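The trade-off between added computational power and added failure exposure can be illustrated with a toy model. The sketch below is not the model used in the thesis; it assumes, purely for illustration, that processors fail independently at a constant rate, that there is no recovery at all (a single failure loses the whole computation), and that work scales linearly with processor count. The names `expected_reliable_work`, `best_processor_count`, `failure_rate`, and `mission_time` are hypothetical.

```python
import math


def expected_reliable_work(n, failure_rate, mission_time):
    """Expected useful work of n processors under the toy assumptions.

    Assumption (not from the thesis): each processor fails independently
    with an exponential lifetime of rate `failure_rate`, and the run only
    counts if no processor fails before `mission_time`.
    """
    # Probability that all n processors survive the whole mission.
    survival = math.exp(-n * failure_rate * mission_time)
    # Work is n processor-time units, weighted by the survival probability.
    return n * mission_time * survival


def best_processor_count(failure_rate, mission_time, max_n):
    """Search 1..max_n for the processor count that maximizes expected work."""
    return max(
        range(1, max_n + 1),
        key=lambda n: expected_reliable_work(n, failure_rate, mission_time),
    )


if __name__ == "__main__":
    # Illustrative parameter values only.
    n_opt = best_processor_count(failure_rate=1e-3, mission_time=10.0, max_n=1000)
    print(n_opt)
```

In this toy model the expected work is proportional to n·exp(−nλt), which peaks near n = 1/(λt); a real recovery mechanism would shift that optimum, which is the dependence the abstract refers to.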

OSTI does not have a digital full-text copy available.

Research Organization: University of Southern California, Los Angeles, CA (USA)

Country of Publication: United States

Language: English

Similar Records

Network resilience; A measure of network fault tolerance

Journal Article · Wed Jan 31 23:00:00 EST 1990 · IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA) · OSTI ID:6987690

Fault tolerance for VLSI multicomputers

Thesis/Dissertation · Mon Dec 31 23:00:00 EST 1984 · OSTI ID:5127488

Design of fault-tolerant protocols for distributed processing systems

Thesis/Dissertation · Thu Dec 31 23:00:00 EST 1987 · OSTI ID:6988027

Related Subjects

99 GENERAL AND MISCELLANEOUS
990200* -- Mathematics & Computers
ALGORITHMS
ARRAY PROCESSORS
COMPUTERS
DIGITAL COMPUTERS
FAILURES
FAULT TOLERANT COMPUTERS
ITERATIVE METHODS
MATHEMATICAL LOGIC
RELIABILITY
