Fault tolerance and reliability analysis of large-scale multicomputer systems
Fault tolerance is set to become an integral part of the architectural design of large-scale systems, and reliability an important measure in evaluating their performance. This work addresses the effects of increased processor failure rates in large-scale, gracefully degradable distributed computing systems. A probabilistic model of network disconnection is developed and used to evaluate the effects of node failures on the network topology. The results show that although the probability of network disconnection decreases with increasing system size, the resilience of a given topology to network disconnection decreases when the connectivity is kept constant. Combined measures of performance and reliability are used to evaluate the trade-off between increased computational power and failure rates as the number of processors is increased. For a given recovery mechanism, there exists an optimal number of processors at which the amount of reliable computational work the system can deliver is at its maximum. Finally, a simple distributed iterative algorithm for fault tolerance is presented and evaluated. Based on a functional execution model of tasks, this algorithm allows the implementation of run-time fault detection, checkpointing, and recovery.
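The trade-off between computational power and failure rate can be illustrated with a deliberately simple toy model. This is a sketch under stated assumptions, not the model developed in the work: it assumes independent exponential processor failures, no recovery mechanism at all, and that a run's work counts only if no processor fails before it ends. The names `expected_reliable_work`, `best_processor_count`, `failure_rate`, and `mission_time` are hypothetical and chosen for illustration.

```python
import math


def expected_reliable_work(n, failure_rate, mission_time):
    """Expected useful work from n processors under a toy no-recovery model.

    Assumptions (illustrative only, not taken from the source):
    - each processor delivers one unit of work per unit time;
    - each processor fails independently at a constant rate `failure_rate`;
    - the run's work is lost if any processor fails before `mission_time`.
    """
    # Probability that all n processors survive the whole mission.
    survival = math.exp(-n * failure_rate * mission_time)
    return n * mission_time * survival


def best_processor_count(failure_rate, mission_time, max_n=10_000):
    """Search 1..max_n for the processor count that maximizes expected work."""
    return max(
        range(1, max_n + 1),
        key=lambda n: expected_reliable_work(n, failure_rate, mission_time),
    )


if __name__ == "__main__":
    n_opt = best_processor_count(failure_rate=0.001, mission_time=10.0)
    print("optimal processor count in the toy model:", n_opt)
```

In this toy model the expected work grows roughly linearly for small processor counts and then falls off exponentially, so a finite maximum appears near 1 / (failure_rate * mission_time). A real recovery mechanism would change the shape of the curve and the location of the optimum.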
- Research Organization: University of Southern California, Los Angeles, CA (USA)
- OSTI ID: 6045974
- Country of Publication: United States
- Language: English
Similar Records
Fault tolerance for VLSI multicomputers
Design of fault-tolerant protocols for distributed processing systems