An integrated study of fault tolerance in computing systems
A general framework is proposed for the design and analysis of distributed fault-tolerant systems, covering fault/error occurrence and detection, error propagation, fault location, retry, system reconfiguration, damage assessment, and error recovery. Detection mechanisms are usually assumed to be perfect, so that problems within a particular phase of fault tolerance can be studied without considering that phase's interplay with other phases. This dissertation shows that assuming imperfect detection mechanisms greatly influences fault diagnosis, rollback recovery, and checkpointing. Two additional related problems are studied: the use of retry following fault detection, and the optimal placement of checkpoints in a real-time task with or without the perfect-detection assumption. A fault-classification scheme is developed for on-line estimation of fault parameters.
- Research Organization: Michigan Univ., Ann Arbor, MI (USA)
- OSTI ID: 5921993
- Resource Relation: Other Information: Thesis (Ph. D.)
- Country of Publication: United States
- Language: English