OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: An integrated study of fault tolerance in computing systems

Miscellaneous
OSTI ID: 5921993

A general framework is proposed for the design and analysis of distributed fault-tolerant systems. It covers fault/error occurrence and detection, error propagation, fault location, retry, system reconfiguration, damage assessment, and error recovery. Detection mechanisms are usually assumed to be perfect, so that problems within one phase of fault tolerance can be studied without considering its interplay with the other phases. This dissertation shows that assuming imperfect detection mechanisms greatly influences fault diagnosis, rollback recovery, and checkpointing. Two related problems are also studied: the use of retry following fault detection, and the optimal placement of checkpoints in a real-time task, with or without the perfect-detection assumption. A fault-classification scheme is developed for on-line estimation of fault parameters.

Research Organization:
Michigan Univ., Ann Arbor, MI (USA)
Resource Relation:
Other Information: Thesis (Ph. D.)
Country of Publication:
United States
Language:
English