U.S. Department of Energy
Office of Scientific and Technical Information

On fault-tolerant mechanisms in distributed systems

Thesis/Dissertation
OSTI ID:6309833
For a system of communicating processes to recover from hardware failures, it needs adequate mechanisms for process migration, checkpointing, and recovery. In this dissertation, protocols are developed for each of these aspects of fault tolerance. A set of processes can communicate efficiently with each other if the underlying interconnection structure matches the communication pattern of the processes. In a fault-tolerant distributed system, this requirement should be satisfied even when node failures occur. The interconnection structure of a distributed system is represented by a graph G_p, where the nodes of the graph represent the processors and the edges represent communication links between the processors. For a given distributed system G_p with m nodes, an interconnection structure represented by a graph G_n with m + k nodes is said to be a k-fault-tolerant structure of G_p if G_n - F (where F is a subset of the nodes in G_n with card(F) ≤ k) has a subgraph isomorphic to G_p. Fault-tolerant structures are presented for systems where G_p has a loop, star, or star-loop structure. A reconfiguration strategy is a procedure used to migrate the processes on a failed node to a spare node. Reconfiguration strategies for loop and star systems are presented. A checkpointing protocol and a recovery protocol are also presented for fault-tolerant distributed systems. It is shown that the checkpointing protocol requires only a minimum number of processes to save their states during each checkpointing instance. It is also shown that the recovery protocol requires only a minimum number of additional processes to roll back following the failure of a process. The proposed protocols are non-intrusive in the sense that they do not require the processes to stop their computational activity during checkpointing or recovery.
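The definition of a k-fault-tolerant structure given in the abstract can be checked by brute force on small graphs. The following is a minimal sketch only, not the constructions developed in the dissertation: the function names (make_graph, cycle, complete, has_subgraph, is_k_fault_tolerant) and the graph representation are assumptions made for illustration, graphs are assumed to be simple (no self-loops), and the exhaustive search is practical only for a handful of nodes.

```python
from itertools import combinations, permutations

# Hypothetical representation: a graph is (set of nodes, set of frozenset edges).
def make_graph(nodes, edge_pairs):
    return set(nodes), {frozenset(e) for e in edge_pairs}

def cycle(m):
    """Loop structure on m nodes."""
    return make_graph(range(m), [(i, (i + 1) % m) for i in range(m)])

def complete(m):
    """Fully connected structure on m nodes."""
    return make_graph(range(m), combinations(range(m), 2))

def has_subgraph(host_nodes, host_edges, pattern_nodes, pattern_edges):
    """True if the host contains a (not necessarily induced) copy of the pattern."""
    pattern_nodes = list(pattern_nodes)
    # Try every injective placement of pattern nodes onto host nodes.
    for image in permutations(list(host_nodes), len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in host_edges
               for u, v in pattern_edges):
            return True
    return False

def is_k_fault_tolerant(gp, gn, k):
    """Check the definition: G_n has m + k nodes and, for every fault set F
    with card(F) <= k, G_n - F has a subgraph isomorphic to G_p."""
    p_nodes, p_edges = gp
    n_nodes, n_edges = gn
    if len(n_nodes) != len(p_nodes) + k:
        return False
    for size in range(k + 1):
        for faulty in combinations(n_nodes, size):
            failed = set(faulty)
            alive_nodes = n_nodes - failed
            # Drop every link that touches a failed node.
            alive_edges = {e for e in n_edges if not (e & failed)}
            if not has_subgraph(alive_nodes, alive_edges, p_nodes, p_edges):
                return False
    return True
```

For example, under these assumptions is_k_fault_tolerant(cycle(4), complete(5), 1) holds, since removing any one node from a complete graph on 5 nodes still leaves a 4-node loop, whereas is_k_fault_tolerant(cycle(4), cycle(5), 1) does not, since a 5-node loop contains no 4-node loop. A complete graph is a trivial and link-expensive choice, used here only to exercise the check.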
Research Organization:
Stevens Inst. of Tech., Hoboken, NJ (USA)
Country of Publication:
United States
Language:
English