On fault-tolerant mechanisms in distributed systems
Thesis/Dissertation · OSTI ID:6309833
For a system of communicating processes to recover from hardware failures, it needs adequate mechanisms for process migration, checkpointing, and recovery. This dissertation develops protocols for each of these aspects of fault tolerance. A set of processes can communicate efficiently if the underlying interconnection structure matches their communication pattern. In a fault-tolerant distributed system, this requirement should hold even when node failures occur. The interconnection structure of a distributed system is represented by a graph G_p, whose nodes represent the processors and whose edges represent the communication links between them. For a given distributed system G_p with m nodes, an interconnection structure represented by a graph G_n with m + k nodes is said to be a k-fault-tolerant structure of G_p if, for any subset F of the nodes of G_n with card(F) ≤ k, the graph G_n - F has a subgraph isomorphic to G_p. Fault-tolerant structures are presented for systems in which G_p has a loop, star, or star-loop structure. A reconfiguration strategy is a procedure for migrating the processes on a failed node to a spare node. Reconfiguration strategies for loop and star systems are presented. A checkpointing protocol and a recovery protocol are also presented for fault-tolerant distributed systems. It is shown that the checkpointing protocol requires only a minimum number of processes to save their states during each checkpointing instance, and that the recovery protocol requires only a minimum number of additional processes to roll back following the failure of a process. The proposed protocols are non-intrusive in the sense that they do not require the processes to stop their computational activity during checkpointing or recovery.
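The definition of a k-fault-tolerant structure in the abstract can be checked mechanically for very small graphs. The sketch below is not taken from the dissertation: it is a brute-force illustration that assumes undirected graphs given as node and edge lists, and the names `has_subgraph`, `is_k_fault_tolerant`, and the example graphs are hypothetical. It tries every fault set F with card(F) ≤ k and every placement of the nodes of G_p onto the surviving nodes of G_n, so it is only practical for a handful of nodes.

```python
from itertools import combinations, permutations


def has_subgraph(host_nodes, host_edges, pattern_nodes, pattern_edges):
    """Return True if the host graph has a subgraph isomorphic to the pattern.

    Brute force: try every injective placement of pattern nodes onto host
    nodes and check that each pattern edge lands on a host edge.
    """
    host = {frozenset(e) for e in host_edges}
    pattern_nodes = list(pattern_nodes)
    for image in permutations(host_nodes, len(pattern_nodes)):
        placement = dict(zip(pattern_nodes, image))
        if all(frozenset((placement[u], placement[v])) in host
               for u, v in pattern_edges):
            return True
    return False


def is_k_fault_tolerant(gn_nodes, gn_edges, gp_nodes, gp_edges, k):
    """Check the abstract's definition: for every fault set F with at most k
    nodes, G_n - F must still contain a subgraph isomorphic to G_p."""
    for size in range(k + 1):
        for faulty in combinations(gn_nodes, size):
            failed = set(faulty)
            alive = [n for n in gn_nodes if n not in failed]
            alive_edges = [(u, v) for u, v in gn_edges
                           if u not in failed and v not in failed]
            if not has_subgraph(alive, alive_edges, gp_nodes, gp_edges):
                return False
    return True


# Illustrative inputs (assumptions, not structures from the dissertation).
loop4_nodes = list(range(4))
loop4_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]           # 4-node loop

star4_nodes = list(range(4))
star4_edges = [(0, 1), (0, 2), (0, 3)]                   # hub 0, three leaves

five_nodes = list(range(5))                              # m + k = 4 + 1
complete5_edges = list(combinations(five_nodes, 2))      # complete graph
ring5_edges = [(i, (i + 1) % 5) for i in range(5)]       # plain 5-node loop
two_hub_edges = [(0, 1)] + [(h, leaf) for h in (0, 1) for leaf in (2, 3, 4)]

print(is_k_fault_tolerant(five_nodes, complete5_edges, loop4_nodes, loop4_edges, 1))  # True
print(is_k_fault_tolerant(five_nodes, ring5_edges, loop4_nodes, loop4_edges, 1))      # False
print(is_k_fault_tolerant(five_nodes, two_hub_edges, star4_nodes, star4_edges, 1))    # True
```

The 5-node ring fails because removing any node leaves a path with three edges, which cannot contain a 4-node loop. The two-hub graph passes for the star because at least one hub with three neighbours survives any single failure.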
- Research Organization: Stevens Inst. of Tech., Hoboken, NJ (USA)
- OSTI ID: 6309833
- Country of Publication: United States
- Language: English
Similar Records
Designing and reconfiguring fault-tolerant multiprocessor systems
Thesis/Dissertation · Sun Dec 31 23:00:00 EST 1989 · OSTI ID:7046530
Distributed recovery in fault-tolerant multiprocessor networks
Journal Article · Wed Oct 01 00:00:00 EDT 1986 · IEEE Trans. Comput.; (United States) · OSTI ID:5255272
Protocols for configuring computation loops on a distributed multiprocessor system
Conference · Fri Dec 31 23:00:00 EST 1982 · OSTI ID:5169800