Summary: Distributed Reset
Anish ARORA Mohamed GOUDA
Department of Computer Sciences, The University of Texas at Austin
Microelectronics and Computer Technology Corporation, Austin, TX, USA
Abstract
We design a reset subsystem that can be embedded in an arbitrary distributed system in
order to allow the system processes to reset the system when necessary. Our design is layered,
and comprises three main components: a leader election, a spanning tree construction, and a
diffusing computation. Each of these components is self-stabilizing in the following sense. If
the coordination between the up processes in the system is ever lost (due to failures or repairs
of processes and channels) then each component eventually reaches a state where coordination
is regained. This capability makes our reset subsystem very robust: it can tolerate fail-stop
failures and repairs of processes and channels even when a reset is in progress.
Categories and Subject Descriptors: C.2.4 [Computer Communication Systems]: Distributed
Systems--distributed applications, network operating systems; D.1.3 [Programming
Techniques]: Concurrent Programming; D.4.5 [Operating Systems]: Reliability--verification,
fault-tolerance; G.2.2 [Discrete Mathematics]: Graph theory--trees, graph algorithms.
General Terms: Reliability, Algorithms.
Additional Key Words and Phrases: Fault-tolerance, self-stabilization, leader election,
spanning tree, diffusing computation.