Distributed reconfiguration strategies for fault-tolerant multiprocessor systems
The authors investigate strategies for dynamically reconfiguring shared-memory multiprocessor systems that are subject to common-memory faults and unpredictable processor deaths. These strategies aim to determine a communication page, i.e., a page of common memory that a group of processors can use to store crucial common resources such as global locks for synchronization and global data structures for voting algorithms. To ensure system reliability, the reconfiguration strategies must be distributed so that each processor independently arrives at exactly the same choice. This type of reconfiguration strategy is currently used in the STAGE operating system on the Pluribus multiprocessor. The authors analyze the weak points of the Pluribus algorithm and examine alternative strategies that satisfy optimization criteria such as maximizing the number of processors and the number of common memory pages in the reconfigured system. They also present a general distributed algorithm that enables the processors in such a system to exchange the local information needed to reach a consensus on system reconfiguration. 6 references.
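The requirement that every processor independently reach the same choice can be illustrated with a small sketch. It is not the Pluribus algorithm or the authors' algorithm; it assumes the information exchange has already completed, so each processor holds an identical table of which pages each processor can access, and it uses one simple criterion (most processors served, lowest page number as tie-break). The function name `choose_communication_page` and the table layout are hypothetical.

```python
from typing import Dict, Optional, Set


def choose_communication_page(reach: Dict[int, Set[int]]) -> Optional[int]:
    """Pick a communication page from an agreed reachability table.

    `reach` maps a processor id to the set of common-memory pages that
    processor can currently access (hypothetical layout, assumed to be
    identical on every processor after the exchange phase).
    """
    # Count how many processors can access each page.
    counts: Dict[int, int] = {}
    for pages in reach.values():
        for page in pages:
            counts[page] = counts.get(page, 0) + 1
    if not counts:
        return None  # no usable common-memory page at all
    # Prefer the page serving the most processors; the lowest page number
    # breaks ties, so the result does not depend on iteration order.
    return min(counts, key=lambda page: (-counts[page], page))


# Example table: pages 2 and 3 are each reachable by three processors.
reach = {0: {1, 2}, 1: {2, 3}, 2: {2, 3}, 3: {3}}
page = choose_communication_page(reach)
```

Because the rule is a pure function of the shared table with a fixed tie-break, any processor evaluating it on the same table gets the same page, which is the property the abstract calls for. Maximizing the number of common memory pages in the reconfigured system, also mentioned above, would need a different objective.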
- Research Organization:
- Harvard Univ., Cambridge, MA
- OSTI ID:
- 5000979
- Journal Information:
- Journal Name: IEEE Trans. Comput.; (United States); Vol. 8; ISSN ITCOB
- Country of Publication:
- United States
- Language:
- English
Similar Records
Algorithm-based fault tolerance on a hypercube multiprocessor
Designing and reconfiguring fault-tolerant multiprocessor systems