Fault tolerance capabilities in multistage network-based multicomputer systems
The inherent fault tolerance capabilities of multicomputer systems based on multistage interconnection networks (MIN's) are investigated. A current view of fault tolerance in MIN's implies that a MIN is fault tolerant if it maintains full access capability, i.e., if all its inputs are able to connect to all its outputs in the presence of faults. This can be achieved by introducing redundancy to the MIN via additional links, switches, or stages. The integration of such redundancy-based techniques into the system is, however, complex and costly. Moreover, in a typical multicomputing environment not all the resources are continuously utilized, allowing the system to recover from faults via reconfiguration and continue its successful operation even when there is no redundancy in the MIN and its full access capability is not retained in the presence of faults. This is an obvious inherent benefit of any multiple resource system. The objective of this paper is to systematically analyze, formulate and demonstrate the fault tolerance capabilities of nonredundant MIN's. Graph models are used to describe the system, indicate faults, study their effects, and aid in mathematical formulation of these effects. Methodical terminology for defining functionality of two-sided-MIN-based multicomputer systems and specifying their fault tolerance capabilities is introduced and the inherent fault tolerance capabilities of such systems are analyzed. These capabilities are demonstrated on a practical system, the Texas reconfigurable array computer (TRAC). The results of the analysis should allow us to predict the system's capabilities in the presence of faults, and provide important information that can be used by the operating system in the recovery process.
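The full access criterion described in the abstract can be illustrated with a small graph-reachability sketch. The code below is only an illustration under stated assumptions, not the paper's formulation: it assumes an Omega-type MIN (n stages of 2x2 switches, each preceded by a perfect shuffle), models faults as whole-switch failures, and uses hypothetical names (`shuffle`, `reachable_outputs`, `full_access`) and a hypothetical `(stage, switch)` fault encoding.

```python
from typing import Set, Tuple

# Hypothetical fault encoding: (stage index, switch index within that stage).
Fault = Tuple[int, int]


def shuffle(p: int, n: int) -> int:
    """Perfect shuffle on an n-bit line number: rotate its bits left by one."""
    mask = (1 << n) - 1
    return ((p << 1) | (p >> (n - 1))) & mask


def reachable_outputs(n: int, faulty: Set[Fault], src: int) -> Set[int]:
    """Outputs that input `src` can still reach in an assumed Omega network
    with 2**n inputs when the switches listed in `faulty` are unusable."""
    lines = {src}
    for stage in range(n):
        next_lines = set()
        for p in lines:
            switch = shuffle(p, n) >> 1  # the 2x2 switch this line enters
            if (stage, switch) in faulty:
                continue  # a faulty switch passes nothing
            # A working 2x2 switch can route to either of its two outputs.
            next_lines.update((2 * switch, 2 * switch + 1))
        lines = next_lines
    return lines


def full_access(n: int, faulty: Set[Fault]) -> bool:
    """True if every input can still reach every output."""
    size = 1 << n
    return all(len(reachable_outputs(n, faulty, s)) == size for s in range(size))


if __name__ == "__main__":
    print(full_access(3, set()))                  # fault-free 8x8 network
    print(full_access(3, {(0, 0)}))               # one first-stage switch down
    print(reachable_outputs(3, {(0, 0)}, 1))      # an input that avoids the fault
```

In this sketch, a single faulty first-stage switch in an 8x8 network cuts off inputs 0 and 4 entirely, so full access is lost, yet every other input still reaches all eight outputs. That is the kind of residual connectivity a reconfiguring system could exploit when not all resources are in use.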
- Research Organization: Dept. of Electrical and Computer Engineering, The Univ. of Texas, Austin, TX (US)
- OSTI ID: 6992962
- Journal Information: IEEE Trans. Comput. (United States), Vol. 37:7; ISSN ITCOB
- Country of Publication: United States
- Language: English
Similar Records
Multicomputer systems in real-time sensor data processing: a look at the problems of throughput and reliability
Fault tolerance and dynamic partitioning in large-scale parallel systems