Algorithm-based fault tolerance on a hypercube multiprocessor
- Univ. of Texas at Austin, Austin, TX (US)
- IBM, Thomas J. Watson Research Center, Yorktown Heights, NY (US)
- Univ. of Illinois at Urbana, Champaign, IL (US)
Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through parallelism at the processor level: a large number of low-cost processors, each with local memory, are directly connected and communicate by message passing instead of shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most recently proposed fault-tolerance schemes for parallel architectures mainly address reconfiguration of the architecture once a faulty processor has been identified. These schemes assume the existence of an off-line diagnosis strategy that locates the faulty processor. The authors instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection.
- OSTI ID: 6569965
- Journal Information: IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA); Vol. 39:9; ISSN ITCOB; ISSN 0018-9340
- Country of Publication: United States
- Language: English
Similar Records
Designing and reconfiguring fault-tolerant multiprocessor systems
Hypercube multiprocessors 1986