U.S. Department of Energy
Office of Scientific and Technical Information

Algorithm-based fault tolerance on a hypercube multiprocessor

Journal Article · IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA)
OSTI ID:6569965
 [1];  [2];  [3]
  1. Univ. of Texas at Austin, Austin, TX (US)
  2. IBM, Thomas J. Watson Research Center, Yorktown Heights, NY (US)
  3. Univ. of Illinois at Urbana, Champaign, IL (US)

Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through parallelism at the processor level: a large number of low-cost processors, each with local memory, are directly connected and communicate by message passing rather than through shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most recently proposed fault-tolerance schemes for parallel architectures mainly address how to reconfigure the architecture once a faulty processor has been identified. These schemes assume the existence of an off-line diagnosis strategy that locates the faulty processor. The authors instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection.
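
The abstract does not spell out the authors' detection scheme, so the following is only a minimal sketch of the general idea behind algorithm-based error detection, using the classic checksum encoding for matrix multiplication as an illustrative stand-in. The function names (`add_column_checksum`, `add_row_checksum`, `inconsistent_checks`) are hypothetical and are not taken from the paper. Integer data is assumed so that checksums can be compared exactly; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def inconsistent_checks(c_full):
    """Return (bad_rows, bad_cols) for a full-checksum product matrix.

    c_full is the product of a column-checksum matrix and a
    row-checksum matrix, so its last column and last row should equal
    the sums of the data rows and data columns respectively.
    A single corrupted data element shows up as exactly one bad row
    and one bad column, which together locate it.
    """
    n = len(c_full) - 1        # number of data rows
    m = len(c_full[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


# Usage: encode the inputs, multiply, then check the result.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = inconsistent_checks(c_full)

c_full[1][0] += 5              # simulate one erroneous result element
faulty = inconsistent_checks(c_full)
```

In this sketch the check runs on the result of the computation itself, with no separate off-line diagnosis pass. On a multiprocessor, an inconsistent row and column would point to whichever processor produced that element; how work and checks are mapped onto hypercube nodes is not described in this record, so it is left out here.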

Journal Information:
Journal Name: IEEE Transactions on Computers (Institute of Electrical and Electronics Engineers); (USA); Vol. 39:9; ISSN ITCOB; ISSN 0018-9340
Country of Publication:
United States
Language:
English