DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fault tolerance in a supercomputer through dynamic repartitioning

Abstract

A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
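The swapping idea in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patented mechanism: the class `PartitionMap`, its methods, and the group names are hypothetical. Software sees fixed logical group indices, while a table maps each index to a physical processor group; when a group fails, a standby group takes over its logical slot.

```python
class PartitionMap:
    """Hypothetical logical-to-physical map of processor groups with standby spares."""

    def __init__(self, active, spares):
        # Logical group i initially runs on physical group active[i].
        self.logical_to_physical = list(active)
        # Standby physical groups that are not yet in use.
        self.spares = list(spares)

    def report_failure(self, physical_group):
        """Swap a failed physical group for a standby one and return the replacement."""
        if physical_group not in self.logical_to_physical:
            raise ValueError(f"{physical_group!r} is not an active group")
        if not self.spares:
            raise RuntimeError("no standby groups left")
        logical = self.logical_to_physical.index(physical_group)
        replacement = self.spares.pop(0)
        # The logical index is unchanged, so software still sees a full system.
        self.logical_to_physical[logical] = replacement
        return replacement

    def physical_for(self, logical):
        """Return the physical group currently backing a logical group index."""
        return self.logical_to_physical[logical]


# Example: three active groups and one standby group (names are made up).
pm = PartitionMap(active=["g0", "g1", "g2"], spares=["s0"])
pm.report_failure("g1")
```

After the call to `report_failure`, logical group 1 is backed by the standby group `s0`, while logical groups 0 and 2 are unaffected.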

Inventors:
 Chen, Dong [1]; Coteus, Paul W [2]; Gara, Alan G [3]; Takken, Todd E [3]
  1. Croton On Hudson, NY
  2. Yorktown Heights, NY
  3. Mount Kisco, NY
Issue Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
949199
Patent Number(s):
7185226
Application Number:
10/469,002
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
H - ELECTRICITY H05 - ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR H05K - PRINTED CIRCUITS
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Patent
Country of Publication:
United States
Language:
English

Citation Formats

Chen, Dong, Coteus, Paul W, Gara, Alan G, and Takken, Todd E. Fault tolerance in a supercomputer through dynamic repartitioning. United States: N. p., 2007. Web.
Chen, Dong, Coteus, Paul W, Gara, Alan G, & Takken, Todd E. Fault tolerance in a supercomputer through dynamic repartitioning. United States.
Chen, Dong, Coteus, Paul W, Gara, Alan G, and Takken, Todd E. 2007. "Fault tolerance in a supercomputer through dynamic repartitioning". United States. https://www.osti.gov/servlets/purl/949199.
@article{osti_949199,
title = {Fault tolerance in a supercomputer through dynamic repartitioning},
author = {Chen, Dong and Coteus, Paul W and Gara, Alan G and Takken, Todd E},
abstractNote = {A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2007},
month = {2}
}