Fault tolerance in a supercomputer through dynamic repartitioning
Abstract
A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
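The swapping described in the abstract can be pictured as a remapping table kept by system software. The sketch below is illustrative only and is not taken from the patent: the `PartitionMap` class, its method names, and the integer group ids are all hypothetical. It assumes software addresses logical processor groups, and that a failed physical group is replaced by redirecting its logical id to a standby group.

```python
class PartitionMap:
    """Hypothetical table mapping logical processor groups to physical ones."""

    def __init__(self, active_groups, spare_groups):
        # Logical group i initially runs on physical group i.
        self.logical_to_physical = {i: i for i in range(active_groups)}
        # Standby physical groups are numbered after the active ones.
        self.spares = list(range(active_groups, active_groups + spare_groups))
        self.failed = set()

    def report_failure(self, physical_group):
        """Swap a standby group in for a failed one; return the standby's id."""
        self.failed.add(physical_group)
        if physical_group in self.spares:
            # A standby group failed before it was ever used.
            self.spares.remove(physical_group)
            return None
        for logical, physical in self.logical_to_physical.items():
            if physical == physical_group:
                if not self.spares:
                    raise RuntimeError("no standby group left")
                replacement = self.spares.pop(0)
                self.logical_to_physical[logical] = replacement
                return replacement
        return None  # the group was not in use


# Four active groups and one standby group; physical group 2 fails.
pmap = PartitionMap(active_groups=4, spare_groups=1)
pmap.report_failure(2)
print(pmap.logical_to_physical)
```

Because the logical ids stay unchanged after the swap, software written against them still sees a complete set of groups, which is the sense in which the repaired machine appears fully functioning.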
- Inventors:
- Chen, Dong; Coteus, Paul W; Gara, Alan G; Takken, Todd E
- Croton On Hudson, NY
- Yorktown Heights, NY
- Mount Kisco, NY
- Issue Date:
- Research Org.:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 949199
- Patent Number(s):
- 7185226
- Application Number:
- 10/469,002
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- H - ELECTRICITY; H05 - ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR; H05K - PRINTED CIRCUITS
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- W-7405-ENG-48
- Resource Type:
- Patent
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Chen, Dong, Coteus, Paul W, Gara, Alan G, and Takken, Todd E. Fault tolerance in a supercomputer through dynamic repartitioning. United States: N. p., 2007. Web.
Chen, Dong, Coteus, Paul W, Gara, Alan G, & Takken, Todd E. Fault tolerance in a supercomputer through dynamic repartitioning. United States.
Chen, Dong, Coteus, Paul W, Gara, Alan G, and Takken, Todd E. 2007. "Fault tolerance in a supercomputer through dynamic repartitioning". United States. https://www.osti.gov/servlets/purl/949199.
@article{osti_949199,
title = {Fault tolerance in a supercomputer through dynamic repartitioning},
author = {Chen, Dong and Coteus, Paul W and Gara, Alan G and Takken, Todd E},
abstractNote = {A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2007},
month = {2}
}