DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Methods and apparatus using commutative error detection values for fault isolation in multiple node computers

Abstract

Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.
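
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the 32-bit modular byte sum, the function names commutative_checksum and find_suspect_nodes, and the sample packets are all assumptions chosen for illustration. The point it shows is that a commutative value does not depend on the order in which a node injects its packets, so two runs of a reproducible program portion should yield identical per-node values unless a node misbehaved.

```python
from typing import Dict, Iterable, List

MASK = 0xFFFFFFFF  # assumed 32-bit checksum width (illustrative only)


def commutative_checksum(packets: Iterable[bytes]) -> int:
    """Order-independent value: modular sum of the bytes of every packet.

    Addition is commutative, so injecting the same packets in a
    different order produces the same result.
    """
    total = 0
    for packet in packets:
        total = (total + sum(packet)) & MASK
    return total


def find_suspect_nodes(run_a: Dict[int, int], run_b: Dict[int, int]) -> List[int]:
    """Return node ids whose stored values differ between two runs."""
    return sorted(node for node in run_a if run_a[node] != run_b.get(node))


# Hypothetical data: node 0 sends the same packets in a different order,
# node 1 sends one corrupted byte on the second run.
run1 = {
    0: commutative_checksum([b"abc", b"de"]),
    1: commutative_checksum([b"xyz"]),
}
run2 = {
    0: commutative_checksum([b"de", b"abc"]),
    1: commutative_checksum([b"xyZ"]),
}

suspects = find_suspect_nodes(run1, run2)
print(suspects)
```

A plain byte sum is a weak error detection value (it cannot see reordered bytes inside a packet, for instance); it is used here only because it makes the commutative property easy to follow.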

Inventors:
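Almasi, Gheorghe [1]; Blumrich, Matthias Augustin [2]; Chen, Dong [3]; Coteus, Paul [4]; Gara, Alan [5]; Giampapa, Mark E [6]; Heidelberger, Philip [7]; Hoenicke, Dirk I [8]; Singh, Sarabjeet [9]; Steinmacher-Burow, Burkhard D [10]; Takken, Todd [11]; Vranas, Pavlos [12]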
  1. Ardsley, NY
  2. Ridgefield, CT
  3. Croton-On-Hudson, NY
  4. Yorktown, NY
  5. Mount Kisco, NY
  6. Irvington, NY
  7. Cortlandt Manor, NY
  8. Ossining, NY
  9. Mississauga, CA
  10. Wernau, DE
  11. Brewster, NY
  12. Bedford Hills, NY
Issue Date:
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
983062
Patent Number(s):
7383490
Application Number:
11/106,069
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Patent
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Almasi, Gheorghe, Blumrich, Matthias Augustin, Chen, Dong, Coteus, Paul, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Hoenicke, Dirk I, Singh, Sarabjeet, Steinmacher-Burow, Burkhard D, Takken, Todd, and Vranas, Pavlos. Methods and apparatus using commutative error detection values for fault isolation in multiple node computers. United States: N. p., 2008. Web.
Almasi, Gheorghe, Blumrich, Matthias Augustin, Chen, Dong, Coteus, Paul, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Hoenicke, Dirk I, Singh, Sarabjeet, Steinmacher-Burow, Burkhard D, Takken, Todd, & Vranas, Pavlos. Methods and apparatus using commutative error detection values for fault isolation in multiple node computers. United States.
Almasi, Gheorghe, Blumrich, Matthias Augustin, Chen, Dong, Coteus, Paul, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Hoenicke, Dirk I, Singh, Sarabjeet, Steinmacher-Burow, Burkhard D, Takken, Todd, and Vranas, Pavlos. 2008. "Methods and apparatus using commutative error detection values for fault isolation in multiple node computers". United States. https://www.osti.gov/servlets/purl/983062.
@article{osti_983062,
title = {Methods and apparatus using commutative error detection values for fault isolation in multiple node computers},
author = {Almasi, Gheorghe and Blumrich, Matthias Augustin and Chen, Dong and Coteus, Paul and Gara, Alan and Giampapa, Mark E and Heidelberger, Philip and Hoenicke, Dirk I and Singh, Sarabjeet and Steinmacher-Burow, Burkhard D and Takken, Todd and Vranas, Pavlos},
abstractNote = {Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.},
url = {https://www.osti.gov/servlets/purl/983062},
place = {United States},
year = {2008},
month = {6}
}