DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set

Abstract

An analytical mechanism for a massively parallel computer system automatically analyzes data retrieved from the system, and identifies nodes which exhibit anomalous behavior in comparison to their immediate neighbors. Preferably, anomalous behavior is determined by comparing call-return stack tracebacks for each node, grouping like nodes together, and identifying neighboring nodes which do not themselves belong to the group. A node, not itself in the group, having a large number of neighbors in the group, is a likely locality of error. The analyzer preferably presents this information to the user by sorting the neighbors according to number of adjoining members of the group.
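The procedure described in the abstract (group nodes by traceback, then rank non-members by how many of their neighbors belong to the group) can be illustrated with a short sketch. This is not the patented implementation. The function names `mesh_neighbors` and `rank_suspect_nodes`, the 2-D mesh topology, the sample tracebacks, and the choice of the largest group as "the group" are all assumptions made for illustration.

```python
from collections import defaultdict


def mesh_neighbors(width, height):
    """Build an adjacency map for a hypothetical 2-D mesh without wraparound."""
    nbrs = {}
    for x in range(width):
        for y in range(height):
            candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            nbrs[(x, y)] = [
                (i, j) for i, j in candidates
                if 0 <= i < width and 0 <= j < height
            ]
    return nbrs


def rank_suspect_nodes(tracebacks, neighbors):
    """Rank nodes outside the main traceback group by adjoining group members.

    tracebacks: dict mapping node -> sequence of stack frame names
    neighbors:  dict mapping node -> iterable of adjacent nodes
    Returns a list of (node, count) pairs, highest count first.
    """
    # Step 1: group nodes whose call-return tracebacks are identical.
    groups = defaultdict(set)
    for node, trace in tracebacks.items():
        groups[tuple(trace)].add(node)

    # Assumption: the group of interest is the largest one.
    main_group = max(groups.values(), key=len)

    # Step 2: for each node outside the group, count neighbors inside it.
    counts = {}
    for node in tracebacks:
        if node in main_group:
            continue
        n = sum(1 for nb in neighbors.get(node, ()) if nb in main_group)
        if n > 0:
            counts[node] = n

    # Step 3: sort by number of adjoining group members, descending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Usage: a 3x3 mesh where most nodes wait in the same collective call.
neighbors = mesh_neighbors(3, 3)
tracebacks = {node: ("main", "MPI_Allreduce") for node in neighbors}
tracebacks[(1, 1)] = ("main", "compute", "spin_wait")  # made-up anomaly
tracebacks[(0, 0)] = ("main", "MPI_Barrier")           # made-up anomaly
ranking = rank_suspect_nodes(tracebacks, neighbors)
print(ranking)  # [((1, 1), 4), ((0, 0), 2)]
```

In this toy run the center node (1, 1) has four neighbors in the main group and is listed first, which matches the abstract's heuristic that a non-member surrounded by many group members is a likely locality of error.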

Inventors:
Gooding, Thomas Michael (Rochester, MN)
Issue Date:
2011-04-19
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1018213
Patent Number(s):
7930595
Application Number:
11/425,773
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
B591700
Resource Type:
Patent
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Gooding, Thomas Michael. Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set. United States: N. p., 2011. Web.
Gooding, Thomas Michael. (2011). Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set. United States.
Gooding, Thomas Michael. 2011. "Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set". United States. https://www.osti.gov/servlets/purl/1018213.
@article{osti_1018213,
title = {Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set},
author = {Gooding, Thomas Michael},
abstractNote = {An analytical mechanism for a massively parallel computer system automatically analyzes data retrieved from the system, and identifies nodes which exhibit anomalous behavior in comparison to their immediate neighbors. Preferably, anomalous behavior is determined by comparing call-return stack tracebacks for each node, grouping like nodes together, and identifying neighboring nodes which do not themselves belong to the group. A node, not itself in the group, having a large number of neighbors in the group, is a likely locality of error. The analyzer preferably presents this information to the user by sorting the neighbors according to number of adjoining members of the group.},
place = {United States},
year = {2011},
month = {4}
}