DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Identifying failure in a tree network of a parallel computer

Abstract

Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.

Inventors:
 Archer, Charles J [1];  Pinnow, Kurt W [1];  Wallenfelt, Brian P [2]
  1. Rochester, MN
  2. Eden Prairie, MN
Issue Date:
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1013612
Patent Number(s):
7783933
Application Number:
11/531,787
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
B519700
Resource Type:
Patent
Country of Publication:
United States
Language:
English

Citation Formats

Archer, Charles J, Pinnow, Kurt W, and Wallenfelt, Brian P. Identifying failure in a tree network of a parallel computer. United States: N. p., 2010. Web.
Archer, Charles J, Pinnow, Kurt W, & Wallenfelt, Brian P. (2010). Identifying failure in a tree network of a parallel computer. United States.
Archer, Charles J, Pinnow, Kurt W, and Wallenfelt, Brian P. 2010. "Identifying failure in a tree network of a parallel computer". United States. https://www.osti.gov/servlets/purl/1013612.
@article{osti_1013612,
title = {Identifying failure in a tree network of a parallel computer},
author = {Archer, Charles J and Pinnow, Kurt W and Wallenfelt, Brian P},
abstractNote = {Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = {8}
}