Identifying failure in a tree network of a parallel computer
Abstract
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets, each including an I/O node and a plurality of compute nodes. For each processing set, embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and individually testing the potential problem nodes and the links to them.
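The loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the current test value or the rule for picking potential problem nodes, so both are assumptions made here for illustration only: the test value is taken as the relative gap between the I/O node's measured performance and the mean of the test compute nodes, scaled by the predetermined I/O performance, and a node is flagged when its own gap meets the threshold. The names `current_test_value`, `find_potential_problem_nodes`, and the `measure` callback are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Optional


def current_test_value(io_perf: float, test_perfs: List[float],
                       expected_io_perf: float) -> float:
    """ASSUMED formula (not taken from the patent): relative gap between the
    I/O node and the average test compute node, scaled by the predetermined
    (expected) I/O node performance."""
    return abs(io_perf - mean(test_perfs)) / expected_io_perf


def find_potential_problem_nodes(io_node: str,
                                 compute_nodes: List[str],
                                 measure: Callable[[str], float],
                                 expected_io_perf: float,
                                 threshold: float,
                                 sample_size: int,
                                 max_rounds: int = 10,
                                 rng: Optional[random.Random] = None) -> List[str]:
    """Sample subsets of one processing set until a subset's test value is
    not below the threshold, then return that subset's suspect nodes."""
    rng = rng or random.Random()
    for _ in range(max_rounds):
        # Select a set of test compute nodes (a subset of the processing set).
        test_nodes = rng.sample(compute_nodes, sample_size)
        # Measure the I/O node and each selected test compute node.
        io_perf = measure(io_node)
        perfs = {node: measure(node) for node in test_nodes}
        value = current_test_value(io_perf, list(perfs.values()),
                                   expected_io_perf)
        if value < threshold:
            # Below the threshold: select another set of test compute nodes.
            continue
        # Not below the threshold: pick potential problem nodes.
        # ASSUMED rule: flag nodes whose individual gap meets the threshold.
        return sorted(node for node, perf in perfs.items()
                      if abs(io_perf - perf) / expected_io_perf >= threshold)
    return []  # no subset reached the threshold within max_rounds
```

The returned nodes are only candidates. In the abstract's terms, each of them, and the links leading to it, would then be tested individually, a step this sketch leaves out. Passing a seeded `random.Random` as `rng` makes the subset selection reproducible.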
- Inventors:
- Archer, Charles J; Pinnow, Kurt W; Wallenfelt, Brian P
- Rochester, MN
- Eden Prairie, MN
- Issue Date:
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1013612
- Patent Number(s):
- 7783933
- Application Number:
- 11/531,787
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- B519700
- Resource Type:
- Patent
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Archer, Charles J, Pinnow, Kurt W, and Wallenfelt, Brian P. Identifying failure in a tree network of a parallel computer. United States: N. p., 2010. Web.
Archer, Charles J, Pinnow, Kurt W, & Wallenfelt, Brian P. Identifying failure in a tree network of a parallel computer. United States.
Archer, Charles J, Pinnow, Kurt W, and Wallenfelt, Brian P. 2010. "Identifying failure in a tree network of a parallel computer". United States. https://www.osti.gov/servlets/purl/1013612.
@article{osti_1013612,
title = {Identifying failure in a tree network of a parallel computer},
author = {Archer, Charles J and Pinnow, Kurt W and Wallenfelt, Brian P},
abstractNote = {Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.},
url = {https://www.osti.gov/servlets/purl/1013612},
note = {U.S. Patent 7,783,933},
place = {United States},
year = {2010},
month = {8}
}