Link failure detection in a parallel computer
Abstract
Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.
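The steps in the abstract can be sketched as a small simulation. This is an illustrative sketch only, not the patented implementation: the abstract does not say how nodes are assigned to groups, so the checkerboard rule used here (parity of the coordinate sum) is an assumption, as are the names `group_of`, `neighbors`, and `run_first_test` and the modelling of a failed link as a directed (sender, receiver) pair.

```python
from itertools import product


def group_of(coord):
    """Assumed checkerboard rule: 0 = first group, 1 = second group."""
    return sum(coord) % 2


def neighbors(coord, dims):
    """Yield the mesh neighbors of coord (no wraparound links)."""
    for axis in range(len(dims)):
        for step in (-1, 1):
            n = list(coord)
            n[axis] += step
            if 0 <= n[axis] < dims[axis]:
                yield tuple(n)


def run_first_test(dims, failed_links=frozenset()):
    """Simulate one round of first test messages.

    failed_links holds directed (sender, receiver) pairs whose link is down.
    Returns {second-group node: sorted first-group neighbors whose
    test message did not arrive}.
    """
    nodes = list(product(*(range(d) for d in dims)))
    received = {n: set() for n in nodes if group_of(n) == 1}

    # Each first-group node sends a test message to every adjacent node,
    # all of which belong to the second group under the parity rule.
    for sender in nodes:
        if group_of(sender) != 0:
            continue
        for receiver in neighbors(sender, dims):
            if (sender, receiver) not in failed_links:
                received[receiver].add(sender)

    # Each second-group node compares what arrived with what it expected.
    report = {}
    for receiver, got in received.items():
        expected = set(neighbors(receiver, dims))
        report[receiver] = sorted(expected - got)
    return report


if __name__ == "__main__":
    # Hypothetical 3x3 mesh with one broken link from (0, 0) to (0, 1).
    for node, missing in sorted(run_first_test((3, 3), {((0, 0), (0, 1))}).items()):
        print(node, "missing from:", missing)
```

Because adjacent nodes always differ in parity, every link in this round has a first-group sender and a second-group receiver, so each receiver knows exactly which neighbors it should hear from. The sketch covers only the first test message described in the abstract, and the returned report stands in for the step of notifying a user.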
- Inventors:
- Archer, Charles J; Blocksome, Michael A; Megerian, Mark G; Smith, Brian E
- Rochester, MN
- Issue Date:
- 2010-11-09
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1017450
- Patent Number(s):
- 7831866
- Application Number:
- 11/832,940
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- H - ELECTRICITY; H04 - ELECTRIC COMMUNICATION TECHNIQUE; H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- DOE Contract Number:
- B554331
- Resource Type:
- Patent
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Archer, Charles J, Blocksome, Michael A, Megerian, Mark G, and Smith, Brian E. Link failure detection in a parallel computer. United States: N. p., 2010. Web.
Archer, Charles J, Blocksome, Michael A, Megerian, Mark G, & Smith, Brian E. Link failure detection in a parallel computer. United States.
Archer, Charles J, Blocksome, Michael A, Megerian, Mark G, and Smith, Brian E. 2010. "Link failure detection in a parallel computer". United States. https://www.osti.gov/servlets/purl/1017450.
@article{osti_1017450,
title = {Link failure detection in a parallel computer},
author = {Archer, Charles J and Blocksome, Michael A and Megerian, Mark G and Smith, Brian E},
abstractNote = {Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = {11}
}