Title: Toward understanding soft faults in high performance cluster networks.

Conference
OSTI ID:925065

Fault management in high performance cluster networks has been focused on the notion of hard faults (i.e., link or node failures). Network degradations that negatively impact performance but do not result in failures often go unnoticed. In this paper, we classify such degradations as soft faults. In addition, we identify consistent performance as an important requirement in cluster networks. Using this service requirement, we describe a comprehensive strategy for cluster fault management.

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Science Foundation (NSF)
DOE Contract Number:
DE-AC02-06CH11357
OSTI ID:
925065
Report Number(s):
ANL/MCS/CP-109498; TRN: US200807%%26
Resource Relation:
Conference: Integrated Network Management : Managing It All; Mar 24-28, 2003; Colorado Springs, CO
Country of Publication:
United States
Language:
ENGLISH

Similar Records

Fault-tolerant bandwidth reservation strategies for data transfers in high-performance networks
Journal Article · Nov 22, 2016 · Computer Networks · OSTI ID:925065

Partial differential equations preconditioner resilient to soft and hard faults
Journal Article · Jan 29, 2017 · International Journal of High Performance Computing Applications · OSTI ID:925065

DiagSoftfailure: Automated Soft-Failure Diagnostic Tool Using Machine Learning for Network Users
Technical Report · Nov 18, 2019 · OSTI ID:925065