Toward understanding soft faults in high performance cluster networks.
Conference · OSTI ID:925065
Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that reduce performance without causing outright failures often go unnoticed. In this paper, we classify such degradations as soft faults. We also identify consistent performance as an important service requirement in cluster networks. Using this requirement, we describe a comprehensive strategy for cluster fault management.
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: National Science Foundation (NSF)
- DOE Contract Number: DE-AC02-06CH11357
- OSTI ID: 925065
- Report Number(s): ANL/MCS/CP-109498; TRN: US200807%%26
- Resource Relation: Conference: Integrated Network Management: Managing It All; Mar 24-28, 2003; Colorado Springs, CO
- Country of Publication: United States
- Language: English
Similar Records
Fault-tolerant bandwidth reservation strategies for data transfers in high-performance networks
Journal Article · Nov 22, 2016 · Computer Networks
Partial differential equations preconditioner resilient to soft and hard faults
Journal Article · Jan 29, 2017 · International Journal of High Performance Computing Applications
DiagSoftfailure: Automated Soft-Failure Diagnostic Tool Using Machine Learning for Network Users
Technical Report · Nov 18, 2019