
Title: Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Conference

Today's High Performance Computing (HPC) systems can deliver performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction among interconnect errors, network congestion, and application characteristics.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1465034
Resource Relation:
Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018 - Luxembourg City, Luxembourg - 6/25/2018-6/28/2018
Country of Publication:
United States
Language:
English

Similar Records

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system
Journal Article · Mar 22, 2021 · Journal of Parallel and Distributed Computing · OSTI ID:1465034

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Conference · Feb 1, 2015 · 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA); 7-11 Feb. 2015; Burlingame, CA, USA · OSTI ID:1465034

Designing Scalable PGAS Communication Subsystems on Cray Gemini Interconnect
Conference · Dec 26, 2012 · OSTI ID:1465034

Related Subjects