OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system

Journal Article · Journal of Parallel and Distributed Computing
[1]; [2]; [3]; [4]; [5]; [6]; [1]; [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Intel Labs, Bangalore (India)
  3. Northeastern Univ., Boston, MA (United States)
  4. Leidos, Oak Ridge, TN (United States)
  5. Wayne State Univ., Detroit, MI (United States)
  6. Univ. of North Texas, Denton, TX (United States)

Today’s High Performance Computing (HPC) systems contain thousands of nodes that work together to deliver performance on the order of petaflops. The performance of these systems depends on components such as processors, memory, and the interconnect. Among these, the interconnect plays a major role because it ties together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running across multiple processes, because those processes rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores the different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion, and use this to predict which applications will encounter congestion with more than 90% accuracy.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
Grant/Contract Number:
AC05-00OR22725; 1563728; 1561216; 1563750
OSTI ID:
1777710
Alternate ID(s):
OSTI ID: 1815260
Journal Information:
Journal of Parallel and Distributed Computing, Vol. 153; ISSN 0743-7315
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
