U.S. Department of Energy
Office of Scientific and Technical Information

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Conference
 [1];  [2];  [3];  [4];  [1];  [5];  [6];  [3]
  1. Wayne State University, Detroit
  2. Intel Corporation
  3. Northeastern University, Boston
  4. University of Tennessee, Knoxville (UTK)
  5. University of North Texas
  6. ORNL

Today's High Performance Computing (HPC) systems deliver performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1465034
Country of Publication:
United States
Language:
English

Similar Records

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system
Journal Article · Mon Mar 22 00:00:00 EDT 2021 · Journal of Parallel and Distributed Computing · OSTI ID:1777710

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Conference · Sat Jan 31 23:00:00 EST 2015 · 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA); 7-11 Feb. 2015; Burlingame, CA, USA · OSTI ID:1567575

Designing Scalable PGAS Communication Subsystems on Cray Gemini Interconnect
Conference · Tue Dec 25 23:00:00 EST 2012 · OSTI ID:1089101