OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Abstract

Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications running multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.

Authors:
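Kumar, Mohit [1]; Gupta, Saurabh [2]; Patel, Tirthak [3]; Wilder, Michael [4]; Shi, Weisong [1]; Fu, Song [5]; Engelmann, Christian [6]; Tiwari, Devesh [3]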
  1. Wayne State University, Detroit
  2. Intel Corporation
  3. Northeastern University, Boston
  4. The University of Tennessee, Knoxville
  5. University of North Texas
  6. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1465034
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, Luxembourg City, Luxembourg, June 25-28, 2018
Country of Publication:
United States
Language:
English

Citation Formats

Kumar, Mohit, Gupta, Saurabh, Patel, Tirthak, Wilder, Michael, Shi, Weisong, Fu, Song, Engelmann, Christian, and Tiwari, Devesh. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. United States: N. p., 2018. Web. doi:10.1109/DSN.2018.00023.
Kumar, Mohit, Gupta, Saurabh, Patel, Tirthak, Wilder, Michael, Shi, Weisong, Fu, Song, Engelmann, Christian, & Tiwari, Devesh. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. United States. https://doi.org/10.1109/DSN.2018.00023
Kumar, Mohit, Gupta, Saurabh, Patel, Tirthak, Wilder, Michael, Shi, Weisong, Fu, Song, Engelmann, Christian, and Tiwari, Devesh. 2018. "Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System". United States. https://doi.org/10.1109/DSN.2018.00023. https://www.osti.gov/servlets/purl/1465034.
@article{osti_1465034,
title = {Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System},
author = {Kumar, Mohit and Gupta, Saurabh and Patel, Tirthak and Wilder, Michael and Shi, Weisong and Fu, Song and Engelmann, Christian and Tiwari, Devesh},
abstractNote = {Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications running multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.},
doi = {10.1109/DSN.2018.00023},
url = {https://www.osti.gov/biblio/1465034},
place = {United States},
year = {2018},
month = {6}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
