Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Kumar, Mohit; Gupta, Saurabh; Patel, Tirthak; Wilder, Michael; Shi, Weisong; Fu, Song; Engelmann, Christian; Tiwari, Devesh

doi:10.1109/DSN.2018.00023

Conference · June 1, 2018

DOI: https://doi.org/10.1109/DSN.2018.00023 · OSTI ID: 1465034

Kumar, Mohit ^[1]; Gupta, Saurabh ^[2]; Patel, Tirthak ^[3]; Wilder, Michael ^[4]; Shi, Weisong ^[1]; Fu, Song ^[5]; Engelmann, Christian ^[6]; Tiwari, Devesh ^[3]

[1] Wayne State University, Detroit
[2] Intel Corporation
[3] Northeastern University, Boston
[4] University of Tennessee, Knoxville (UTK)
[5] University of North Texas
[6] ORNL

Today's High Performance Computing (HPC) systems deliver performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1465034

Country of Publication: United States

Language: English
