U.S. Department of Energy
Office of Scientific and Technical Information

Performance Evaluation of Adaptive Routing on Dragonfly-based Production Systems

Conference
Performance of applications in production environments can be sensitive to network congestion. Cray Aries supports adaptive routing of each network packet independently, based on the load or congestion encountered as the packet traverses the network. Software can dictate different routing policies, adjusting between minimal and non-minimal bias, for each posted message. We have extensively evaluated the sensitivity of application performance and whole-system performance to the routing bias selection, in both production and controlled conditions. We show that the default routing bias used in Aries-based systems is often sub-optimal, and that using a higher bias towards minimal routes not only reduces the congestion effects on the application but also decreases the overall congestion on the network. This routing scheme results in not only improved mean performance (by up to 12%) for most production applications but also reduced run-to-run variability. Our study prompted two supercomputing facilities (ALCF and NERSC) to change the default routing mode on their Aries-based systems. We present the substantial improvement measured in overall congestion management and interconnect performance in production after making this change.
Research Organization:
Argonne National Laboratory (ANL)
Sponsoring Organization:
USDOE Office of Science
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1863247
Country of Publication:
United States
Language:
English

Similar Records

Preparing NERSC users for Cori, a Cray XC40 system with Intel many integrated cores
Journal Article · Thu Aug 24 20:00:00 EDT 2017 · Concurrency and Computation. Practice and Experience · OSTI ID:1459400

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System
Conference · Fri Jun 01 00:00:00 EDT 2018 · OSTI ID:1465034

Modeling Large-Scale Slim Fly Networks Using Parallel Discrete-Event Simulation
Journal Article · Wed Aug 29 20:00:00 EDT 2018 · ACM Transactions on Modeling and Computer Simulation · OSTI ID:1488539
