DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Modeling Large-Scale Slim Fly Networks Using Parallel Discrete-Event Simulation

Journal Article · ACM Transactions on Modeling and Computer Simulation
DOI: https://doi.org/10.1145/3203406 · OSTI ID:1488539
 [1];  [2];  [1];  [2];  [2]
  1. Rensselaer Polytechnic Inst., Troy, NY (United States)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)

As supercomputers approach exascale performance, the increased number of processors translates to an increased demand on the underlying network interconnect. The slim fly network topology, a new low-diameter, low-latency, and low-cost interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this article, we present a high-fidelity slim fly packet-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate the model against published work before scaling the network size up to an unprecedented 1 million compute nodes and confirming that the slim fly observes peak network throughput at extreme scale. In addition to synthetic workloads, we evaluate large-scale slim fly models with real communication workloads from applications in the Design Forward program with over 110,000 MPI processes. We show strong scaling of the slim fly model on an Intel cluster, achieving a peak network packet transfer rate of 2.3 million packets per second and processing over 7 billion discrete events using 128 MPI tasks. Enabled by the strong performance capabilities of the model, we perform a detailed application trace and routing protocol performance study. Lastly, through analysis of metrics such as packet latency, hop count, and congestion, we find that the slim fly network is able to leverage simple minimal routing and achieve the same performance as more complex adaptive routing for tested DOE benchmark applications.

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); Air Force Research Laboratory (AFRL)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1488539
Journal Information:
ACM Transactions on Modeling and Computer Simulation, Vol. 28, Issue 4; ISSN 1049-3301
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
