Enabling parallel simulation of large-scale HPC network systems
Abstract
With the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems, in particular networks. To make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.
- Authors:
- Mubarak, Misbah; Carothers, Christopher D.; Ross, Robert B.; Carns, Philip
- Argonne National Lab. (ANL), Lemont, IL (United States)
- Rensselaer Polytechnic Inst., Troy, NY (United States)
- Publication Date:
- Research Org.:
- Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1366454
- Grant/Contract Number:
- AC02-06CH11357
- Resource Type:
- Journal Article: Accepted Manuscript
- Journal Name:
- IEEE Transactions on Parallel and Distributed Systems
- Additional Journal Information:
- Journal Volume: 28; Journal Issue: 1; Journal ID: ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; interconnect networks; massively parallel discrete-event simulation; trace-based simulation
Citation Formats
Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., and Carns, Philip. Enabling parallel simulation of large-scale HPC network systems. United States: N. p., 2016. Web. doi:10.1109/TPDS.2016.2543725.
Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., & Carns, Philip. Enabling parallel simulation of large-scale HPC network systems. United States. https://doi.org/10.1109/TPDS.2016.2543725
Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., and Carns, Philip. 2016. "Enabling parallel simulation of large-scale HPC network systems". United States. https://doi.org/10.1109/TPDS.2016.2543725. https://www.osti.gov/servlets/purl/1366454.
@article{osti_1366454,
title = {Enabling parallel simulation of large-scale HPC network systems},
author = {Mubarak, Misbah and Carothers, Christopher D. and Ross, Robert B. and Carns, Philip},
abstractNote = {With the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems, in particular networks. To make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.},
doi = {10.1109/TPDS.2016.2543725},
url = {https://www.osti.gov/biblio/1366454},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = 1,
volume = 28,
place = {United States},
year = {2016},
month = {4}
}
Works referencing / citing this record:
Interference between I/O and MPI Traffic on Fat-tree Networks
conference, August 2018
- Brown, Kevin A.; Jain, Nikhil; Matsuoka, Satoshi
- ICPP 2018: 47th International Conference on Parallel Processing, Proceedings of the 47th International Conference on Parallel Processing
Reconfiguration of the Multi-channel Communication System with Hierarchical Structure and Distributed Passive Switching
book, June 2019
- Hajder, Piotr; Rauch, Łukasz; Rodrigues, João M. F.
- Computational Science – ICCS 2019: 19th International Conference, Faro, Portugal, June 12–14, 2019, Proceedings, Part II, p. 502-516
AccaSim: a customizable workload management simulator for job dispatching research in HPC systems
journal, February 2019
- Galleguillos, Cristian; Kiziltan, Zeynep; Netti, Alessio
- Cluster Computing, Vol. 23, Issue 1
MAHA: Migration-based Adaptive Heuristic Algorithm for Large-scale Network Simulations
journal, September 2019
- Ibrahim, Muhammad; Iqbal, Muhammad Azhar; Aleem, Muhammad
- Cluster Computing, Vol. 23, Issue 2
Comparative Analysis of Parallel Brain Activity Mapping Algorithms for High Resolution Brain Models
journal, September 2019
- Molina-Machado, Cristhian D.; Cuartas, Ernesto; Martínez-Vargas, Juan D.
- TecnoLógicas, Vol. 22, Issue 46