DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Enabling parallel simulation of large-scale HPC network systems

Abstract

Here, with the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems—in particular, networks. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at a flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.
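The abstract's second point rests on ROSS's optimistic event scheduling, in which events are executed speculatively and rolled back when a causality violation is detected. The sketch below is not taken from the paper or the CODES code base; all identifiers (router_state, flit_arrive, and so on) are illustrative. It only shows the forward/reverse event-handler pairing that optimistic simulators such as ROSS require of each model: every state change made by a forward handler must be undone exactly by a matching reverse handler.

/*
 * Minimal sketch (hypothetical names, not the CODES or ROSS API) of the
 * forward/reverse handler contract used by optimistic parallel
 * discrete-event simulation.
 */
#include <stdio.h>
#include <assert.h>

typedef struct {            /* per-router (logical process) state */
    long flits_forwarded;   /* number of flits this router has handled */
    long busy_time;         /* accumulated link-busy time (model units) */
} router_state;

typedef struct {            /* payload carried by one simulation event */
    int  dest_port;         /* output port chosen for this flit */
    long link_delay;        /* time the flit occupies the link */
} flit_event;

/* Forward handler: apply the effect of one flit arrival. */
static void flit_arrive(router_state *s, const flit_event *ev)
{
    s->flits_forwarded += 1;
    s->busy_time       += ev->link_delay;
    /* A full model would also schedule the flit's arrival at ev->dest_port. */
}

/* Reverse handler: undo flit_arrive exactly, enabling optimistic rollback. */
static void flit_arrive_rc(router_state *s, const flit_event *ev)
{
    s->flits_forwarded -= 1;
    s->busy_time       -= ev->link_delay;
}

int main(void)
{
    router_state s  = {0, 0};
    flit_event   ev = { .dest_port = 3, .link_delay = 50 };

    flit_arrive(&s, &ev);      /* speculative execution */
    flit_arrive_rc(&s, &ev);   /* rollback after a causality violation */

    assert(s.flits_forwarded == 0 && s.busy_time == 0);
    printf("state restored after rollback\n");
    return 0;
}

Because rollback replaces global synchronization, model code written in this style can be executed optimistically across many ranks, which is what lets the paper's flit-level torus and dragonfly models scale on large clusters.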

Authors:
Mubarak, Misbah [1]; Carothers, Christopher D. [2]; Ross, Robert B. [1]; Carns, Philip [1]
  1. Argonne National Lab. (ANL), Lemont, IL (United States)
  2. Rensselaer Polytechnic Inst., Troy, NY (United States)
Publication Date:
April 2016
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1366454
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 28; Journal Issue: 1; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; interconnect networks; massively parallel discrete-event simulation; trace-based simulation

Citation Formats

Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., and Carns, Philip. Enabling parallel simulation of large-scale HPC network systems. United States: N. p., 2016. Web. doi:10.1109/TPDS.2016.2543725.
Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., & Carns, Philip. Enabling parallel simulation of large-scale HPC network systems. United States. doi:10.1109/TPDS.2016.2543725.
Mubarak, Misbah, Carothers, Christopher D., Ross, Robert B., and Carns, Philip. 2016. "Enabling parallel simulation of large-scale HPC network systems". United States. doi:10.1109/TPDS.2016.2543725. https://www.osti.gov/servlets/purl/1366454.
@article{osti_1366454,
title = {Enabling parallel simulation of large-scale HPC network systems},
author = {Mubarak, Misbah and Carothers, Christopher D. and Ross, Robert B. and Carns, Philip},
abstractNote = {Here, with the increasing complexity of today’s high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems—in particular, networks. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at a flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today’s high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations},
doi = {10.1109/TPDS.2016.2543725},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 1,
volume = 28,
place = {United States},
year = {2016},
month = {4}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works (citation information provided by Web of Science)


Works referencing / citing this record:

Interference between I/O and MPI Traffic on Fat-tree Networks
conference, August 2018

  • Brown, Kevin A.; Jain, Nikhil; Matsuoka, Satoshi
  • ICPP 2018: Proceedings of the 47th International Conference on Parallel Processing
  • DOI: 10.1145/3225058.3225144