OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.3805 · OSTI ID: 1286913
 [1];  [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Summary: Investigating the performance of parallel applications at scale on future high‐performance computing (HPC) architectures, and the performance impact of different HPC architecture choices, is an important component of HPC hardware/software co‐design. The Extreme‐scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The xSim toolkit strives to limit simulation overheads in order to maintain performance and productivity criteria. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management cost. These enhancements resulted in significant performance improvements. When running the NASA Advanced Supercomputing Parallel Benchmark suite, the simulation overhead dropped from 1,020% to 238% for the conjugate gradient benchmark and from 102% to 0% for the embarrassingly parallel benchmark. The improvements also reduced overheads in the highly accurate simulation mode of xSim, which is useful in resilience studies for tracking intentional MPI process failures. In the highly accurate mode, the simulation overhead was reduced from 37,511% to 13,808% for conjugate gradient and from 3,332% to 204% for embarrassingly parallel. Copyright © 2016 John Wiley & Sons, Ltd.
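
To put the overhead percentages quoted in the summary on a common scale, the short sketch below converts them to run-time multiples. It assumes that overhead means (simulated time - native time) / native time × 100; the summary does not define the term, so this definition and the helper name `slowdown_factor` are illustrative assumptions rather than the paper's own formulation.

```python
def slowdown_factor(overhead_percent: float) -> float:
    """Convert a simulation overhead in percent to a run-time multiple.

    Assumption (not stated in the summary):
    overhead = (simulated_time - native_time) / native_time * 100
    """
    return 1.0 + overhead_percent / 100.0


# Overheads quoted in the summary, as (before, after) in percent.
reported = {
    "conjugate gradient": (1020.0, 238.0),
    "embarrassingly parallel": (102.0, 0.0),
    "conjugate gradient, highly accurate mode": (37511.0, 13808.0),
    "embarrassingly parallel, highly accurate mode": (3332.0, 204.0),
}

for name, (before, after) in reported.items():
    # Ratio of run-time multiples: how much faster the improved simulation runs.
    speedup = slowdown_factor(before) / slowdown_factor(after)
    print(
        f"{name}: {slowdown_factor(before):.2f}x -> "
        f"{slowdown_factor(after):.2f}x native time "
        f"({speedup:.2f}x faster simulation)"
    )
```

Under this assumed definition, an overhead of 0% corresponds to a simulated run that takes the same time as the native run.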

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1286913
Alternate ID(s):
OSTI ID: 1401192
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 28, Issue 12; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science