Title: Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Abstract

Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. Ultimately, by replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances even for runs with hundreds of thousands of processes.
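The propagation effect described in the abstract can be illustrated with a toy model. The sketch below is not the authors' algorithm or cost model; it is a minimal illustration under stated assumptions: a one-step pipeline in which process 0 computes and sends, and every later process computes, blocks in a receive from its predecessor, then sends onward. The function names `forward_pass` and `backward_pass` and the rule of charging each wait entirely to the nearest upstream process that did not itself wait are hypothetical simplifications chosen for this example.

```python
def forward_pass(work):
    """Forward replay: detect wait states in a one-step pipeline.

    work[i] is the (assumed) computation time of process i before it
    communicates. Process i > 0 blocks until process i-1 has sent.
    Returns (wait, send_time), one entry per process.
    """
    n = len(work)
    wait = [0.0] * n
    send_time = [0.0] * n
    send_time[0] = work[0]  # process 0 never waits
    for i in range(1, n):
        # Late-sender wait: receiver is ready at work[i], message
        # leaves the predecessor at send_time[i-1].
        wait[i] = max(0.0, send_time[i - 1] - work[i])
        send_time[i] = work[i] + wait[i]
    return wait, send_time


def backward_pass(wait):
    """Backward replay: attribute each wait to an upstream root cause.

    Simplification (an assumption of this sketch): the whole wait of
    process i is charged to the nearest upstream process that did not
    itself wait, since its late send started the chain.
    """
    cost = [0.0] * len(wait)
    for i in range(len(wait) - 1, 0, -1):
        if wait[i] == 0:
            continue
        root = i - 1
        # Skip processes that were only passing a delay along.
        while root > 0 and wait[root] > 0:
            root -= 1
        cost[root] += wait[i]
    return cost


if __name__ == "__main__":
    wait, _ = forward_pass([5, 1, 1, 1])
    print("wait per process:", wait)
    print("cost per root cause:", backward_pass(wait))
```

With work times of 5, 1, 1, 1, each of the three downstream processes waits 4 time units, and the backward pass charges all 12 units to process 0, even though processes 2 and 3 never communicate with it directly.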

Authors:
 Böhme, David [1]; Geimer, Markus [2]; Arnold, Lukas [2]; Voigtlaender, Felix [3]; Wolf, Felix [4]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Forschungszentrum Julich (Germany). Julich Supercomputing Centre (JSC)
  3. RWTH Aachen Univ. (Germany)
  4. Technical Univ. of Darmstadt (Germany)
Publication Date:
2016-07
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1305834
Report Number(s):
LLNL-JRNL-663039
Journal ID: ISSN 2329-4949
Grant/Contract Number:
AC52-07NA27344; GSC 111; VH-NG-118
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
ACM Transactions on Parallel Computing
Additional Journal Information:
Journal Volume: 3; Journal Issue: 2; Journal ID: ISSN 2329-4949
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Performance analysis; cause analysis; load imbalance; event tracing; MPI; OpenMP

Citation Formats

Böhme, David, Geimer, Markus, Arnold, Lukas, Voigtlaender, Felix, and Wolf, Felix. Identifying the Root Causes of Wait States in Large-Scale Parallel Applications. United States: N. p., 2016. Web. doi:10.1145/2934661.
Böhme, David, Geimer, Markus, Arnold, Lukas, Voigtlaender, Felix, & Wolf, Felix. Identifying the Root Causes of Wait States in Large-Scale Parallel Applications. United States. doi:10.1145/2934661.
Böhme, David, Geimer, Markus, Arnold, Lukas, Voigtlaender, Felix, and Wolf, Felix. 2016. "Identifying the Root Causes of Wait States in Large-Scale Parallel Applications". United States. doi:10.1145/2934661. https://www.osti.gov/servlets/purl/1305834.
@article{osti_1305834,
title = {Identifying the Root Causes of Wait States in Large-Scale Parallel Applications},
author = {Böhme, David and Geimer, Markus and Arnold, Lukas and Voigtlaender, Felix and Wolf, Felix},
doi = {10.1145/2934661},
journal = {ACM Transactions on Parallel Computing},
number = 2,
volume = 3,
place = {United States},
year = 2016,
month = 7
}

Similar Records:
  • Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial location. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.
  • Simulation is a widely adopted method to analyze and predict the performance of large-scale parallel applications. Validating the hardware model is highly important for complex simulations with a large number of parameters. Common practice involves calculating the percent error between the projected and the real execution time of a benchmark program. However, in a high-dimensional parameter space, this coarse-grained approach often suffers from parameter insensitivity, which may not be known a priori. Moreover, the traditional approach cannot be applied to the validation of software models, such as application skeletons used in online simulations. In this work, we present a methodology and a toolset for validating both hardware and software models by quantitatively comparing fine-grained statistical characteristics obtained from execution traces. Although statistical information has been used in tasks like performance optimization, this is the first attempt to apply it to simulation validation. Lastly, our experimental results show that the proposed evaluation approach offers significant improvement in fidelity when compared to evaluation using total execution time, and the proposed metrics serve as reliable criteria that progress toward automating the simulation tuning process.
  • This paper discusses the design of large-scale (1000x1000) optical crossbar switching networks for use in parallel processing supercomputers. Alternative design sketches for an optical crossbar switching network are presented using free-space optical transmission with either a beam spreading/masking model or a beam steering model for internodal communications. The performances of alternative multiple access channel communications protocols - unslotted and slotted ALOHA and carrier sense multiple access (CSMA) - are compared with the performance of the classic arbitrated bus crossbar of conventional electronic parallel computing. These comparisons indicate an almost inverse relationship between ease of implementation and speed of operation. Practical issues of optical system design are addressed, and an optically addressed, composite spatial light modulator design is presented for fabrication to arbitrarily large scale.
  • In this paper, we introduce a parallel continuous simulated tempering (PCST) method for enhanced sampling in studying large complex systems. It mainly inherits the continuous simulated tempering (CST) method from our previous studies [C. Zhang and J. Ma, J. Chem. Phys. 130, 194112 (2009); C. Zhang and J. Ma, J. Chem. Phys. 132, 244101 (2010)], while adopting the spirit of parallel tempering (PT), or the replica exchange method, by employing multiple copies with different temperature distributions. Unlike conventional PT methods, and despite the large stride of the total temperature range, the PCST method requires very few copies of simulations, typically 2–3 copies, yet it is still capable of maintaining a high rate of exchange between neighboring copies. Furthermore, in the PCST method, the size of the system does not dramatically affect the number of copies needed because the exchange rate is independent of total potential energy, thus providing an enormous advantage over conventional PT methods in studying very large systems. The sampling efficiency of PCST was tested in a two-dimensional Ising model, a Lennard-Jones liquid, and an all-atom folding simulation of a small globular protein, trp-cage, in explicit solvent. The results demonstrate that the PCST method significantly improves sampling efficiency compared with other methods and that it is particularly effective in simulating systems with long relaxation or correlation times. We expect the PCST method to be a good alternative to parallel tempering methods in simulating large systems, such as phase transitions and the dynamics of macromolecules in explicit solvent.