Identifying the Root Causes of Wait States in Large-Scale Parallel Applications
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Forschungszentrum Julich (Germany). Julich Supercomputing Centre (JSC)
- RWTH Aachen Univ. (Germany)
- Technical Univ. of Darmstadt (Germany)
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. Ultimately, by replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances even for runs with hundreds of thousands of processes.
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE
- Grant/Contract Number:
- AC52-07NA27344; GSC 111; VH-NG-118
- OSTI ID:
- 1305834
- Report Number(s):
- LLNL-JRNL-663039
- Journal Information:
- ACM Transactions on Parallel Computing, Vol. 3, Issue 2; ISSN 2329-4949
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English