OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference

Journal Article · IEEE Transactions on Parallel and Distributed Systems
 [1];  [1];  [1];  [2];  [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Purdue Univ., West Lafayette, IN (United States)

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about the root causes of failures. Moreover, most debuggers significantly slow down program execution and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents the tasks' control-flow behavior. Compared to previous work, our algorithm infers the least-progressed task more precisely. We also combine this technique with static backward slicing analysis to further isolate the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation that manifests itself only at 7,996 tasks or more. We extensively evaluate the fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
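
The idea of comparing task progress on a Markov model of control flow can be illustrated with a small sketch. This is not the algorithm from the article: the state names, the transition probabilities, the `reach_probability` and `least_progressed` helpers, and the rule "state a is ahead of state b when b reaches a with higher probability than a reaches b" are all assumptions made for illustration, and the model is assumed to be acyclic so that the recursion terminates.

```python
from functools import lru_cache

# Hypothetical Markov model over control-flow states (for example, MPI call sites).
# TRANSITIONS[s] maps each successor of state s to its transition probability.
# Assumption: the model is acyclic, so the recursion below always terminates.
TRANSITIONS = {
    "init": {"compute": 1.0},
    "compute": {"send": 0.7, "recv": 0.3},
    "send": {"barrier": 1.0},
    "recv": {"barrier": 1.0},
    "barrier": {"finalize": 1.0},
    "finalize": {},
}


@lru_cache(maxsize=None)
def reach_probability(src, dst):
    """Probability that a task in state src eventually visits state dst."""
    if src == dst:
        return 1.0
    return sum(p * reach_probability(nxt, dst)
               for nxt, p in TRANSITIONS[src].items())


def is_ahead(a, b):
    """Illustrative rule: a is ahead of b if b reaches a more surely than a reaches b."""
    return reach_probability(b, a) > reach_probability(a, b)


def least_progressed(task_states):
    """Return the ids of tasks whose state is not ahead of any other task's state."""
    states = list(task_states.values())
    return sorted(
        task for task, state in task_states.items()
        if not any(is_ahead(state, other) for other in states)
    )


# Hypothetical snapshot of a stalled job: task id -> current state.
snapshot = {0: "barrier", 1: "barrier", 2: "compute", 3: "send"}
print(least_progressed(snapshot))
```

In this hypothetical snapshot, tasks 0, 1, and 3 sit in states that are reachable from the state of task 2 but not the other way around, so the sketch reports task 2 as the least-progressed task. A real model built from MPI traces would contain loops, which would require solving for reachability probabilities rather than using plain recursion.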

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
Grant/Contract Number:
AC52-07NA27344; CNS-0916337
OSTI ID:
1769172
Report Number(s):
LLNL-JRNL-643939; 763372
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 26, Issue 5; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
