OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Debugging high-performance computing applications at massive scales

Abstract

In this work, dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.

Authors:
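Laguna, Ignacio [1]; Ahn, Dong H. [1]; de Supinski, Bronis R. [1]; Gamblin, Todd [1]; Lee, Gregory L. [1]; Schulz, Martin [1]; Bagchi, Saurabh [2]; Kulkarni, Milind [2]; Zhou, Bowen [2]; Chen, Zhezhe [3]; Qin, Feng [3]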
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Purdue Univ., West Lafayette, IN (United States)
  3. The Ohio State Univ., Columbus, OH (United States)
Publication Date:
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
OSTI Identifier:
1769101
Report Number(s):
LLNL-JRNL-652400
Journal ID: ISSN 0001-0782; 772773
Grant/Contract Number:  
AC52-07NA27344; CNS-0916337; CCF-1337158; CCF-0953759; CNS-0403342
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Communications of the ACM
Additional Journal Information:
Journal Volume: 58; Journal Issue: 9; Journal ID: ISSN 0001-0782
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; security and privacy; systems security; operating systems security; software; software verification and validation; software defect analysis; software testing and debugging

Citation Formats

Laguna, Ignacio, Ahn, Dong H., de Supinski, Bronis R., Gamblin, Todd, Lee, Gregory L., Schulz, Martin, Bagchi, Saurabh, Kulkarni, Milind, Zhou, Bowen, Chen, Zhezhe, and Qin, Feng. Debugging high-performance computing applications at massive scales. United States: N. p., 2015. Web. doi:10.1145/2667219.
Laguna, Ignacio, Ahn, Dong H., de Supinski, Bronis R., Gamblin, Todd, Lee, Gregory L., Schulz, Martin, Bagchi, Saurabh, Kulkarni, Milind, Zhou, Bowen, Chen, Zhezhe, & Qin, Feng. Debugging high-performance computing applications at massive scales. United States. https://doi.org/10.1145/2667219
Laguna, Ignacio, Ahn, Dong H., de Supinski, Bronis R., Gamblin, Todd, Lee, Gregory L., Schulz, Martin, Bagchi, Saurabh, Kulkarni, Milind, Zhou, Bowen, Chen, Zhezhe, and Qin, Feng. 2015. "Debugging high-performance computing applications at massive scales". United States. https://doi.org/10.1145/2667219. https://www.osti.gov/servlets/purl/1769101.
@article{osti_1769101,
title = {Debugging high-performance computing applications at massive scales},
author = {Laguna, Ignacio and Ahn, Dong H. and de Supinski, Bronis R. and Gamblin, Todd and Lee, Gregory L. and Schulz, Martin and Bagchi, Saurabh and Kulkarni, Milind and Zhou, Bowen and Chen, Zhezhe and Qin, Feng},
abstractNote = {In this work, dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.},
doi = {10.1145/2667219},
url = {https://www.osti.gov/biblio/1769101},
journal = {Communications of the ACM},
issn = {0001-0782},
number = 9,
volume = 58,
place = {United States},
year = {2015},
month = {8}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Works referenced in this record:

ISP: a tool for model checking MPI programs
conference, January 2008

  • Vakkalanka, Sarvani S.; Sharma, Subodh; Gopalakrishnan, Ganesh
  • Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming - PPoPP '08
  • https://doi.org/10.1145/1345206.1345258

Formal analysis of MPI-based parallel programs
journal, December 2011


FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
conference, November 2010

  • Chen, Zhezhe; Gao, Qi; Zhang, Wenbin
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis - SC '10
  • https://doi.org/10.1109/SC.2010.27

Probabilistic diagnosis of performance faults in large-scale parallel applications
conference, January 2012

  • Laguna, Ignacio; Ahn, Dong H.; de Supinski, Bronis R.
  • Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT '12
  • https://doi.org/10.1145/2370816.2370848

A scalable debugger for massively parallel message-passing programs
journal, July 1994


Overcoming Scalability Challenges for Tool Daemon Launching
conference, September 2008


Scalable Relative Debugging
journal, March 2014


Debugging in the (very) large: ten years of implementation and experience
journal, July 2011


Stack Trace Analysis for Large Scale Debugging
conference, March 2007


Making parallel programs reliable with stable multithreading
journal, March 2014


Large scale debugging of parallel tasks with AutomaDeD
conference, January 2011

  • Laguna, Ignacio; Gamblin, Todd; de Supinski, Bronis R.
  • Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
  • https://doi.org/10.1145/2063384.2063451

Symbolic execution for software testing: three decades later
journal, February 2013


WuKong: automatically detecting and localizing bugs that manifest at large system scales
conference, January 2013

  • Zhou, Bowen; Too, Jonathan; Kulkarni, Milind
  • Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13
  • https://doi.org/10.1145/2493123.2462907

DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
conference, January 2007


Clustering performance data efficiently at massive scales
conference, January 2010


Vrisha: using scaling properties of parallel programs for bug detection and localization
conference, January 2011


A high-performance, portable implementation of the MPI message passing interface standard
journal, September 1996


Accurate application progress analysis for large-scale parallel debugging
conference, June 2014

  • Mitra, Subrata; Laguna, Ignacio; Ahn, Dong H.
  • Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI '14
  • https://doi.org/10.1145/2594291.2594336

AutomaDeD: Automata-based debugging for dissimilar parallel tasks
conference, June 2010


MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools
conference, January 2003


Works referencing / citing this record:

Dyninst and MRNet: Foundational Infrastructure for Parallel Tools
book, January 2016


Efficient noise injection for exposing hidden data races
journal, October 2019