OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks

Abstract

Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region at a specific point during program execution. AutomaDeD creates a statistical model of the application's control-flow and timing behavior that organizes tasks into groups and identifies deviations from normal execution, thus significantly reducing debugging effort. In addition to a case study in which AutomaDeD locates a bug that occurred during development of MVAPICH, we evaluate AutomaDeD on a range of bugs injected into the NAS parallel benchmarks. Our results demonstrate that AutomaDeD detects the time period when a bug first manifested itself with 90% accuracy for stalls and hangs and 70% accuracy for interference faults. It identifies the subset of processes first affected by the fault with 80% accuracy and 70% accuracy, respectively, and the code region where the fault first manifested with 90% and 50% accuracy, respectively.

Authors:
Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D; Schulz, M
Publication Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1010829
Report Number(s):
LLNL-CONF-426270
TRN: US201108%%533
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Presented at: International Conference on Dependable Systems and Networks, Chicago, IL, United States, Jun 28 - Jul 01, 2010
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ACCURACY; BENCHMARKS; STATISTICAL MODELS

Citation Formats

Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, and Schulz, M. AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks. United States: N. p., 2010. Web.
Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, & Schulz, M. AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks. United States.
Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, and Schulz, M. Tue . "AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks". United States. https://www.osti.gov/servlets/purl/1010829.
@article{osti_1010829,
title = {AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks},
author = {Bronevetsky, G and Laguna, I and Bagchi, S and de Supinski, B R and Ahn, D and Schulz, M},
abstractNote = {Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region at a specific point during program execution. AutomaDeD creates a statistical model of the application's control-flow and timing behavior that organizes tasks into groups and identifies deviations from normal execution, thus significantly reducing debugging effort. In addition to a case study in which AutomaDeD locates a bug that occurred during development of MVAPICH, we evaluate AutomaDeD on a range of bugs injected into the NAS parallel benchmarks. Our results demonstrate that AutomaDeD detects the time period when a bug first manifested itself with 90% accuracy for stalls and hangs and 70% accuracy for interference faults. It identifies the subset of processes first affected by the fault with 80% accuracy and 70% accuracy, respectively, and the code region where the fault first manifested with 90% and 50% accuracy, respectively.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = {3}
}
