U.S. Department of Energy
Office of Scientific and Technical Information

Parallelizing heavyweight debugging tools with mpiecho

Journal Article · Parallel Computing
 [1];  [1];  [1];  [1];  [2];  [3];  [4]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Univ. of Arizona, Tucson, AZ (United States)
  3. Google, Inc. (United States)
  4. Univ. of Colorado, Boulder, CO (United States)
Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to be scaled out to tens of millions of cores. In order to allow development of more scalable debugging idioms we introduce mpiecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior and provide a case study based on a recent processor bug at LLNL.
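
As a rough illustration of the idea in the abstract (not mpiecho's actual interface, which this record does not describe), the sketch below simulates clones in plain Python with no MPI involved. It assumes that every clone executes the same deterministic sequence of code regions and that a heavyweight check can be applied to one region at a time. The round-robin assignment and all names (assigned_clone, run_clone, merge_findings, divergent_clones) are hypothetical.

```python
from collections import Counter


def assigned_clone(region_index, num_clones):
    """Round-robin choice of which clone instruments a given region (an assumption)."""
    return region_index % num_clones


def run_clone(clone_id, num_clones, regions, heavy_check):
    """Walk every region, but apply the expensive check only to this clone's share."""
    findings = []
    checked = 0
    for index, region in enumerate(regions):
        if assigned_clone(index, num_clones) == clone_id:
            checked += 1
            result = heavy_check(region)
            if result is not None:
                findings.append((index, result))
    return findings, checked


def merge_findings(per_clone_findings):
    """Combine the partial reports from all clones into one ordered report."""
    return sorted(f for findings in per_clone_findings for f in findings)


def divergent_clones(outputs):
    """Return ids of clones whose output differs from the majority value."""
    majority, _ = Counter(outputs).most_common(1)[0]
    return [i for i, out in enumerate(outputs) if out != majority]


if __name__ == "__main__":
    regions = list(range(10))
    # Stand-in for a heavyweight tool: flags two regions as faulty.
    check = lambda r: "bad" if r in (3, 8) else None
    results = [run_clone(c, 4, regions, check) for c in range(4)]
    print(merge_findings([f for f, _ in results]))
    print([n for _, n in results])
    print(divergent_clones([42, 42, 41, 42]))
```

With K clones, each one performs roughly 1/K of the checks while the merged report still covers every region, which is the sense in which the overhead drops to a fraction of the serialized case. The last helper hints at the second use mentioned above: if clones are supposed to behave identically, a clone whose result disagrees with the majority points at the node that may be misbehaving.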
Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1784617
Alternate ID(s):
OSTI ID: 1090002
Report Number(s):
LLNL-JRNL--604292; 699892
Journal Information:
Journal Name: Parallel Computing; Journal Volume: 39; Journal Issue: 3; ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

References (6)

VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes · book · January 2009
Parallelisation of the Valgrind Dynamic Binary Instrumentation Framework · conference · December 2008
Valgrind: a framework for heavyweight dynamic binary instrumentation · conference · January 2007
PnMPI tools: a whole lot greater than the sum of their parts · conference · January 2007
An API for Runtime Code Patching · journal · November 2000
Redundant Execution of HPC Applications with MR-MPI · conference · January 2011
  • Engelmann, Christian; Böhm, Swen
  • Parallel and Distributed Computing and Networks / Software Engineering https://doi.org/10.2316/P.2011.719-031
