Parallelizing heavyweight debugging tools with mpiecho
Journal Article · Parallel Computing
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Univ. of Arizona, Tucson, AZ (United States)
- Google, Inc. (United States)
- Univ. of Colorado, Boulder, CO (United States)
Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to be scaled out to tens of millions of cores. To enable the development of more scalable debugging idioms, we introduce mpiecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior, and we provide a case study based on a recent processor bug at LLNL.
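The idea of splitting a heavyweight tool's work across identically executing clones can be illustrated with a minimal sketch. This is not mpiecho's interface, which this record does not describe. The function names `partition_checks` and `relative_overhead`, the round-robin assignment, and the assumption that instrumentation cost is proportional to the number of instrumented targets are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def partition_checks(targets: List[str], n_clones: int) -> Dict[int, List[str]]:
    """Round-robin the instrumentation targets across clone IDs 0..n_clones-1.

    Hypothetical helper: each clone would instrument only its own share,
    relying on every clone executing identically.
    """
    if n_clones < 1:
        raise ValueError("need at least one clone")
    plan: Dict[int, List[str]] = {c: [] for c in range(n_clones)}
    for i, target in enumerate(targets):
        plan[i % n_clones].append(target)
    return plan


def relative_overhead(n_targets: int, n_clones: int) -> float:
    """Largest per-clone share of the work, relative to one process doing it all.

    Assumes (for illustration only) that cost scales linearly with the
    number of targets a clone instruments.
    """
    if n_clones < 1:
        raise ValueError("need at least one clone")
    if n_targets == 0:
        return 0.0
    largest_share = -(-n_targets // n_clones)  # ceiling division
    return largest_share / n_targets


# Hypothetical workload: ten functions to watch, four clones per rank.
targets = [f"func_{i}" for i in range(10)]
plan = partition_checks(targets, n_clones=4)
```

Under these assumptions, every target is covered by exactly one clone, and the slowest clone carries only its own share of the instrumentation rather than the full serialized load.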
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1784617
- Alternate ID(s):
- OSTI ID: 1090002
- Report Number(s):
- LLNL-JRNL--604292; 699892
- Journal Information:
- Journal Name: Parallel Computing; Journal Volume: 39; Journal Issue: 3; ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English