U.S. Department of Energy
Office of Scientific and Technical Information

Checkpoint/restart-enabled parallel debugging

Conference ·
 [1];  [2];  [2];  [3];  [2];  [4];  [1]
  1. Indiana Univ., Bloomington, IN (United States)
  2. Allinea Software Ltd., Warwick (United Kingdom)
  3. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  4. Cisco Systems, Inc., San Jose, CA (United States)

Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process by adding more processes that interact in dynamic ways over longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process and saves developers considerable time, but it requires parallel debuggers to cooperate with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship. Additionally, this paper discusses the application of this design to the GDB and DDT debuggers, Open MPI, and BLCR projects. © 2010 Springer-Verlag.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC02-05CH11231
OSTI ID:
1407087
Country of Publication:
United States
Language:
English

Similar Records

Berkeley Lab Checkpoint/Restart for Linux
Software · 2003 · OSTI ID:code-54577

Affinity-aware checkpoint restart
Journal Article · 2014 · ACM Digital Library · OSTI ID:1342535

The design and implementation of Berkeley Lab's Linux checkpoint/restart
Technical Report · 2005 · OSTI ID:891617
