U.S. Department of Energy
Office of Scientific and Technical Information

Probabilistic Communication and I/O Tracing with Deterministic Replay at Scale

Conference ·
OSTI ID:1026746

On today's petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information using crude averaging. This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. Scala-H-Trace uses histograms that express the probabilistic distribution of arbitrary communication and I/O parameters to capture variations. Where other tools fail to scale, Scala-H-Trace guarantees trace files of near-constant size, even for variable communication and I/O patterns, producing trace files orders of magnitude smaller than those of prior approaches. We demonstrate the ability to collect traces of applications running on thousands of processors, with the potential to scale well beyond this level. We further present the first approach to deterministically replay such probabilistic traces (a) without deadlocks and (b) in a manner closely resembling the original applications. Our results show either near-constant-size traces or only sub-linear increases in trace file size, irrespective of the number of nodes utilized. Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of the original codes. Such concise traces that closely resemble the behavior of production-style codes, and our approach of deterministic replay of probabilistic traces, are without precedent.
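The histogram idea in the abstract can be illustrated with a minimal sketch. This is not the Scala-H-Trace implementation: the class name `ParamHistogram`, the fixed bin count, the clamping of out-of-range values, and the use of a seeded random generator for repeatable replay are all assumptions made here for illustration. It shows only why storage can stay constant however many events are recorded, and how a fixed seed makes replay of a probabilistic summary repeatable. It does not model deadlock avoidance.

```python
import random
from bisect import bisect_right


class ParamHistogram:
    """Hypothetical fixed-size summary of one traced parameter (e.g. message size)."""

    def __init__(self, lo, hi, bins=8):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.width = (hi - lo) / bins
        self.counts = [0] * bins  # storage is independent of the event count

    def record(self, value):
        # Map the value to a bin; clamp out-of-range values to the edge bins.
        idx = int((value - self.lo) / self.width)
        self.counts[min(max(idx, 0), self.bins - 1)] += 1

    def replay(self, n, seed=0):
        # Draw n values from the recorded distribution. A fixed seed gives
        # the same sequence on every replay (an assumption of this sketch).
        total = sum(self.counts)
        if total == 0:
            raise ValueError("nothing recorded")
        cumulative, running = [], 0
        for count in self.counts:
            running += count
            cumulative.append(running)
        rng = random.Random(seed)
        values = []
        for _ in range(n):
            idx = bisect_right(cumulative, rng.randrange(total))
            values.append(self.lo + (idx + 0.5) * self.width)  # bin midpoint
        return values


# Usage: 10,000 recorded sizes are summarized by 8 counters.
hist = ParamHistogram(0, 1024, bins=8)
for i in range(10000):
    hist.record((i * 37) % 1024)
sizes = hist.replay(5, seed=42)
```

Replaying bin midpoints loses the exact recorded values, which is the trade-off a histogram makes for a size that does not grow with the number of events.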

Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
SC USDOE - Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1026746
Country of Publication:
United States
Language:
English

Similar Records

ScalaTrace: Scalable Compression and Replay of Communication Traces for High Performance Computing
Journal Article · May 16, 2008 · Journal of Parallel and Distributed Computing, vol. 69, no. 8, August 1, 2009, pp. 696-710 · OSTI ID:965094

Scalable I/O Tracing and Analysis
Conference · Dec 31, 2008 · OSTI ID:986831

Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs
Conference · Nov 6, 2024 · OSTI ID:2562112