DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A three-phase workflow for general and expressive representations of nondeterminism in HPC applications

Journal Article · International Journal of High Performance Computing Applications
[1]; [2]; [3]; [4]; [2]
  1. Univ. of Tennessee, Knoxville, TN (United States); Univ. of Delaware, Newark, DE (United States)
  2. Univ. of Tennessee, Knoxville, TN (United States)
  3. RIKEN Center for Computational Science, Tokyo (Japan)
  4. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Nondeterminism is an increasingly entrenched property of high-performance computing (HPC) applications and has recently been shown to seriously hamper debugging and reproducibility efforts. Tools for addressing the nondeterministic debugging problem have emerged, but they do not provide methods for systematically cataloging the nondeterminism in a given application. We propose a three-phase workflow that represents executions of nondeterministic Message Passing Interface (MPI) programs as event graphs, quantifies their structural similarity with graph kernels, and applies machine learning techniques to investigate shared properties across applications. We present an empirical study comparing the suitability of two graph kernels for this task and propose future uses of the methodology.
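
The first two phases described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the event labels (`recv`, `send:r1`, `send:r2`), the helper names, and the choice of a single round of Weisfeiler-Lehman-style label refinement as the graph kernel (the abstract does not name the two kernels it compares). The two toy executions differ only in the order in which rank 0 receives messages from ranks 1 and 2, as could happen with wildcard receives.

```python
from collections import Counter

def build_event_graph(events, messages):
    """Build a directed event graph (hypothetical helper).

    events:   dict mapping rank -> list of event labels in program order
    messages: list of ((send_rank, send_idx), (recv_rank, recv_idx)) pairs
    Nodes are (rank, index) tuples.
    """
    labels, succ, pred = {}, {}, {}
    for rank, seq in events.items():
        for i, label in enumerate(seq):
            labels[(rank, i)] = label
            succ[(rank, i)] = []
            pred[(rank, i)] = []

    def add_edge(u, v):
        succ[u].append(v)
        pred[v].append(u)

    for rank, seq in events.items():
        for i in range(len(seq) - 1):
            add_edge((rank, i), (rank, i + 1))  # program-order edge
    for send, recv in messages:
        add_edge(send, recv)                    # message edge
    return labels, succ, pred

def wl_features(graph, iterations=1):
    """Count node labels before and after each label-refinement round."""
    labels, succ, pred = graph
    current = dict(labels)
    feats = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node in current:
            ins = ",".join(sorted(current[p] for p in pred[node]))
            outs = ",".join(sorted(current[s] for s in succ[node]))
            refined[node] = f"{current[node]}|in:{ins}|out:{outs}"
        current = refined
        feats.update(current.values())
    return feats

def kernel(f1, f2):
    """Dot product of two label-count feature vectors."""
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())

def normalized_similarity(g1, g2, iterations=1):
    """Kernel value scaled so that a graph compared with itself scores 1."""
    f1 = wl_features(g1, iterations)
    f2 = wl_features(g2, iterations)
    return kernel(f1, f2) / (kernel(f1, f1) * kernel(f2, f2)) ** 0.5

# Same program, two runs: rank 0 posts two receives, ranks 1 and 2 each send once.
events = {0: ["recv", "recv"], 1: ["send:r1"], 2: ["send:r2"]}
run_a = build_event_graph(events, [((1, 0), (0, 0)), ((2, 0), (0, 1))])
run_b = build_event_graph(events, [((1, 0), (0, 1)), ((2, 0), (0, 0))])

print(normalized_similarity(run_a, run_a))
print(normalized_similarity(run_a, run_b))
```

In this sketch, a run compared with itself scores 1, while the two runs with swapped message arrival order score lower, so the kernel value acts as a rough measure of how much the executions diverge. A matrix of such pairwise similarities over many runs is the kind of input a third, machine learning phase could consume; that phase is not sketched here.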

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1809185
Report Number(s):
LLNL-JRNL-819205; 1030102
Journal Information:
International Journal of High Performance Computing Applications, Vol. 33, Issue 6; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
