Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications
- Stony Brook Univ., NY (United States)
- Brookhaven National Lab. (BNL), Upton, NY (United States)
- Stony Brook Univ., NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)
Online analysis of runtime behavior is essential for performance tuning in streaming scientific workflows. Anomaly detection must be integrated with visualization to support human-centered analysis, such as verifying candidate anomalies with domain knowledge. In this work, we propose an efficient and scalable visual analytics system for online performance analysis of scientific workflows, targeting exascale scenarios. Our approach uses a call stack tree representation to encode the structural and temporal information of function executions. Based on call stack tree features (e.g., the execution time of the root function or a vector representation of the tree structure), we employ online anomaly detection approaches to identify candidate anomalous function executions. We also present a set of visualization tools for verification and exploration in a level-of-detail manner. General information, such as the distribution of execution times, is provided in an overview visualization. The detailed structure (e.g., function invocation relations) and the temporal information (e.g., message communication) of the execution call stack of interest are also visualized. The usability and efficiency of our methods are verified in a real-world HPC application.
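The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the event format `(kind, function, timestamp)`, the names `CallNode`, `build_trees`, `RunningStats`, and `flag_anomalies`, and the 3-sigma threshold with a warm-up period are all assumptions made for illustration. The sketch builds call stack trees from a stream of entry/exit events and flags root executions whose runtime deviates strongly from a running mean (Welford's online algorithm); the vector representation of the tree structure mentioned in the abstract is not covered.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Optional, Tuple


@dataclass
class CallNode:
    """One function execution in a call stack tree (hypothetical structure)."""
    name: str
    entry: float
    exit: Optional[float] = None
    children: List["CallNode"] = field(default_factory=list)

    @property
    def runtime(self) -> float:
        return self.exit - self.entry


def build_trees(events: List[Tuple[str, str, float]]) -> List[CallNode]:
    """Turn a stream of ("entry" | "exit", function, timestamp) events
    into completed call stack trees, returned in completion order."""
    stack: List[CallNode] = []
    roots: List[CallNode] = []
    for kind, name, ts in events:
        if kind == "entry":
            node = CallNode(name, ts)
            if stack:  # nested call: attach to the current caller
                stack[-1].children.append(node)
            stack.append(node)
        else:  # "exit" closes the innermost open call
            node = stack.pop()
            node.exit = ts
            if not stack:  # the stack is empty again, so a tree is complete
                roots.append(node)
    return roots


class RunningStats:
    """Welford's online mean and sample standard deviation."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_anomalies(roots: List[CallNode], threshold: float = 3.0,
                   warmup: int = 5) -> List[CallNode]:
    """Flag root executions whose runtime lies more than `threshold`
    standard deviations from the running mean of the same function.
    The threshold and warm-up length are illustrative assumptions."""
    stats: Dict[str, RunningStats] = {}
    flagged: List[CallNode] = []
    for root in roots:
        s = stats.setdefault(root.name, RunningStats())
        if s.n >= warmup and s.std > 0:
            if abs(root.runtime - s.mean) > threshold * s.std:
                flagged.append(root)
        s.update(root.runtime)  # statistics are updated one sample at a time
    return flagged
```

In this sketch each candidate keeps its full subtree, so a flagged root can be handed to a detail view that shows its function invocation relations. Note that flagged runtimes are still folded into the running statistics; a real system might exclude or down-weight them.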
- Research Organization:
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research
- Grant/Contract Number:
- SC0012704
- OSTI ID:
- 1560000
- Report Number(s):
- BNL-212038-2019-JAAM
- Journal Information:
- Lecture Notes in Computer Science, Vol. LNCS 11536; Conference: International Conference on Computational Science (ICCS 2019), Faro (Portugal), 12-14 Jun 2019; Related Information: Lecture Notes in Computer Science book series (LNCS, volume 11536); ISSN 0302-9743
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
Similar Records
PRIMA-X - Performance Retargeting of Instrumentation, Measurement, and Analysis Technologies for Exascale Computing
RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources