Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications
Abstract
Online analysis of runtime behavior is essential for performance tuning in streaming scientific workflows. Integration of anomaly detection and visualization is necessary to support human-centered analysis, such as verification of candidate anomalies using domain knowledge. In this work, we propose an efficient and scalable visual analytics system for online performance analysis of scientific workflows toward the exascale scenario. Our approach uses a call stack tree representation to encode the structural and temporal information of the function executions. Based on the call stack tree features (e.g., execution time of the root function or vector representation of the tree structure), we employ online anomaly detection approaches to identify candidate anomalous function executions. We also present a set of visualization tools for verification and exploration in a level-of-detail manner. General information, such as the distribution of execution times, is provided in an overview visualization. The detailed structure (e.g., function invocation relations) and the temporal information (e.g., message communication) of the execution call stack of interest are also visualized. The usability and efficiency of our methods are verified in a real-world HPC application.
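The abstract names the execution time of a call stack tree's root function as one feature for online anomaly detection. The following is a minimal sketch of that idea, not the authors' method: it assumes a simple streaming z-score test built on Welford's running mean and variance, and the names `RunningStats`, `StreamingDetector`, `threshold`, and `warmup` are hypothetical, introduced only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance for one function's execution times."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; zero until two values have been seen.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


class StreamingDetector:
    """Flags candidate anomalies per root function (assumed z-score rule)."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5):
        self.threshold = threshold  # hypothetical cutoff in standard deviations
        self.warmup = warmup        # observations required before flagging
        self.stats = {}             # root function name -> RunningStats

    def observe(self, function: str, exec_time: float) -> bool:
        s = self.stats.setdefault(function, RunningStats())
        is_candidate = False
        if s.count >= self.warmup and s.std > 0:
            is_candidate = abs(exec_time - s.mean) > self.threshold * s.std
        # The value is folded into the statistics whether or not it was flagged.
        s.update(exec_time)
        return is_candidate


if __name__ == "__main__":
    detector = StreamingDetector()
    for t in [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0]:
        print(t, detector.observe("root_fn", t))
```

Each execution is tested against statistics gathered from earlier executions only, so the detector needs a single pass and constant memory per function. A flagged execution is only a candidate; in the workflow the abstract describes, verification is left to the analyst through the visualization tools.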
- Authors:
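- Xie, Cong; Jeong, Wonyong; Matyasfalvi, Gyorgy; Van Dam, Hubertus; Mueller, Klaus; Yoo, Shinjae; Xu, Wei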
- Stony Brook Univ., NY (United States)
- Brookhaven National Lab. (BNL), Upton, NY (United States)
- Stony Brook Univ., NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)
- Publication Date:
- 2019-06-08
- Research Org.:
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research
- OSTI Identifier:
- 1560000
- Report Number(s):
- BNL-212038-2019-JAAM; Journal ID: ISSN 0302-9743
- Grant/Contract Number:
- SC0012704
- Resource Type:
- Journal Article: Accepted Manuscript
- Journal Name:
- Lecture Notes in Computer Science
- Additional Journal Information:
- Journal Volume: LNCS 11536; Conference: International Conference on Computational Science (ICCS 2019), Faro (Portugal), 12-14 Jun 2019; Related Information: Lecture Notes in Computer Science book series (LNCS, volume 11536); Journal ID: ISSN 0302-9743
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Anomaly Detection; High Performance Computing; Streaming Analysis; Trace Events; Visual Analytics
Citation Formats
Xie, Cong, Jeong, Wonyong, Matyasfalvi, Gyorgy, Van Dam, Hubertus, Mueller, Klaus, Yoo, Shinjae, and Xu, Wei. Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications. United States: N. p., 2019. Web. doi:10.1007/978-3-030-22734-0_12.
Xie, Cong, Jeong, Wonyong, Matyasfalvi, Gyorgy, Van Dam, Hubertus, Mueller, Klaus, Yoo, Shinjae, & Xu, Wei. Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications. United States. https://doi.org/10.1007/978-3-030-22734-0_12
Xie, Cong, Jeong, Wonyong, Matyasfalvi, Gyorgy, Van Dam, Hubertus, Mueller, Klaus, Yoo, Shinjae, and Xu, Wei. 2019. "Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications". United States. https://doi.org/10.1007/978-3-030-22734-0_12. https://www.osti.gov/servlets/purl/1560000.
@article{osti_1560000,
title = {Exploratory Visual Analysis of Anomalous Runtime Behavior in Streaming High Performance Computing Applications},
author = {Xie, Cong and Jeong, Wonyong and Matyasfalvi, Gyorgy and Van Dam, Hubertus and Mueller, Klaus and Yoo, Shinjae and Xu, Wei},
abstractNote = {Online analysis of runtime behavior is essential for performance tuning in streaming scientific workflows. Integration of anomaly detection and visualization is necessary to support human-centered analysis, such as verification of candidate anomalies using domain knowledge. In this work, we propose an efficient and scalable visual analytics system for online performance analysis of scientific workflows toward the exascale scenario. Our approach uses a call stack tree representation to encode the structural and temporal information of the function executions. Based on the call stack tree features (e.g., execution time of the root function or vector representation of the tree structure), we employ online anomaly detection approaches to identify candidate anomalous function executions. We also present a set of visualization tools for verification and exploration in a level-of-detail manner. General information, such as the distribution of execution times, is provided in an overview visualization. The detailed structure (e.g., function invocation relations) and the temporal information (e.g., message communication) of the execution call stack of interest are also visualized. The usability and efficiency of our methods are verified in a real-world HPC application.},
doi = {10.1007/978-3-030-22734-0_12},
url = {https://www.osti.gov/biblio/1560000},
journal = {Lecture Notes in Computer Science},
issn = {0302-9743},
volume = {LNCS 11536},
place = {United States},
year = {2019},
month = {6}
}