U.S. Department of Energy
Office of Scientific and Technical Information

Graph neural networks for detecting anomalies in scientific workflows

Journal Article · International Journal of High Performance Computing Applications
 [1];  [1];  [2];  [3];  [3];  [4];  [2];  [5]
  1. Argonne National Laboratory, Lemont, IL, USA
  2. University of Southern California, Los Angeles, CA, USA
  3. Renaissance Computing Institute (RENCI), Chapel Hill, NC, USA
  4. Energy Sciences Network (ESnet), Berkeley, CA, USA
  5. Oak Ridge National Laboratory, Oak Ridge, TN, USA

Identifying and addressing anomalies in complex, distributed systems is a challenge for the reliable execution of scientific workflows. We model these workflows as directed acyclic graphs (DAGs), where the nodes and edges of the DAGs represent jobs and their dependencies, respectively. We develop graph neural networks (GNNs) to learn patterns in the DAGs and to detect anomalies at the node (job) and graph (workflow) levels. We investigate workflow-specific GNN models that are trained on a particular workflow and workflow-agnostic GNN models that are trained across multiple workflows. Our GNN models, which incorporate both individual job features and topological information from the workflow, show improved accuracy and efficiency compared to conventional learning methods for detecting anomalies. When jointly trained on multiple scientific workflows, our GNN models reached an accuracy of more than 80% for workflow-level and 75% for job-level anomalies. In addition, we illustrate the importance of hyperparameter tuning in our study, which can significantly improve the metrics used to evaluate the GNN models. Finally, we integrate explainable GNN methods to provide insights into the job features in the workflow that cause an anomaly.
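
The DAG representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the job names, the two features (runtime and queue delay), and the helper functions below are all hypothetical, and the aggregation uses plain averaging with no learned weights. It only shows how job features and workflow topology can be combined into node-level (job) and graph-level (workflow) representations, which a trained GNN would then pass to a classifier.

```python
from typing import Dict, List, Tuple

Features = Dict[str, List[float]]


def build_neighbors(jobs: List[str], edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each job to its parents and children in the workflow DAG."""
    neighbors: Dict[str, List[str]] = {job: [] for job in jobs}
    for parent, child in edges:
        neighbors[parent].append(child)
        neighbors[child].append(parent)
    return neighbors


def message_pass(features: Features, neighbors: Dict[str, List[str]]) -> Features:
    """One round of mean aggregation over each job and its neighbors.

    A real GNN layer would also apply learned weights and a nonlinearity.
    """
    updated: Features = {}
    for job, own in features.items():
        group = [own] + [features[n] for n in neighbors[job]]
        updated[job] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def graph_readout(features: Features) -> List[float]:
    """Mean-pool all job embeddings into one workflow-level vector."""
    rows = list(features.values())
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical four-job workflow: one stage-in job fans out to two
# processing jobs, whose outputs are merged by a final job.
jobs = ["stage_in", "process_a", "process_b", "merge"]
edges = [
    ("stage_in", "process_a"),
    ("stage_in", "process_b"),
    ("process_a", "merge"),
    ("process_b", "merge"),
]

# Hypothetical per-job features: [runtime_seconds, queue_delay_seconds].
job_features: Features = {
    "stage_in": [10.0, 1.0],
    "process_a": [20.0, 1.0],
    "process_b": [80.0, 5.0],  # unusually slow job
    "merge": [10.0, 1.0],
}

neighbors = build_neighbors(jobs, edges)
node_embeddings = message_pass(job_features, neighbors)  # job-level view
workflow_embedding = graph_readout(node_embeddings)      # workflow-level view
```

In this toy setup, the slow `process_b` job also raises the aggregated values of its neighbors, which shows how topology lets an anomaly in one job influence the representation of dependent jobs and of the workflow as a whole.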

Sponsoring Organization:
USDOE
Grant/Contract Number:
SC0022328
OSTI ID:
1975863
Alternate ID(s):
OSTI ID: 2404548
Journal Information:
International Journal of High Performance Computing Applications, Vol. 37, Issue 3-4; ISSN 1094-3420
Publisher:
SAGE Publications
Country of Publication:
United States
Language:
English
