U.S. Department of Energy
Office of Scientific and Technical Information

Signal Processing Based Method for Real-Time Anomaly Detection in High-Performance Computing

Conference

Performance anomalies can manifest as irregular execution times or abnormal execution events for many reasons, including network congestion and resource contention. Detecting such anomalies in real time by analyzing the details of performance traces at scale is impractical because of the sheer volume of data that High-Performance Computing (HPC) applications produce. In this paper, we propose formulating HPC performance anomaly detection as a signal-processing problem in which anomalies are treated as noise. We evaluate the proposed method against two other commonly used anomaly detection techniques of varying complexity, comparing their detection accuracy and scalability. Because real-time, in-situ anomaly detection at large scale requires lightweight methods that can handle a large volume of streaming data, we find that the proposed method provides the best trade-off. We then implement the proposed method in Chimbuko, the first online, distributed, and scalable workflow-level performance trace analysis framework. The comparison of the signal-based algorithm with the two other methods uses a function of their accuracy, F1 score, and detection overhead. Our experiments demonstrate that the proposed approach achieves a 99% improvement for the benchmark datasets and a 93% improvement with Chimbuko traces.
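
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of treating anomalies as noise in a signal, not the authors' method. It assumes a one-dimensional series of function execution times, estimates a smooth baseline with a Fourier low-pass filter, and flags points whose residual is far from the baseline. The function name `detect_anomalies` and the parameters `keep_fraction` and `threshold` are hypothetical choices made for this example.

```python
import numpy as np


def detect_anomalies(values, keep_fraction=0.1, threshold=5.0):
    """Flag indices whose deviation from a low-pass baseline is unusually large.

    Illustrative sketch only: the filter and the scoring rule are assumptions,
    not the method described in the paper.
    """
    x = np.asarray(values, dtype=float)

    # Move to the frequency domain and keep only the lowest frequencies.
    spectrum = np.fft.rfft(x)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0.0

    # The inverse transform of the truncated spectrum is the smooth baseline.
    baseline = np.fft.irfft(spectrum, n=len(x))

    # Whatever the baseline cannot explain is treated as noise.
    residual = x - baseline

    # Robust scale estimate (median absolute deviation) so that the
    # anomalies themselves do not inflate the threshold.
    med = np.median(residual)
    mad = np.median(np.abs(residual - med))
    scale = 1.4826 * mad if mad > 0 else (np.std(residual) or 1.0)

    score = np.abs(residual - med) / scale
    return np.flatnonzero(score > threshold)
```

A batch FFT is used here for brevity. A streaming deployment would more likely apply the same idea over a sliding window, which is a design decision this sketch leaves open.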

Research Organization:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
SC0012704
OSTI ID:
2326918
Report Number(s):
BNL-225067-2023-COPA
Resource Relation:
Conference: 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), Torino, Italy, 6/26/2023 - 6/30/2023
Country of Publication:
United States
Language:
English
