Signal Processing Based Method for Real-Time Anomaly Detection in High-Performance Computing
Performance anomalies can manifest as irregular execution times or abnormal execution events for many reasons, including network congestion and resource contention. Detecting such anomalies in real time by analyzing the details of performance traces at scale is impractical due to the sheer volume of data that High-Performance Computing (HPC) applications produce. In this paper, we propose formulating HPC performance anomaly detection as a signal-processing problem in which anomalies are treated as noise. We evaluate the proposed method against two other commonly used anomaly detection techniques of varying complexity, comparing their detection accuracy and scalability. Because real-time, in-situ anomaly detection at large scale requires lightweight methods that can handle a large volume of streaming data, we find that the proposed method provides the best trade-off. We then implement the proposed method in Chimbuko, the first online, distributed, and scalable workflow-level performance trace analysis framework. We compare the proposed signal-based anomaly detection algorithm with the two other methods using a function of their accuracy, F1 score, and detection overhead. Our experiments demonstrate that the proposed approach achieves a 99% improvement on the benchmark datasets and a 93% improvement on Chimbuko traces.
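The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of treating anomalies as noise on a signal of execution times. It assumes a running-median baseline with a median-absolute-deviation (MAD) noise scale, which is a common lightweight streaming filter and not necessarily the method used in the paper. The class name `StreamingResidualDetector`, the parameters `window`, `threshold`, and `warmup`, and the sample data are all hypothetical. The F1 helper uses the standard definition of the F1 score.

```python
from collections import deque
from statistics import median


class StreamingResidualDetector:
    """Flag samples whose residual from a running median is unusually large.

    Hypothetical illustration: the median/MAD filter, window size, threshold,
    and warm-up length are assumptions, not details taken from the paper.
    """

    def __init__(self, window=32, threshold=5.0, warmup=8):
        self.history = deque(maxlen=window)  # most recent execution times
        self.threshold = threshold           # residual limit in noise-scale units
        self.warmup = warmup                 # samples required before flagging

    def update(self, value):
        """Return True if `value` looks anomalous, then add it to the window."""
        is_anomaly = False
        if len(self.history) >= self.warmup:
            baseline = median(self.history)
            # Median absolute deviation: a robust estimate of the noise scale.
            mad = median(abs(x - baseline) for x in self.history)
            scale = max(1.4826 * mad, 1e-9)  # avoid division by zero
            is_anomaly = abs(value - baseline) / scale > self.threshold
        self.history.append(value)
        return is_anomaly


def f1_score(predicted, actual):
    """Standard F1 score for two equal-length sequences of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up execution times (seconds) with one injected spike at index 8.
times = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 50.0, 10.0]
labels = [False] * 8 + [True, False]

detector = StreamingResidualDetector(window=32, threshold=5.0, warmup=8)
flags = [detector.update(t) for t in times]
score = f1_score(flags, labels)
```

Each update touches only a bounded window of recent samples, which is the property that makes filters of this kind plausible for in-situ use on streaming trace data. A production detector would need thresholds tuned per function and per workload.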
- Research Organization: Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: SC0012704
- OSTI ID: 2326918
- Report Number(s): BNL-225067-2023-COPA
- Resource Relation: Conference: 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), Torino, Italy, 6/26/2023 - 6/30/2023
- Country of Publication: United States
- Language: English