Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, Jim M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

doi:10.1109/TPDS.2018.2870403

Journal Article · Fri Sep 14 00:00:00 EDT 2018 · IEEE Transactions on Parallel and Distributed Systems

DOI: https://doi.org/10.1109/TPDS.2018.2870403 · OSTI ID: 1474092

Tuncer, Ozan ^[1]; Ates, Emre ^[1]; Zhang, Yijia ^[1]; Turk, Ata ^[1]; Brandt, Jim M. ^[2]; Leung, Vitus J. ^[2]; Egele, Manuel ^[1]; Coskun, Ayse K. ^[1]

^[1] Boston Univ., Boston, MA (United States)
^[2] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

As the size and complexity of HPC systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variation due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, and are among the main challenges in system resiliency. To minimize the impact of performance variation, one must quickly and accurately detect and diagnose the anomalies that cause the variation and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert the collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98% of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
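
The pipeline described in the abstract (time series to statistical features, signatures from labeled history, runtime diagnosis) can be illustrated with a minimal sketch. This is not the authors' implementation: this record does not say which features or which learning model the paper uses, so the four statistics, the nearest-signature rule, the label names, and the sample values below are all assumptions chosen for illustration. The feature-selection step is omitted.

```python
import statistics


def window_features(samples):
    """Summarize one window of a metric's time series as simple statistics.

    The choice of min, max, mean, and standard deviation is an assumption.
    """
    return [
        min(samples),
        max(samples),
        statistics.fmean(samples),
        statistics.pstdev(samples),
    ]


def extract_signatures(labeled_windows):
    """Average the feature vectors of each label into one signature per label."""
    grouped = {}
    for label, samples in labeled_windows:
        grouped.setdefault(label, []).append(window_features(samples))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def diagnose(signatures, samples):
    """Return the label whose signature is closest to the window's features.

    Nearest-signature matching is a stand-in for a trained classifier.
    """
    features = window_features(samples)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, signatures[label]))

    return min(signatures, key=squared_distance)


# Hypothetical history: (label, CPU utilization samples for one window).
history = [
    ("healthy", [0.50, 0.52, 0.49, 0.51]),
    ("healthy", [0.48, 0.50, 0.53, 0.50]),
    ("cpu_contention", [0.95, 0.97, 0.99, 0.96]),
    ("cpu_contention", [0.93, 0.98, 0.97, 0.99]),
]
signatures = extract_signatures(history)
print(diagnose(signatures, [0.94, 0.96, 0.98, 0.97]))
```

Because the signatures are computed offline from history, the runtime step in this sketch reduces to one feature computation and one distance comparison per label, which is one way to read the abstract's claim of negligible overhead.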

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories, Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC04-94AL85000

OSTI ID: 1474092

Report Number(s): SAND--2018-10202J; 667958

Journal Information: IEEE Transactions on Parallel and Distributed Systems, Vol. 30, Issue 4; ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Similar Records

Bridging paradigms: Designing for HPC-Quantum convergence

Journal Article · Fri Jun 27 20:00:00 EDT 2025 · Future Generations Computer Systems · OSTI ID:2573328

Spatio-Temporal Analysis of HPC I/O and Connection Data

Conference · Sun Jul 01 00:00:00 EDT 2018 · OSTI ID:1544245

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Conference · Fri Sep 01 00:00:00 EDT 2017 · OSTI ID:1460236

Related Subjects

97 MATHEMATICS AND COMPUTING
anomaly detection
high performance computing
machine learning
performance variation
