DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Abstract

As the size and complexity of HPC systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variation due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, and are among the main challenges in system resiliency. To minimize the impact of performance variation, one must quickly and accurately detect and diagnose the anomalies that cause the variation and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert the collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. Here, we evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98% of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
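
The pipeline outlined in the abstract (time series windows reduced to cheap statistical features, per-anomaly signatures learned from history, then a runtime lookup) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four features, the nearest-centroid matching used as a stand-in for the paper's machine learning models, the function names, the labels `healthy` and `memleak`, and the synthetic numbers. The feature-selection step mentioned in the abstract is omitted.

```python
from statistics import mean, pstdev


def extract_features(window):
    """Reduce one time-series window (list of floats) to simple statistics.

    The choice of mean, standard deviation, min, and max is an assumption;
    the abstract only says the features are easy to compute.
    """
    return [mean(window), pstdev(window), min(window), max(window)]


def build_signatures(labeled_windows):
    """Average the feature vectors of each label into one signature per label.

    Nearest-centroid signatures are a stand-in technique, not the paper's method.
    """
    sums, counts = {}, {}
    for label, window in labeled_windows:
        feats = extract_features(window)
        if label not in sums:
            sums[label] = [0.0] * len(feats)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feats)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def diagnose(window, signatures):
    """Return the label whose signature is closest to the window's features."""
    feats = extract_features(window)

    def squared_distance(signature):
        return sum((a - b) ** 2 for a, b in zip(feats, signature))

    return min(signatures, key=lambda label: squared_distance(signatures[label]))


# Synthetic, made-up history of resource usage windows (hypothetical labels).
history = [
    ("healthy", [10.0, 11.0, 10.0, 11.0]),
    ("healthy", [10.0, 10.0, 11.0, 11.0]),
    ("memleak", [10.0, 20.0, 30.0, 40.0]),
    ("memleak", [12.0, 22.0, 32.0, 42.0]),
]
signatures = build_signatures(history)

print(diagnose([11.0, 21.0, 31.0, 41.0], signatures))
print(diagnose([10.0, 11.0, 11.0, 10.0], signatures))
```

The runtime step only computes a handful of statistics and a few distances per window, which is the kind of low-overhead lookup the abstract alludes to; a real deployment would use the models and features described in the paper itself.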

Authors:
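 Tuncer, Ozan [1]; Ates, Emre [1]; Zhang, Yijia [1]; Turk, Ata [1]; Brandt, Jim M. [2]; Leung, Vitus J. [2]; Egele, Manuel [1]; Coskun, Ayse K. [1]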
  1. Boston Univ., Boston, MA (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
2018-09-14
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1474092
Report Number(s):
SAND-2018-10202J
Journal ID: ISSN 1045-9219; 667958
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 30; Journal Issue: 4; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; high performance computing; anomaly detection; machine learning; performance variation

Citation Formats

Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, and Coskun, Ayse K. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. United States: N. p., 2018. Web. doi:10.1109/TPDS.2018.2870403.
Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, & Coskun, Ayse K. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. United States. https://doi.org/10.1109/TPDS.2018.2870403
Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, and Coskun, Ayse K. 2018. "Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning". United States. https://doi.org/10.1109/TPDS.2018.2870403. https://www.osti.gov/servlets/purl/1474092.
@article{osti_1474092,
title = {Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning},
author = {Tuncer, Ozan and Ates, Emre and Zhang, Yijia and Turk, Ata and Brandt, Jim M. and Leung, Vitus J. and Egele, Manuel and Coskun, Ayse K.},
abstractNote = {As the size and complexity of HPC systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variation due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, and are among the main challenges in system resiliency. To minimize the impact of performance variation, one must quickly and accurately detect and diagnose the anomalies that cause the variation and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert the collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. Here, we evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98% of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.},
doi = {10.1109/TPDS.2018.2870403},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 4,
volume = 30,
place = {United States},
year = {2018},
month = {9}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 26 works
Citation information provided by
Web of Science
