Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, Jim M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

doi:10.1109/TPDS.2018.2870403

Abstract

As the size and complexity of HPC systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variation due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, and are among the main challenges in system resiliency. To minimize the impact of performance variation, one must quickly and accurately detect and diagnose the anomalies that cause the variation and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert the collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. Here, we evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98% of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
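
The pipeline outlined in the abstract (time series to statistical features, features to per-anomaly signatures, signatures to runtime diagnosis) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the features, the feature-selection method, or the classifier, so the four statistics, the nearest-centroid "signature", the label names, and the sample data below are all assumptions chosen for brevity, and the feature-selection step is omitted.

```python
import statistics

def extract_features(series):
    """Summarize one time-series window as a few cheap statistics (assumed set)."""
    return [
        statistics.fmean(series),
        statistics.pstdev(series),
        min(series),
        max(series),
    ]

def build_signatures(labeled_windows):
    """Average the feature vectors of each label into one centroid per label.

    The centroid stands in for an anomaly "signature"; it is a simplification,
    not the method used in the paper.
    """
    grouped = {}
    for label, series in labeled_windows:
        grouped.setdefault(label, []).append(extract_features(series))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def diagnose(series, signatures):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    features = extract_features(series)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, signatures[label]))

    return min(signatures, key=distance)

# Hypothetical historical data: memory-usage windows in arbitrary units.
history = [
    ("healthy", [10, 11, 10, 12, 11]),
    ("healthy", [9, 10, 11, 10, 10]),
    ("memleak", [10, 20, 30, 40, 50]),
    ("memleak", [12, 22, 33, 41, 52]),
]
signatures = build_signatures(history)
result = diagnose([11, 21, 29, 42, 49], signatures)
```

In this toy setup the steadily growing window is matched to the hypothetical `memleak` label. The signatures are built once from historical data, so the runtime step reduces to one feature computation and a handful of distance comparisons.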

Authors:

Tuncer, Ozan ^[1]; Ates, Emre ^[1]; Zhang, Yijia ^[1]; Turk, Ata ^[1]; Brandt, Jim M. ^[2]; Leung, Vitus J. ^[2]; Egele, Manuel ^[1]; Coskun, Ayse K. ^[1]

[1] Boston Univ., Boston, MA (United States)
[2] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Publication Date: 2018-09-14

Research Org.: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1474092

Report Number(s): SAND-2018-10202J
Journal ID: ISSN 1045-9219; 667958

Grant/Contract Number: AC04-94AL85000

Resource Type: Accepted Manuscript

Journal Name: IEEE Transactions on Parallel and Distributed Systems

Additional Journal Information: Journal Volume: 30; Journal Issue: 4; Journal ID: ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; high performance computing; anomaly detection; machine learning; performance variation

Citation Formats


Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, and Coskun, Ayse K. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. United States: N. p., 2018. Web. doi:10.1109/TPDS.2018.2870403.


Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, & Coskun, Ayse K. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. United States. https://doi.org/10.1109/TPDS.2018.2870403


Tuncer, Ozan, Ates, Emre, Zhang, Yijia, Turk, Ata, Brandt, Jim M., Leung, Vitus J., Egele, Manuel, and Coskun, Ayse K. 2018. "Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning". United States. https://doi.org/10.1109/TPDS.2018.2870403. https://www.osti.gov/servlets/purl/1474092.


                    
@article{osti_1474092,

  title        = {Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning},

  author       = {Tuncer, Ozan and Ates, Emre and Zhang, Yijia and Turk, Ata and Brandt, Jim M. and Leung, Vitus J. and Egele, Manuel and Coskun, Ayse K.},

  abstractNote = {As the size and complexity of HPC systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variation due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, and are among the main challenges in system resiliency. To minimize the impact of performance variation, one must quickly and accurately detect and diagnose the anomalies that cause the variation and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert the collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. Here, we evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98% of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.},

  doi          = {10.1109/TPDS.2018.2870403},

  journal      = {IEEE Transactions on Parallel and Distributed Systems},

  number       = 4,

  volume       = 30,

  place        = {United States},

  year         = {2018},

  month        = {9}

}


Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1109/TPDS.2018.2870403

Citation Metrics:

Cited by: 26 works (citation information provided by Web of Science)

Similar Records in DOE PAGES and OSTI.GOV collections:

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Conference: Park, Byung; Hukerikar, Saurabh; Adamson, Ryan; ...

Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about the system health, root causes of failures, and analyze an application's interactions with the system, providing valuable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, making it ...
https://doi.org/10.1109/CLUSTER.2017.113

Full Text Available

Spatio-Temporal Analysis of HPC I/O and Connection Data

Conference: Choi, Jinhwan; Sim, Alex

The HPC system consists of a set of layers of software and hardware for I/O and networking. System logs are helpful resources to understand what is going on in the system. A challenge is that it is non-trivial to analyze the logs maintained in various levels of the stack. Independent analysis might lead to an incomplete conclusion due to the limited coverage of each log. This work takes a comprehensive approach to analysis that incorporates the logs in the multiple layers and components, in order to facilitate the detection of anomalous activities. This research aims to identify and predict potential ...
https://doi.org/10.1109/ICDCS.2018.00176

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection

Conference: Subasi, Omer; Di, Sheng; Balaprakash, Prasanna; ...

Future HPC systems with ever-increasing resource capacity (such as compute cores, memory and storage) may significantly increase the risks on reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt HPC execution results. Unlike fail-stop errors, SDCs are rather harmful and dangerous in that they cannot be detected by hardware. We propose an online machine-learning based silent data corruption detection framework (abbreviated as MACORD) for detecting SDCs in HPC applications. In particular, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms in our study, and enable the detector to automatically select ...
https://doi.org/10.1109/CLUSTER.2017.128

HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors

Technical Report: Jones, T; Kale, L; Moreira, J; ...

The HPC-Colony Project, a collaboration with Lawrence Livermore National Laboratory, the University of Illinois at Urbana-Champaign and IBM, is focused on services and interfaces for very large numbers of processors. Advances in parallel systems in the last decade have delivered phenomenal progress in the overall capability available to a single parallel application. Several systems with peak capability of over 100TF are already available and systems are expected to exceed 1PF within a few years. Despite these impressive advances in peak performance capability, the sustained performance of these systems continues to fall as a percentage of the peak capability. Initial analysis ...
https://doi.org/10.2172/902273

Full Text Available

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces

Conference: Liu, Yang; Gunasekaran, Raghul; Ma, Xiaosong; ...

Competing workloads on a shared storage system cause I/O resource contention and application performance vagaries. This problem is already evident in today's HPC storage systems and is likely to become acute at exascale. We need more interaction between application I/O requirements and system software tools to help alleviate the I/O bottleneck, moving towards I/O-aware job scheduling. However, this requires rich techniques to capture application I/O characteristics, which remain evasive in production systems. Traditionally, I/O characteristics have been obtained using client-side tracing tools, with drawbacks such as non-trivial instrumentation/development costs, large trace traffic, and inconsistent adoption. We present a novel ...
