OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Abstract

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.

Authors:
Nie, Bin [1]; Xue, Ji [1]; Gupta, Saurabh [2]; Patel, Tirthak [3]; Engelmann, Christian [4]; Smirni, Evgenia [1]; Tiwari, Devesh [3]
  1. College of William and Mary, Williamsburg, VA
  2. Intel Corporation
  3. Northeastern University, Boston
  4. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1462859
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, Luxembourg City, Luxembourg, 6/25/2018-6/28/2018
Country of Publication:
United States
Language:
English

Citation Formats

Nie, Bin, Xue, Ji, Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, Smirni, Evgenia, and Tiwari, Devesh. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. United States: N. p., 2018. Web. doi:10.1109/DSN.2018.00022.
Nie, Bin, Xue, Ji, Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, Smirni, Evgenia, & Tiwari, Devesh. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. United States. https://doi.org/10.1109/DSN.2018.00022
Nie, Bin, Xue, Ji, Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, Smirni, Evgenia, and Tiwari, Devesh. "Machine Learning Models for GPU Error Prediction in a Large Scale HPC System". United States. https://doi.org/10.1109/DSN.2018.00022. https://www.osti.gov/servlets/purl/1462859.
@article{osti_1462859,
title = {Machine Learning Models for GPU Error Prediction in a Large Scale HPC System},
author = {Nie, Bin and Xue, Ji and Gupta, Saurabh and Patel, Tirthak and Engelmann, Christian and Smirni, Evgenia and Tiwari, Devesh},
abstractNote = {GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.},
doi = {10.1109/DSN.2018.00022},
url = {https://www.osti.gov/biblio/1462859},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {6}
}
