Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
- College of William and Mary, Williamsburg, VA
- Intel Corporation
- Northeastern University, Boston
- ORNL
GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. Because these applications are normally long-running, understanding the characteristics of GPU errors is imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six months of trace data collected from a large-scale, operational HPC system. We then use machine learning to predict the occurrence of GPU errors by taking advantage of temporal and spatial dependencies in the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1462859
- Resource Relation:
- Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018 - Luxembourg City, Luxembourg - 6/25/2018-6/28/2018
- Country of Publication:
- United States
- Language:
- English