Title: Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

Conference
OSTI ID:1248768

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1248768
Resource Relation:
Conference: Supercomputing (SC), Austin, TX, USA, 20151115, 20151115
Country of Publication:
United States
Language:
English

Similar Records

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Conference · Thu Jan 01 00:00:00 EST 2015 · PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS · OSTI ID:1248768

GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
Conference · Sun Nov 01 00:00:00 EDT 2020 · OSTI ID:1248768

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Conference · Sun Feb 01 00:00:00 EST 2015 · 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA); 7-11 Feb. 2015; Burlingame, CA, USA · OSTI ID:1248768
