U.S. Department of Energy
Office of Scientific and Technical Information

Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

Conference

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world’s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1248768
Country of Publication:
United States
Language:
English

Similar Records

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Conference · Wed Dec 31 23:00:00 EST 2014 · PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS · OSTI ID:1567401

GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
Conference · Sun Nov 01 00:00:00 EDT 2020 · OSTI ID:1771896

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Conference · Sat Jan 31 23:00:00 EST 2015 · 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA); 7-11 Feb. 2015; Burlingame, CA, USA · OSTI ID:1567575
