skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

Authors:
 [1];  [2];  [2]
  1. Christian Brothers University
  2. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE
OSTI Identifier:
1248768
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Supercomputing (SC), Austin, TX, USA, 20151115, 20151115
Country of Publication:
United States
Language:
English

Citation Formats

Gallarno, George, Rogers, James H, and Maxwell, Don E. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility. United States: N. p., 2015. Web.
Gallarno, George, Rogers, James H, & Maxwell, Don E. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility. United States.
Gallarno, George, Rogers, James H, and Maxwell, Don E. 2015. "Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility". United States.
@article{osti_1248768,
title = {Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility},
author = {Gallarno, George and Rogers, James H and Maxwell, Don E},
abstractNote = {The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.},
doi = {},
url = {https://www.osti.gov/biblio/1248768}, journal = {},
number = ,
volume = ,
place = {United States},
year = {Thu Jan 01 00:00:00 EST 2015},
month = {Thu Jan 01 00:00:00 EST 2015}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share: