Title: Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

Authors:
Tiwari, Devesh [1]; Gupta, Saurabh [1]; Gallarno, George [2]; Rogers, Jim [1]; Maxwell, Don [1]
  1. Oak Ridge National Laboratory
  2. Christian Brothers University
Publication Date:
Research Org.:
Oak Ridge National Laboratory, Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567401
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Journal Name:
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS
Additional Journal Information:
Conference: International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, Texas, November 15-20, 2015
Country of Publication:
United States
Language:
English
Subject:
Computer Science; Engineering

Citation Formats

Tiwari, Devesh, Gupta, Saurabh, Gallarno, George, Rogers, Jim, and Maxwell, Don. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. United States: N. p., 2015. Web. doi:10.1145/2807591.2807666.
Tiwari, Devesh, Gupta, Saurabh, Gallarno, George, Rogers, Jim, & Maxwell, Don. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. United States. doi:10.1145/2807591.2807666.
Tiwari, Devesh, Gupta, Saurabh, Gallarno, George, Rogers, Jim, and Maxwell, Don. "Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis". United States. doi:10.1145/2807591.2807666.
@article{osti_1567401,
title = {Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, In: SC '15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
author = {Tiwari, Devesh and Gupta, Saurabh and Gallarno, George and Rogers, Jim and Maxwell, Don},
abstractNote = {The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.},
doi = {10.1145/2807591.2807666},
journal = {PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS},
place = {United States},
year = {2015},
month = {1}
}
