OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Combating the Reliability Challenge of GPU Register File at Low Supply Voltage

Abstract

Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. We propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from an unreliable register file at low voltages.

Authors:
Tan, Jingweijia; Song, Shuaiwen; Yan, Kaige; Fu, Xin; Marquez, Andres; Kerbyson, Darren J.
Publication Date:
2016-09-11
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1339050
Report Number(s):
PNNL-SA-119484
KJ0402000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 25th International Conference on Parallel Architectures and Compilation (PACT '16), September 11-15, 2016, Haifa, Israel, 3-15
Country of Publication:
United States
Language:
English
Subject:
Energy-efficient design; Architecture-Compiler co-design; low voltage GPU

Citation Formats

Tan, Jingweijia, Song, Shuaiwen, Yan, Kaige, Fu, Xin, Marquez, Andres, and Kerbyson, Darren J. Combating the Reliability Challenge of GPU Register File at Low Supply Voltage. United States: N. p., 2016. Web. doi:10.1145/2967938.2967951.
Tan, Jingweijia, Song, Shuaiwen, Yan, Kaige, Fu, Xin, Marquez, Andres, & Kerbyson, Darren J. (2016). Combating the Reliability Challenge of GPU Register File at Low Supply Voltage. United States. doi:10.1145/2967938.2967951.
Tan, Jingweijia, Song, Shuaiwen, Yan, Kaige, Fu, Xin, Marquez, Andres, and Kerbyson, Darren J. 2016. "Combating the Reliability Challenge of GPU Register File at Low Supply Voltage". United States. doi:10.1145/2967938.2967951.
@article{osti_1339050,
title = {Combating the Reliability Challenge of GPU Register File at Low Supply Voltage},
author = {Tan, Jingweijia and Song, Shuaiwen and Yan, Kaige and Fu, Xin and Marquez, Andres and Kerbyson, Darren J.},
abstractNote = {Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. We propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from an unreliable register file at low voltages.},
doi = {10.1145/2967938.2967951},
place = {United States},
year = {2016},
month = {9}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • To support their massively multithreaded architecture, GPUs use a very large register file (RF), which has a capacity higher than even the L1 and L2 caches. In contrast, traditional CPUs use a tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of the RF in determining GPU performance, novel and intelligent techniques are required for managing the GPU RF. In this paper, we survey the techniques for designing and managing the GPU RF. We discuss techniques related to the performance, energy and reliability aspects of the RF. To emphasize the similarities and differences between the techniques, we classify them along several parameters. Lastly, the aim of this paper is to synthesize the state-of-the-art developments in RF management and also stimulate further research in this area.
  • The 1700 MeV, 100 mA Accelerator Production of Tritium (APT) Proton Linac will require 244 1 MW, continuous wave RF systems. 1 MW continuous wave klystrons are used as the RF source and each klystron requires 95 kV, 17 A of beam voltage and current. The cost of the DC power supplies is the single largest percentage of the total RF system cost. Power supply reliability is crucial to overall RF system availability and AC to DC conversion efficiency affects the operating cost. The Low Energy Demonstration Accelerator (LEDA) being constructed at Los Alamos National Laboratory (LANL) will serve as the prototype and test bed for APT. The design of the RF systems used in LEDA is driven by the need to field test high efficiency systems with extremely high reliability before APT is built. The authors present a detailed description and test results of one type of advanced high voltage power supply system using Insulated Gate Bipolar Transistors (IGBTs) that has been used with the LEDA High Power RF systems. The authors also present some of the distinctive features offered by this power supply topology, including crowbarless tube protection and modular construction which allows graceful degradation of power supply operation.
  • Recent trends in microprocessor design heavily rely on large register files with large I/O bandwidths for sustaining performance; a possible solution to relieve this bottleneck is the adoption of multiple register files. In this paper we show how the problem of assigning variables to multiple register banks can be reduced to that of a hypergraph coloring and, also, propose a technique to perform this coloring; this technique is applied to the problem of variable partitioning for multiple-register-file VLIW architectures.
  • The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.