
Title: Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities

Abstract

GPUs have become part of mainstream high-performance computing facilities, which increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, draws out the implications of that understanding, and examines how to exploit these insights to predict GPU errors using neural networks.

Authors:
 Nie, Bin [1]; Tiwari, Devesh [2]; Xue, Ji [1]; Gupta, Saurabh [3]; Engelmann, Christian [4]; Smirni, Evgenia [1]
  1. College of William and Mary, Williamsburg, VA
  2. Northeastern University, Boston
  3. Intel Corporation
  4. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1423068
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, Banff, Canada, 9/20/2017-9/22/2017
Country of Publication:
United States
Language:
English

Citation Formats

Nie, Bin, Tiwari, Devesh, Xue, Ji, Gupta, Saurabh, Engelmann, Christian, and Smirni, Evgenia. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. United States: N. p., 2017. Web. doi:10.1109/MASCOTS.2017.12.
Nie, Bin, Tiwari, Devesh, Xue, Ji, Gupta, Saurabh, Engelmann, Christian, & Smirni, Evgenia. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. United States. https://doi.org/10.1109/MASCOTS.2017.12
Nie, Bin, Tiwari, Devesh, Xue, Ji, Gupta, Saurabh, Engelmann, Christian, and Smirni, Evgenia. 2017. "Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities". United States. https://doi.org/10.1109/MASCOTS.2017.12. https://www.osti.gov/servlets/purl/1423068.
@article{osti_1423068,
title = {Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities},
author = {Nie, Bin and Tiwari, Devesh and Xue, Ji and Gupta, Saurabh and Engelmann, Christian and Smirni, Evgenia},
abstractNote = {GPUs have become part of mainstream high-performance computing facilities, which increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, draws out the implications of that understanding, and examines how to exploit these insights to predict GPU errors using neural networks.},
doi = {10.1109/MASCOTS.2017.12},
url = {https://www.osti.gov/biblio/1423068},
place = {United States},
year = {2017},
month = {11}
}
