DOE PAGES

This content will become publicly available on June 10, 2017

Title: Spatiotemporal modeling of node temperatures in supercomputers

Los Alamos National Laboratory (LANL) is home to many large supercomputing clusters. These clusters require an enormous amount of power (~500-2000 kW each), and most of this energy is converted into heat. Thus, cooling the components of the supercomputer becomes a critical and expensive endeavor. Recently a project was initiated to investigate the effect that changes to the cooling system in a machine room had on three large machines that were housed there. Coupled with this goal was the aim to develop a general good practice for characterizing the effect of cooling changes and monitoring machine node temperatures in this and other machine rooms. This paper focuses on the statistical approach used to quantify the effect that several cooling changes to the room had on the temperatures of the individual nodes of the computers. The largest cluster in the room has 1,600 nodes that run a variety of jobs during general use. Since extreme temperatures are important, a Normal distribution plus generalized Pareto distribution for the upper tail is used to model the marginal distribution, along with a Gaussian process copula to account for spatio-temporal dependence. A Gaussian Markov random field (GMRF) model is used to model the spatial effects on the node temperatures as the cooling changes take place. This model is then used to assess the condition of the node temperatures after each change to the room. The analysis approach was used to uncover the cause of a problematic episode of overheating nodes on one of the supercomputing clusters. Lastly, this same approach can easily be applied to monitor and investigate cooling systems at other data centers.
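The marginal model in the abstract (a Normal body with a generalized Pareto upper tail, mapped to a Gaussian scale for the copula) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the splicing rule at the threshold, the function names, and every parameter value in `EXAMPLE` are assumptions chosen only for illustration, since the abstract does not give the exact parameterization.

```python
from statistics import NormalDist


def gpd_cdf(y, scale, shape):
    """CDF of a generalized Pareto distribution for an exceedance y >= 0."""
    if y <= 0.0:
        return 0.0
    if abs(shape) < 1e-12:
        # Limiting case shape -> 0 is the exponential distribution.
        return 1.0 - pow(2.718281828459045, -y / scale)
    base = 1.0 + shape * y / scale
    if base <= 0.0:
        # For shape < 0 the distribution has a finite upper endpoint.
        return 1.0
    return 1.0 - base ** (-1.0 / shape)


def spliced_cdf(x, mu, sigma, threshold, scale, shape):
    """Normal CDF up to `threshold`, generalized Pareto tail above it.

    Assumed splicing rule: the tail carries the probability mass that the
    Normal body leaves above the threshold, so the CDF is continuous there.
    """
    body = NormalDist(mu, sigma)
    if x <= threshold:
        return body.cdf(x)
    p_u = body.cdf(threshold)
    return p_u + (1.0 - p_u) * gpd_cdf(x - threshold, scale, shape)


def gaussian_score(x, mu, sigma, threshold, scale, shape):
    """Map a temperature to a standard Normal score, z = Phi^{-1}(F(x)).

    Scores like this are the inputs a Gaussian copula works with.
    """
    p = spliced_cdf(x, mu, sigma, threshold, scale, shape)
    # Clamp away from 0 and 1, where the inverse Normal CDF is undefined.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return NormalDist().inv_cdf(p)


# Hypothetical parameters (degrees Celsius), not taken from the paper.
EXAMPLE = dict(mu=60.0, sigma=5.0, threshold=70.0, scale=2.0, shape=0.1)

if __name__ == "__main__":
    for temp in (55.0, 65.0, 70.0, 75.0, 85.0):
        print(temp, spliced_cdf(temp, **EXAMPLE), gaussian_score(temp, **EXAMPLE))
```

Below the threshold the score reduces to the usual standardized value `(x - mu) / sigma`; above it, the heavier Pareto tail pulls the scores of very hot nodes closer to zero than a pure Normal model would.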
Authors:
 [1] ;  [2] ;  [1] ;  [1] ;  [1] ;  [1] ;  [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  2. North Carolina State Univ., Raleigh, NC (United States)
Publication Date:
OSTI Identifier:
1329590
Report Number(s):
LA-UR--15-22229
Journal ID: ISSN 0162-1459
Grant/Contract Number:
AC52-06NA25396
Type:
Accepted Manuscript
Journal Name:
Journal of the American Statistical Association
Additional Journal Information:
Journal Name: Journal of the American Statistical Association; Journal ID: ISSN 0162-1459
Publisher:
Taylor & Francis
Research Org:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; mathematics; high performance computing; cooling; spatiotemporal; hierarchical Bayesian modeling; generalized Pareto distribution; extreme value; copula