DOE Data Explorer, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System

Abstract

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. These harsh power and thermal conditions expose system components to extreme operating stress. Operating such modern HPC systems requires deep insight into long-term system behavior to maintain their efficiency and longevity. To help the HPC community gain such insight, we provide a dataset that records the long-term power and thermal behavior of Summit, the 200 PF pre-exascale supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit is an IBM AC922-based system with 9,252 IBM Power9 CPUs and 27,756 Nvidia V100 GPUs, and can consume up to 13 MW at peak. Heat removal is performed using medium-temperature direct liquid cooling and a rear-door heat exchanger-based secondary cooling loop. Extracted from high-resolution (1 Hz) per-component (GPU, CPU) measurements of the system, the primary dataset provides 10-second and 1-minute mean power and thermal measurements selected from five month-long segments over the course of 2020 (January & August), 2021 (February & August), and 2022 (January). For convenience, we also provide various sub-datasets randomly sampled across the time and space (hosts) of the cluster. Further details and example code for analysis can be found in the following GitHub repository: https://github.com/at-aaims/summit_power_and_thermal_data
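As a minimal sketch of how the published 10-second and 1-minute mean series relate to the underlying 1 Hz telemetry, the snippet below resamples a synthetic per-component stream with pandas. The column names (`gpu0_power`, `gpu0_core_temp`) are assumptions for illustration only; the actual schema and loading code are documented in the GitHub repository above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 1 Hz per-component telemetry.
# Column names are hypothetical; see the dataset's GitHub repo for the real schema.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=120, freq="1s")  # 2 minutes of 1 Hz samples
raw = pd.DataFrame(
    {
        "gpu0_power": rng.normal(250.0, 15.0, len(idx)),    # watts (synthetic)
        "gpu0_core_temp": rng.normal(45.0, 2.0, len(idx)),  # degrees C (synthetic)
    },
    index=idx,
)

# Downsample the 1 Hz stream into the 10-second and 1-minute mean series,
# mirroring the two resolutions the published dataset provides.
means_10s = raw.resample("10s").mean()
means_1min = raw.resample("1min").mean()

print(len(means_10s), len(means_1min))  # 12 ten-second bins, 2 one-minute bins
```

Working with the pre-averaged series rather than the raw 1 Hz stream keeps long-range analyses (months of data across thousands of hosts) tractable in memory.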

Authors:
Shin, Woong; Ellis, J. Austin; Karimi, Ahmad Maroof; Oles, Vladyslav; Dash, Sajal; Wang, Feiyi
  1. ORNL-OLCF
Publication Date:
April 11, 2022
DOE Contract Number:  
AC05-00OR22725
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
Office of Science (SC)
Subject:
97 MATHEMATICS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS; High-performance Computing, system power and thermal, reliability, CPUs, GPUs, medium temperature water cooling, direct liquid cooling
OSTI Identifier:
1861393
DOI:
https://doi.org/10.13139/OLCF/1861393

Citation Formats

Shin, Woong, Ellis, J. Austin, Karimi, Ahmad Maroof, Oles, Vladyslav, Dash, Sajal, and Wang, Feiyi. Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System. United States: N. p., 2022. Web. doi:10.13139/OLCF/1861393.
Shin, Woong, Ellis, J. Austin, Karimi, Ahmad Maroof, Oles, Vladyslav, Dash, Sajal, & Wang, Feiyi. Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System. United States. https://doi.org/10.13139/OLCF/1861393
Shin, Woong, Ellis, J. Austin, Karimi, Ahmad Maroof, Oles, Vladyslav, Dash, Sajal, and Wang, Feiyi. 2022. "Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System". United States. https://doi.org/10.13139/OLCF/1861393. https://www.osti.gov/servlets/purl/1861393. Publication date: April 11, 2022.
@article{osti_1861393,
title = {Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System},
author = {Shin, Woong and Ellis, J. Austin and Karimi, Ahmad Maroof and Oles, Vladyslav and Dash, Sajal and Wang, Feiyi},
abstractNote = {As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. These harsh power and thermal conditions expose system components to extreme operating stress. Operating such modern HPC systems requires deep insight into long-term system behavior to maintain their efficiency and longevity. To help the HPC community gain such insight, we provide a dataset that records the long-term power and thermal behavior of Summit, the 200 PF pre-exascale supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit is an IBM AC922-based system with 9,252 IBM Power9 CPUs and 27,756 Nvidia V100 GPUs, and can consume up to 13 MW at peak. Heat removal is performed using medium-temperature direct liquid cooling and a rear-door heat exchanger-based secondary cooling loop. Extracted from high-resolution (1 Hz) per-component (GPU, CPU) measurements of the system, the primary dataset provides 10-second and 1-minute mean power and thermal measurements selected from five month-long segments over the course of 2020 (January & August), 2021 (February & August), and 2022 (January). For convenience, we also provide various sub-datasets randomly sampled across the time and space (hosts) of the cluster. Further details and example code for analysis can be found in the following GitHub repository: https://github.com/at-aaims/summit_power_and_thermal_data},
doi = {10.13139/OLCF/1861393},
place = {United States},
year = {2022},
month = {apr}
}