OSTI.GOV | U.S. Department of Energy
Office of Scientific and Technical Information

Title: Learning from Five-year Resource-Utilization Data of Titan System

Abstract

Titan was the flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Deployed in late 2012, it became the fastest supercomputer in the world and was retired on August 2, 2019. With Titan's mission complete, this paper provides a first-order examination of the usage of its critical resources (CPU, memory, GPU, and I/O) over a five-year production period (2015-2019). In particular, we show quantitatively that the majority of CPU time was spent on large-scale jobs, consistent with the policy of driving ground-breaking science through leadership computing. We also corroborate the general observation of low CPU-memory usage, with 95% of jobs utilizing 15% or less of the available memory. Additionally, we correlate the increase in total job submissions and the decrease in GPU-enabled jobs during 2016 with a GPU reliability issue that impacted large-scale runs. We further show a surprising read/write ratio over the five-year period, which contradicts the general perception of large-scale simulation machines as “write-heavy”. This understanding has the potential to impact how we design our next-generation large-scale storage systems. We believe that our analyses and findings will be of great interest to the high-performance computing (HPC) community at large.

Authors:
Wang, Feiyi [1]; Oral, Sarp [1]; Sen, Satyabrata [1]; Imam, Neena [1]
  1. ORNL
Publication Date:
September 2019
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1606979
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (IEEE CLUSTER): Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, Albuquerque, NM, United States of America, 23 September 2019
Country of Publication:
United States
Language:
English

Citation Formats

Wang, Feiyi, Oral, Sarp, Sen, Satyabrata, and Imam, Neena. Learning from Five-year Resource-Utilization Data of Titan System. United States: N. p., 2019. Web. doi:10.1109/CLUSTER.2019.8891001.
Wang, Feiyi, Oral, Sarp, Sen, Satyabrata, & Imam, Neena. Learning from Five-year Resource-Utilization Data of Titan System. United States. https://doi.org/10.1109/CLUSTER.2019.8891001
Wang, Feiyi, Oral, Sarp, Sen, Satyabrata, and Imam, Neena. 2019. "Learning from Five-year Resource-Utilization Data of Titan System". United States. https://doi.org/10.1109/CLUSTER.2019.8891001. https://www.osti.gov/servlets/purl/1606979.
@misc{osti_1606979,
title = {Learning from Five-year Resource-Utilization Data of Titan System},
author = {Wang, Feiyi and Oral, Sarp and Sen, Satyabrata and Imam, Neena},
abstractNote = {Titan was the flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). It was deployed in late 2012, became the fastest supercomputer in the world and was retired on August 2, 2019. With Titan's mission complete, this paper provides a first-order examination of the usage of its critical resources (CPU, Memory, GPU, and I/O) over a five-year production period (2015-2019). In particular, we show quantitatively that the majority of CPU time was spent on the large-scale jobs, which is consistent with the policy of driving ground-breaking science through leadership computing. We also corroborate the general observation of the low CPU-memory usage with 95% jobs utilizing only 15% or less available memory. Additionally, we correlate the increase of total job submissions and the decrease of GPU-enabled jobs during 2016 with the GPU reliability issue which impacted the large-scale runs. We further show the surprising read/write ratio over the five-year period, which contradicts the general mindset of the large-scale simulation machines being “write-heavy”. This understanding will have potential impact on how we design our next-generation large-scale storage systems. We believe that our analyses and findings are going to be of great interest to the high-performance computing (HPC) community at large.},
doi = {10.1109/CLUSTER.2019.8891001},
url = {https://www.osti.gov/biblio/1606979},
place = {United States},
year = {2019},
month = {9}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
