Learning from Five-year Resource-Utilization Data of Titan System
- ORNL
Titan was the flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). It was deployed in late 2012, became the fastest supercomputer in the world, and was retired on August 2, 2019. With Titan's mission complete, this paper provides a first-order examination of the usage of its critical resources (CPU, memory, GPU, and I/O) over a five-year production period (2015-2019). In particular, we show quantitatively that the majority of CPU time was spent on large-scale jobs, which is consistent with the policy of driving ground-breaking science through leadership computing. We also corroborate the general observation of low CPU-memory usage, with 95% of jobs utilizing only 15% or less of the available memory. Additionally, we correlate the increase in total job submissions and the decrease in GPU-enabled jobs during 2016 with the GPU reliability issue that impacted large-scale runs. We further show a surprising read/write ratio over the five-year period, which contradicts the common assumption that large-scale simulation machines are “write-heavy”. This understanding could influence the design of next-generation large-scale storage systems. We believe that our analyses and findings will be of great interest to the high-performance computing (HPC) community at large.
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1606979
- Country of Publication: United States
- Language: English
Workload characterization of a leadership class storage cluster
Conference · November 2010
Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility
Conference · January 2015
Main Memory in HPC: Do We Need More or Could We Live with Less?
Journal · March 2017
Similar Records
Learning from Five-year Resource-Utilization Data of Titan System
Conference · September 1, 2019 · OSTI ID: 1648993
GPU age-aware scheduling to improve the reliability of leadership jobs on Titan
Conference · November 1, 2018 · OSTI ID: 1489583
SMC 2021 Data Challenge: Analyzing Resource Utilization and User Behavior on Titan Supercomputer
Dataset · March 29, 2021 · OSTI ID: 1772811