SMC 2021 : Analyzing Resource Utilization and User Behavior on Titan Supercomputer

Dash, Sajal

doi:10.13139/OLCF/1772604

Abstract

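Resource utilization statistics of jobs submitted to a supercomputer can help us understand how users from various scientific domains use HPC platforms, and can inform the design of better job schedulers. We aim to generate insight into workload distribution and usage patterns across domains from job scheduler traces, GPU failure information, and project-specific information collected from the Titan supercomputer. Furthermore, we want to know how scheduler performance varies over time and how users' scheduling behavior changes following a system failure. These observations have the potential to provide valuable insight that helps in preparing for system failures. These practices will help us develop and apply novel machine learning algorithms for understanding system behavior and requirements and for better scheduling of HPC systems.

There are two datasets, RUR and GPU.

• RUR: This dataset contains the job scheduler traces collected from the Titan supercomputer from 01/01/2015 to 07/31/2019 (2015.csv - 2019.csv). These were collected using Resource Utilization Report (RUR), a Cray-developed resource-usage data collection and reporting system. It contains the usage information of the critical resources (CPU, memory, GPU, and I/O) of each job running on Titan during that period [2]. ProjectAreas: Every job is associated with a project ID. The ProjectAreas.csv dataset provides a mapping of the project ID to its domain science.

• GPU: Hardware-related issues caused some GPUs in Titan to fail, sometimes irrecoverably, during job runs. This dataset provides information regarding these failures during the execution of the submitted jobs. GPUs on Titan are uniquely identified by a serial number (SN), and each is installed in a location. A GPU can be installed in a location, removed from that location following a failure, and then re-installed in a different location after the problem is fixed. If the failure can't be recovered, the GPU might be removed entirely from Titan. Two prominent types of failures resulted in the removal of GPUs from Titan: Double Bit Error (DBE) and Out of the Bus (OTB).

The dataset (gc_full.csv) has the following fields:

1. SN: Serial number of a GPU
2. location: The location where it is installed
3. insert: The time when it was inserted into that location
4. remove: The time when it was removed from that location
5. duration: Amount of time the GPU spent in this location
6. out: Whether the device was taken out entirely, without re-installation in a new location
7. event: If the GPU was taken out entirely, the reason for its removal

To learn more about this dataset, please refer to the git repository https://github.com/olcf/TitanGPULife and the related publication [1].

References

[1] George Ostrouchov, Don Maxwell, Rizwan A Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H Rogers. GPU lifetimes on Titan supercomputer: Survival analysis and reliability. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. IEEE, 2020.

[2] Feiyi Wang, Sarp Oral, Satyabrata Sen, and Neena Imam. Learning from five-year resource-utilization data of Titan system. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1-6. IEEE, 2019.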
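The GPU table described in this record (gc_full.csv) lists the fields SN, location, insert, remove, duration, out, and event. As a minimal sketch of how such rows might be summarized, the Python below sums the time each GPU spent installed and counts final removals by failure type. Only the column names come from the record. The sample rows, the location strings, the timestamp format, and the "True"/"False" encoding of `out` are assumptions made for illustration and should be checked against the real file.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical rows that reuse the gc_full.csv column names.
# Values, timestamp format, and the encoding of `out` are assumptions.
SAMPLE = """SN,location,insert,remove,duration,out,event
0001,loc-A,2015-01-01 00:00:00,2016-01-01 00:00:00,,False,
0001,loc-B,2016-01-02 00:00:00,2017-01-01 00:00:00,,True,DBE
0002,loc-A,2016-01-05 00:00:00,2018-01-05 00:00:00,,True,OTB
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed; verify against the real data


def summarize(csv_text):
    """Return (days installed per SN, number of final removals per event)."""
    days_per_gpu = Counter()
    removals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        inserted = datetime.strptime(row["insert"], TIME_FORMAT)
        removed = datetime.strptime(row["remove"], TIME_FORMAT)
        # Recompute the stay from insert/remove instead of trusting `duration`.
        days_per_gpu[row["SN"]] += (removed - inserted).days
        # Count only rows where the GPU left Titan for good.
        if row["out"] == "True":
            removals[row["event"]] += 1
    return dict(days_per_gpu), dict(removals)


days, removals = summarize(SAMPLE)
print(days)
print(removals)
```

Reading serial numbers as strings keeps any leading zeros intact, and summing per SN accounts for a GPU that was moved between locations before its final removal.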

Authors:

Dash, Sajal

ORNL-OLCF

Publication Date: 2021-03-26

DOE Contract Number: AC05-00OR22725

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Org.: Office of Science (SC)

Subject: 42 ENGINEERING; 97 MATHEMATICS AND COMPUTING; RUR, Titan, GPU Failure

OSTI Identifier: 1772604

DOI: https://doi.org/10.13139/OLCF/1772604

Citation Formats


Dash, Sajal. SMC 2021 : Analyzing Resource Utilization and User Behavior on Titan Supercomputer. United States: N. p., 2021. Web. doi:10.13139/OLCF/1772604.

Dash, Sajal. SMC 2021 : Analyzing Resource Utilization and User Behavior on Titan Supercomputer. United States. doi:https://doi.org/10.13139/OLCF/1772604

Dash, Sajal. 2021. "SMC 2021 : Analyzing Resource Utilization and User Behavior on Titan Supercomputer". United States. doi:https://doi.org/10.13139/OLCF/1772604. https://www.osti.gov/servlets/purl/1772604. Pub date: 2021-03-26


                    
@article{osti_1772604,

  title        = {SMC 2021 : Analyzing Resource Utilization and User Behavior on Titan Supercomputer},

  author       = {Dash, Sajal},

  abstractNote = {Resource utilization statistics of jobs submitted to a supercomputer can help us understand how users from various scientific domains use HPC platforms, and can inform the design of better job schedulers. We aim to generate insight into workload distribution and usage patterns across domains from job scheduler traces, GPU failure information, and project-specific information collected from the Titan supercomputer. Furthermore, we want to know how scheduler performance varies over time and how users' scheduling behavior changes following a system failure. These observations have the potential to provide valuable insight that helps in preparing for system failures. These practices will help us develop and apply novel machine learning algorithms for understanding system behavior and requirements and for better scheduling of HPC systems. There are two datasets, RUR and GPU. RUR: This dataset contains the job scheduler traces collected from the Titan supercomputer from 01/01/2015 to 07/31/2019 (2015.csv - 2019.csv). These were collected using Resource Utilization Report (RUR), a Cray-developed resource-usage data collection and reporting system. It contains the usage information of the critical resources (CPU, memory, GPU, and I/O) of each job running on Titan during that period [2]. ProjectAreas: Every job is associated with a project ID. The ProjectAreas.csv dataset provides a mapping of the project ID to its domain science. GPU: Hardware-related issues caused some GPUs in Titan to fail, sometimes irrecoverably, during job runs. This dataset provides information regarding these failures during the execution of the submitted jobs. GPUs on Titan are uniquely identified by a serial number (SN), and each is installed in a location. A GPU can be installed in a location, removed from that location following a failure, and then re-installed in a different location after the problem is fixed. If the failure can't be recovered, the GPU might be removed entirely from Titan. Two prominent types of failures resulted in the removal of GPUs from Titan: Double Bit Error (DBE) and Out of the Bus (OTB). The dataset (gc_full.csv) has the following fields: 1. SN: Serial number of a GPU. 2. location: The location where it is installed. 3. insert: The time when it was inserted into that location. 4. remove: The time when it was removed from that location. 5. duration: Amount of time the GPU spent in this location. 6. out: Whether the device was taken out entirely, without re-installation in a new location. 7. event: If the GPU was taken out entirely, the reason for its removal. To learn more about this dataset, please refer to the git repository https://github.com/olcf/TitanGPULife and the related publication [1]. References [1] George Ostrouchov, Don Maxwell, Rizwan A Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H Rogers. GPU lifetimes on Titan supercomputer: Survival analysis and reliability. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. IEEE, 2020. [2] Feiyi Wang, Sarp Oral, Satyabrata Sen, and Neena Imam. Learning from five-year resource-utilization data of Titan system. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1-6. IEEE, 2019.},

  doi          = {10.13139/OLCF/1772604},

  journal      = {},

  number       = {},

  volume       = {},

  place        = {United States},

  year         = {2021},

  month        = {3}

}



Similar records in DOE Data Explorer and OSTI.GOV collections:

SMC 2021 Data Challenge: Analyzing Resource Utilization and User Behavior on Titan Supercomputer

Dataset Dash, Sajal ; Paul, Arnab K. ; Oral, Sarp ; ...

Resource utilization statistics of submitted jobs on a supercomputer can help us understand how users from various scientific domains use HPC platforms and better design a job scheduler. We explore to generate insight regarding workload distribution and usage pattern domains from job scheduler trace, GPU failure information, and project-specific information collected from Titan supercomputer. Furthermore, we want to know how the scheduler performance varies over time and how the users' scheduling behavior changes following a system failure. These observations have the potential to provide valuable insight, which is helpful to prepare for system failures. These practices will help us develop ...

GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

Dataset Shankar, Mallikarjun ; Ostrouchov, George ; Maxwell, Don ; ...

George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Association for Computing Machinery, New York, NY, USA. Data and code for the SC20 paper about Titan GPU reliability analysis (https://github.com/olcf/TitanGPULife). Includes R code to generate graphics for the paper and additional analyses; see code/README for instructions. Includes original Titan GPU reliability data on over 100,000 collective hours of operation: data/titan.gpu.history.txt (history data) and data/titan.service.txt (service nodes for exclusion). Includes output data files produced ...

Neutron Imaging dataset for SMC 2021 data challenge

Dataset Peterson, Peter ; Granroth, Garrett ; Bilheux, Hassina ; ...

The neutron radiography (nR) dataset provides information on the neutron events measured for the Siemens star mask using the Timepix3 detector. The recordings consist of the position (x and y axes), the time-stamp, and the time-over-threshold (TOT) value of each neutron event.

AmeriFlux CA-SMC Smith Creek

Dataset Sonnentag, Oliver

This is the AmeriFlux version of the carbon flux data for the site CA-SMC Smith Creek. Site Description: Boreal forest-peatland landscape with discontinuous permafrost, 19 km southeast of Wrigley, NT, Canada.

VULCAN temperature dataset for SMC data challenge

Dataset Peterson, Peter

The VULCAN Beamline dataset provides the sample measurement, where temperature is recorded in two physically different places on the sample. These readings are held in two different hdf5 groups in the data file.