DOE Data Explorer, U.S. Department of Energy
Office of Scientific and Technical Information

Title: GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

Abstract

Data and code for the SC20 paper about Titan GPU reliability analysis:

George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Association for Computing Machinery, New York, NY, USA.

Repository: https://github.com/olcf/TitanGPULife

Includes R code to generate graphics for the paper and additional analyses. See code/README for instructions.

Includes original Titan GPU reliability data on over 100,000 collective hours of operation:
  data/titan.gpu.history.txt - history data
  data/titan.service.txt - service nodes for exclusion

Includes output data files produced by code/TitanGPUmodel.Rmd:
  data/gc_full.csv - cleaned up data (see paper and R code)
  data/gc_summary_loc.csv - one record per GPU (see paper and R code), with variables: SN, time, nlife, nloc, last, col, row, cage, slot, node, max_loc_events, time_max_loc, dbe, dbe_loc, otb, otb_loc, out, batch, days, years, dead, dead_otb, dead_dbe

Includes the .Rmd analysis document as TitanGPUmode.html.

Includes Python code to process data/gc_full.csv into graphics from time-between-failure analyses. See code/tbf-analyses/README for instructions.
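As a rough illustration of how the one-record-per-GPU file could feed the Kaplan-Meier analysis named in the keywords, the sketch below computes a survival curve with the Python standard library only. It is not the authors' R code. It assumes, without confirmation from this record, that the years column holds each GPU's observed lifetime and that dead marks a failure (1 or TRUE) versus a censored unit. The function names load_lifetimes and kaplan_meier and the five sample rows are hypothetical; the rows are synthetic and are not Titan data.

```python
import csv
import io
from collections import Counter


def load_lifetimes(fileobj, time_col="years", event_col="dead"):
    """Read (duration, event) pairs from a CSV with a header row.

    Assumption: `time_col` is the observed lifetime and `event_col`
    flags a failure as 1/TRUE; anything else is treated as censored.
    """
    durations, events = [], []
    for row in csv.DictReader(fileobj):
        durations.append(float(row[time_col]))
        events.append(row[event_col].strip().lower() in {"1", "true", "t"})
    return durations, events


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct failure time t."""
    failures = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # failures plus censored units at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Synthetic rows for illustration only (NOT Titan data).
sample = io.StringIO(
    "SN,years,dead\n"
    "A,1.0,1\n"
    "B,2.0,0\n"
    "C,3.0,1\n"
    "D,4.0,1\n"
    "E,5.0,0\n"
)
durations, events = load_lifetimes(sample)
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t:.1f} years  S(t)={s:.4f}")
```

To try this on the real file, replace the synthetic StringIO with open("data/gc_summary_loc.csv", newline="") after checking the column meanings against the paper and code/README. Censored units reduce the at-risk count without lowering the curve, which is the reason to use a product-limit estimate rather than a simple failure fraction.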

Authors:
Shankar, Mallikarjun; Ostrouchov, George; Maxwell, Don; Rogers, James; Ashraf, Rizwan; Engelmann, Christian
Publication Date:
2020-09-02
DOE Contract Number:  
DE-AC05-00OR22725
Product Type:
Dataset
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Subject:
42 ENGINEERING; 47 OTHER INSTRUMENTATION
Keywords:
GPU, reliability, supercomputer, NVIDIA, Cray, large-scale systems, Kaplan-Meier survival, Cox regression
OSTI Identifier:
1657202
DOI:
https://doi.org/10.13139/ORNLNCCS/1657202

Citation Formats

Shankar, Mallikarjun, Ostrouchov, George, Maxwell, Don, Rogers, James, Ashraf, Rizwan, and Engelmann, Christian. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. United States: N. p., 2020. Web. doi:10.13139/ORNLNCCS/1657202.
Shankar, Mallikarjun, Ostrouchov, George, Maxwell, Don, Rogers, James, Ashraf, Rizwan, & Engelmann, Christian. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. United States. https://doi.org/10.13139/ORNLNCCS/1657202
Shankar, Mallikarjun, Ostrouchov, George, Maxwell, Don, Rogers, James, Ashraf, Rizwan, and Engelmann, Christian. 2020. "GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability". United States. https://doi.org/10.13139/ORNLNCCS/1657202. https://www.osti.gov/servlets/purl/1657202. Publication date: September 2, 2020.
@article{osti_1657202,
title = {GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability},
author = {Shankar, Mallikarjun and Ostrouchov, George and Maxwell, Don and Rogers, James and Ashraf, Rizwan and Engelmann, Christian},
abstractNote = {George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Association for Computing Machinery, New York, NY, USA Data and code for SC20 paper about Titan GPU reliability analysis. https://github.com/olcf/TitanGPULife Includes R code to generate graphics for paper and additional analyses. See code/README for instructions. Includes original Titan GPU reliability data on over 100,000 collective hours of operation data/titan.gpu.history.txt - history data data/titan.service.txt - service nodes for exclusion Includes output data files produced by code/TitanGPUmodel.Rmd data/gc_full.csv - cleaned up data (see paper and R code) data/gc_summary_loc.csv - one record per GPU (variables: SN,time,nlife,nloc,last,col,row,cage,slot,node,max_loc_events,time_max_loc,dbe,dbe_loc,otb,otb_loc,out,batch,days,years,dead,dead_otb,dead_dbe) (see paper and R code) Includes .Rmd analysis document as TitanGPUmode.html Includes Python code to process data/gc_full.csv into graphics from time-between-failure analyses See code/tbf-analyses/README for instructions},
doi = {10.13139/ORNLNCCS/1657202},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2020},
month = {9}
}