
Title: Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Abstract

Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year time period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correcting in memory systems.
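
The log join mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layouts (`ErrorEvent`, `Job`), their field names, and the function `errors_per_job` are hypothetical, chosen only to show the general idea of attributing an error to a job when it occurs on one of the job's nodes while the job is running.

```python
from collections import Counter
from typing import NamedTuple


class ErrorEvent(NamedTuple):
    """Hypothetical record for one hardware error log entry."""
    node: int      # compute node that reported the error
    time: float    # timestamp in seconds
    kind: str      # e.g. "memory"


class Job(NamedTuple):
    """Hypothetical record for one job scheduler log entry."""
    job_id: str
    user: str
    nodes: frozenset  # compute nodes allocated to the job
    start: float      # start time in seconds (inclusive)
    end: float        # end time in seconds (exclusive)


def errors_per_job(errors, jobs):
    """Count errors that occurred on a job's nodes during its run."""
    # Index jobs by node so each error is compared only with
    # jobs that actually used the reporting node.
    jobs_by_node = {}
    for job in jobs:
        for node in job.nodes:
            jobs_by_node.setdefault(node, []).append(job)

    counts = Counter()
    for err in errors:
        for job in jobs_by_node.get(err.node, []):
            if job.start <= err.time < job.end:
                counts[job.job_id] += 1
    return counts


if __name__ == "__main__":
    # Illustrative data only, not taken from the study.
    jobs = [
        Job("j1", "alice", frozenset({1, 2}), 0.0, 100.0),
        Job("j2", "bob", frozenset({2, 3}), 100.0, 200.0),
    ]
    errors = [
        ErrorEvent(1, 10.0, "memory"),
        ErrorEvent(2, 150.0, "memory"),
        ErrorEvent(3, 50.0, "memory"),  # node 3 is idle at t=50
    ]
    print(dict(errors_per_job(errors, jobs)))
```

Per-user counts could be obtained the same way by keying the counter on `job.user` instead of `job.job_id`.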

Authors:
Lim, Seung-Hwan [1]; Miller, Ross [1]; Vazhkudai, Sudharshan [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1649409
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS), New Orleans, Louisiana, United States of America, May 18-22, 2020
Country of Publication:
United States
Language:
English

Citation Formats

Lim, Seung-Hwan, Miller, Ross, and Vazhkudai, Sudharshan. Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer. United States: N. p., 2020. Web.
Lim, Seung-Hwan, Miller, Ross, & Vazhkudai, Sudharshan. Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer. United States.
Lim, Seung-Hwan, Miller, Ross, and Vazhkudai, Sudharshan. Fri . "Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer". United States. https://www.osti.gov/servlets/purl/1649409.
@article{osti_1649409,
title = {Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer},
author = {Lim, Seung-Hwan and Miller, Ross and Vazhkudai, Sudharshan},
abstractNote = {Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year time period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correcting in memory systems.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2020},
month = {5}
}

