OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer

Abstract

Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible because we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users to increase the operational efficiency of extreme-scale systems.

Authors:
Ashraf, Rizwan A. [1]; Engelmann, Christian [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Laboratory, Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1486940
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) - Dallas, Texas, United States of America - 11/11/2018 8:00:00 PM-11/16/2018 8:00:00 PM
Country of Publication:
United States
Language:
English

Citation Formats

Ashraf, Rizwan A., and Engelmann, Christian. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. United States: N. p., 2018. Web. doi:10.1109/FTXS.2018.00008.
Ashraf, Rizwan A., & Engelmann, Christian. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. United States. doi:10.1109/FTXS.2018.00008.
Ashraf, Rizwan A., and Engelmann, Christian. "Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer". United States. doi:10.1109/FTXS.2018.00008. https://www.osti.gov/servlets/purl/1486940.
@article{osti_1486940,
title = {Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer},
author = {Ashraf, Rizwan A. and Engelmann, Christian},
abstractNote = {Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible because we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users to increase the operational efficiency of extreme-scale systems.},
doi = {10.1109/FTXS.2018.00008},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {11}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
