Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer
- ORNL
Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators, and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing runs during which events were recorded with corresponding runs during which no events were recorded. Such a statistical investigation is possible because system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not in every instance, and that different system components affect application performance differently. In particular, our investigation covers the parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users to increase the operational efficiency of extreme-scale systems.
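The core comparison described in the abstract, that is, repeated runs of the same application split by whether a RAS event overlapped the run's nodes and time window, can be illustrated with a minimal sketch. The toy data, column names (run_id, app, node, start, runtime, event_time, category), and the overlap-join logic below are illustrative assumptions for exposition, not the authors' actual pipeline.

```python
# Illustrative sketch (assumed schema and toy data): join a RAS event log
# with a job scheduler database, flag runs that overlap an event in node
# and time, then compare runtimes of event-present vs. event-free runs.
import pandas as pd

# Scheduler database: one row per application run (toy data).
runs = pd.DataFrame({
    "app":     ["lammps", "lammps", "lammps", "gromacs"],
    "node":    ["c0-0n1", "c0-0n2", "c0-0n1", "c1-0n3"],
    "start":   pd.to_datetime(["2018-03-01 10:00", "2018-03-02 10:00",
                               "2018-03-03 10:00", "2018-03-01 11:00"]),
    "runtime": [3600.0, 4150.0, 3590.0, 7200.0],  # wall time in seconds
}).reset_index().rename(columns={"index": "run_id"})
runs["end"] = runs["start"] + pd.to_timedelta(runs["runtime"], unit="s")

# RAS log: one row per recorded event (toy data).
events = pd.DataFrame({
    "node":       ["c0-0n2"],
    "event_time": pd.to_datetime(["2018-03-02 10:30"]),
    "category":   ["gpu_double_bit_error"],
})

# A run is "affected" if any event hit one of its nodes during execution.
pairs = runs.merge(events, on="node", how="left")
pairs["hit"] = pairs["event_time"].between(pairs["start"], pairs["end"])
runs["affected"] = runs["run_id"].map(pairs.groupby("run_id")["hit"].any())

# Per-application runtime statistics, split by event presence.
print(runs.groupby(["app", "affected"])["runtime"]
          .agg(["count", "mean", "std"]))
```

At Titan scale the same idea would be an interval join over millions of rows rather than an in-memory merge, but the grouping logic, repeated runs of one application partitioned by event presence, captures the essence of the comparison the abstract describes.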
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1486940
- Resource Relation:
- Conference: 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, Texas, United States of America, November 11-16, 2018
- Country of Publication:
- United States
- Language:
- English