Title: Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer

Conference

Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible because we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users for increasing the operational efficiency of extreme-scale systems.
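
The comparison described in the abstract (repeated runs of the same application, split by whether a RAS event was recorded) can be illustrated with a minimal Python sketch. This is not the authors' method or data: the record layout `(app_name, runtime_seconds, had_ras_event)`, the function name `compare_runs`, and the sample values are all hypothetical, and the coefficient of variation is used here only as one plausible measure of variability.

```python
from collections import defaultdict
from statistics import mean, pstdev


def compare_runs(runs):
    """Compare runtimes of repeated runs with and without a recorded RAS event.

    `runs` is an iterable of (app_name, runtime_seconds, had_ras_event) tuples.
    This layout is a hypothetical stand-in for joined scheduler and RAS logs.
    Runtimes are assumed to be positive.
    """
    # Group runtimes per application, split by the event flag.
    groups = defaultdict(lambda: {True: [], False: []})
    for app, runtime, had_event in runs:
        groups[app][bool(had_event)].append(runtime)

    summary = {}
    for app, by_flag in groups.items():
        with_ev, without_ev = by_flag[True], by_flag[False]
        # A comparison needs at least one run in each group.
        if not with_ev or not without_ev:
            continue
        m_with, m_without = mean(with_ev), mean(without_ev)
        summary[app] = {
            "mean_with": m_with,
            "mean_without": m_without,
            # Ratio above 1 means runs with events were slower on average.
            "slowdown": m_with / m_without,
            # Coefficient of variation as a simple variability measure.
            "cv_with": pstdev(with_ev) / m_with,
            "cv_without": pstdev(without_ev) / m_without,
        }
    return summary


# Made-up sample records, for illustration only.
runs = [
    ("app_a", 100.0, False),
    ("app_a", 100.0, False),
    ("app_a", 120.0, True),
    ("app_a", 130.0, True),
    ("app_b", 50.0, False),  # no run with an event, so it is skipped
]
summary = compare_runs(runs)
```

Applications without runs in both groups are skipped, reflecting the abstract's point that the comparison depends on users executing the same application multiple times.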

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1486940
Resource Relation:
Conference: 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) - Dallas, Texas, United States of America - 11/11/2018 8:00:00 PM-11/16/2018 8:00:00 PM
Country of Publication:
United States
Language:
English
