Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer
- ORNL
Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators, and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing runs during which events were recorded with corresponding runs during which no events were recorded. Such a statistical investigation is possible because system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not in every instance, and that different system components affect application performance differently. In particular, our investigation covers the parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users to increase the operational efficiency of extreme-scale systems.
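The core comparison described in the abstract, that is, repeated runs of the same application split by whether a RAS event overlapped the run's nodes and time window, can be illustrated with a minimal sketch. The toy data, column names (run_id, app, node, start, runtime, event_time, category), and the overlap-join logic below are illustrative assumptions for exposition, not the authors' actual pipeline.

```python
# Illustrative sketch (assumed schema and toy data): join a RAS event log
# with a job scheduler database, flag runs that overlap an event in node
# and time, then compare runtimes of event-present vs. event-free runs.
import pandas as pd

# Scheduler database: one row per application run (toy data).
runs = pd.DataFrame({
    "app":     ["lammps", "lammps", "lammps", "gromacs"],
    "node":    ["c0-0n1", "c0-0n2", "c0-0n1", "c1-0n3"],
    "start":   pd.to_datetime(["2018-03-01 10:00", "2018-03-02 10:00",
                               "2018-03-03 10:00", "2018-03-01 11:00"]),
    "runtime": [3600.0, 4150.0, 3590.0, 7200.0],  # wall time in seconds
}).reset_index().rename(columns={"index": "run_id"})
runs["end"] = runs["start"] + pd.to_timedelta(runs["runtime"], unit="s")

# RAS log: one row per recorded event (toy data).
events = pd.DataFrame({
    "node":       ["c0-0n2"],
    "event_time": pd.to_datetime(["2018-03-02 10:30"]),
    "category":   ["gpu_double_bit_error"],
})

# A run is "affected" if any event hit one of its nodes during execution.
pairs = runs.merge(events, on="node", how="left")
pairs["hit"] = pairs["event_time"].between(pairs["start"], pairs["end"])
runs["affected"] = runs["run_id"].map(pairs.groupby("run_id")["hit"].any())

# Per-application runtime statistics, split by event presence.
print(runs.groupby(["app", "affected"])["runtime"]
          .agg(["count", "mean", "std"]))
```

At Titan scale the same idea would be an interval join over millions of rows rather than an in-memory merge, but the grouping logic, repeated runs of one application partitioned by event presence, captures the essence of the comparison the abstract describes.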
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1486940
- Resource Relation:
- Conference: 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, Texas, United States of America, November 11-16, 2018
- Country of Publication:
- United States
- Language:
- English