Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Di, Sheng; Guo, Hanqi; Gupta, Rinku; Pershey, Eric R.; Snir, Marc; Cappello, Franck

doi:10.1109/tpds.2018.2864184

Abstract

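In this study, we explore potential correlations of fatal system events for one of the most powerful supercomputers, IBM Blue Gene/Q Mira, which is deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method to effectively extract fatal events from masses of system messages that are heavily duplicated in the log. LogAider detects temporal correlation very precisely, with a high similarity (up to 95 percent) to the ground truth (i.e., the failure records reported by the administrators). The filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" which can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong. We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.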

Authors:

Di, Sheng [1]; Guo, Hanqi [1]; Gupta, Rinku [1]; Pershey, Eric R. [1]; Snir, Marc [2]; Cappello, Franck [1]

[1] Argonne National Lab. (ANL), Lemont, IL (United States)
[2] Univ. of Illinois at Urbana-Champaign, IL (United States)

Publication Date: 2018-08-14

Research Org.: Argonne National Lab. (ANL), Argonne, IL (United States). Argonne Leadership Computing Facility (ALCF)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1510059

Grant/Contract Number: AC02-06CH11357

Resource Type: Accepted Manuscript

Journal Name: IEEE Transactions on Parallel and Distributed Systems

Additional Journal Information: Journal Volume: 30; Journal Issue: 2; Journal ID: ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; fatal event analysis; mining correlations; peta-scale supercomputer; reliability-availability-serviceability (RAS)

Citation Formats


Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, and Cappello, Franck. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. United States: N. p., 2018. Web. doi:10.1109/tpds.2018.2864184.


Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, & Cappello, Franck. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. United States. https://doi.org/10.1109/tpds.2018.2864184


Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, and Cappello, Franck. 2018. "Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System". United States. https://doi.org/10.1109/tpds.2018.2864184. https://www.osti.gov/servlets/purl/1510059.

@article{osti_1510059,

  title        = {Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System},

  author       = {Di, Sheng and Guo, Hanqi and Gupta, Rinku and Pershey, Eric R. and Snir, Marc and Cappello, Franck},

  abstractNote = {In this study, we explore potential correlations of fatal system events for one of the most powerful supercomputers, IBM Blue Gene/Q Mira, which is deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method to effectively extract fatal events from masses of system messages that are heavily duplicated in the log. LogAider detects temporal correlation very precisely, with a high similarity (up to 95 percent) to the ground truth (i.e., the failure records reported by the administrators). The filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" which can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong. We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.},

  doi          = {10.1109/tpds.2018.2864184},

  journal      = {IEEE Transactions on Parallel and Distributed Systems},

  number       = 2,

  volume       = 30,

  place        = {United States},

  year         = {2018},

  month        = aug

}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1109/tpds.2018.2864184

Citation Metrics: cited by 11 works (citation information provided by Web of Science)

Similar Records in DOE PAGES and OSTI.GOV collections:

Three Mile Island serious but not fatal to nuclear

Journal Article - Electr. World; (United States)

The Three Mile Island nuclear incident will cause some delays in the construction of nuclear plants. But when the emotion dies down, people will realize there is still an obvious need for nuclear power in the long term, even as President Carter indicated in his press conference after his television energy message. This was the conclusion arrived at by several top officials of electric utilities in a spot check on the President's energy message and on their views on Three Mile Island. The officials also commented on decontrol of oil prices and wheeling power to oil-fired regions. Karl H. Rudolph ...

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Conference Park, Byung H. ; Hui, Yawei ; Boehm, Swen ; ... - 2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER)

Reliability, availability and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated from multiple logging systems and sensors that cover many components of the system. The analysis of these data for finding persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult and the unstructured nature and unique properties of log data produced by each subsystem adds another dimension of difficulty in identifying implicit correlation among recorded events. ...
https://doi.org/10.1109/CLUSTER.2018.00073

Characterization and identification of HPC applications at leadership computing facility

Conference Liu, Zhengchun ; Rao, Nageswara ; Kettimuthu, Rajkumar ; ...

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resources demand of applications varies. A deep understanding of ...
https://doi.org/10.1145/3392717.3392774

Full Text Available
Pin-pointing Node Failures in HPC Systems

Conference Roman, E ; Das, A ; Mueller, F ; ...

Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resilience. With increasing scalability required for exascale, accurate fault prediction aiding in quick remedy is hard. With changing supercomputer architectures, distilling fault data from the noisy raw logs requires substantial efforts. Predicting node failures in such voluminous system logs is challenging. To this end, we investigate an interesting way to pin-point node failures in such supercomputing systems. Our study on Cray system data with automated machine learning tools suggests that specific patterns of event messages on node unavailability can be indicator to node failures. This ...
Full Text Available
