DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Abstract

In this study, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method to effectively extract fatal events from masses of system messages that are heavily duplicated in the log. LogAider detects temporal correlations precisely, with up to 95 percent similarity to the ground truth (i.e., the failure records reported by the administrators). Filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the MTTI is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong.
We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.

Authors:
Di, Sheng [1]; Guo, Hanqi [1]; Gupta, Rinku [1]; Pershey, Eric R. [1]; Snir, Marc [2]; Cappello, Franck [1]
  1. Argonne National Lab. (ANL), Lemont, IL (United States)
  2. Univ. of Illinois at Urbana-Champaign, IL (United States)
Publication Date:
2018-08-14
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States). Argonne Leadership Computing Facility (ALCF)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1510059
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 30; Journal Issue: 2; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; fatal event analysis; mining correlations; peta-scale supercomputer; reliability-availability-serviceability (RAS)

Citation Formats

Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, and Cappello, Franck. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. United States: N. p., 2018. Web. doi:10.1109/tpds.2018.2864184.
Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, & Cappello, Franck. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. United States. https://doi.org/10.1109/tpds.2018.2864184
Di, Sheng, Guo, Hanqi, Gupta, Rinku, Pershey, Eric R., Snir, Marc, and Cappello, Franck. 2018. "Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System". United States. https://doi.org/10.1109/tpds.2018.2864184. https://www.osti.gov/servlets/purl/1510059.
@article{osti_1510059,
title = {Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System},
author = {Di, Sheng and Guo, Hanqi and Gupta, Rinku and Pershey, Eric R. and Snir, Marc and Cappello, Franck},
abstractNote = {In this study, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method to effectively extract fatal events from masses of system messages that are heavily duplicated in the log. LogAider detects temporal correlations precisely, with up to 95 percent similarity to the ground truth (i.e., the failure records reported by the administrators). Filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the MTTI is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong.
We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.},
doi = {10.1109/tpds.2018.2864184},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 2,
volume = 30,
place = {United States},
year = {2018},
month = aug
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 11 works
Citation information provided by
Web of Science
