A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform

Hui, Yawei; Park, Byung; Engelmann, Christian

doi:10.1109/FTXS.2018.00007

Conference · November 1, 2018

DOI: https://doi.org/10.1109/FTXS.2018.00007 · OSTI ID: 1486939

ORNL

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation, and computation. One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although plenty of operational and maintenance information is collected and stored in real time, and may yield insights into short- and long-term system status, it is difficult to present this information in a comprehensive form. In this work, we present system information entropy (SIE), a newly developed metric that leverages the power of traditional machine learning techniques and information theory. By compressing the multivariate, multi-dimensional event information recorded during the operation of the targeted system into a single time series of SIE, we demonstrate that the historical system status can be represented sensitively, concisely, and comprehensively. Given an indicator as sharp as SIE, we argue that follow-up analytics based on SIE, using other sophisticated approaches such as pattern recognition in the temporal domain or causality analysis incorporating extra independent metrics of the system, will reveal in-depth knowledge about system status.
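The abstract does not give the definition of SIE, so the following is only a rough illustration of the general idea of reducing an event log to a single entropy time series. It assumes, purely for illustration, that events are bucketed into fixed time windows and that the plain Shannon entropy of the event-type distribution in each window is used. The function names, the `(timestamp, event_type)` input format, and the window length are all hypothetical; the paper's actual metric also involves machine learning steps that are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Skip zero counts: the term p * log2(p) tends to 0 as p tends to 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_time_series(events, window_seconds):
    """Collapse (timestamp_seconds, event_type) pairs into one entropy value per window.

    Hypothetical stand-in for an SIE-like series, NOT the paper's definition.
    Returns a list of (window_start, entropy_bits) sorted by window start.
    """
    windows = defaultdict(Counter)
    for timestamp, event_type in events:
        # Align each event to the start of its fixed-length window.
        start = int(timestamp // window_seconds) * window_seconds
        windows[start][event_type] += 1
    return [(start, shannon_entropy(windows[start].values()))
            for start in sorted(windows)]


if __name__ == "__main__":
    # Made-up sample events; the event type names are illustrative only.
    sample = [(0, "mce"), (10, "lustre"), (70, "mce"), (80, "mce")]
    for start, bits in entropy_time_series(sample, window_seconds=60):
        print(start, bits)
```

In this sketch a window dominated by a single event type yields entropy near zero, while a window with many equally frequent event types yields higher entropy, which is one simple way a single series can reflect shifts in the mix of logged events.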

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1486939

Resource Relation: Conference: The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, Texas, United States of America, November 11-16, 2018

Country of Publication: United States

Language: English
