A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform
- ORNL
Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation, and computation. One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although plenty of operational and maintenance information is collected and stored in real time, and may yield insights about short- and long-term system status, it is difficult to present this information in a comprehensive form. In this work, we present system information entropy (SIE), a newly developed metric that leverages traditional machine learning techniques and information theory. By compressing the multivariate, multidimensional event information recorded during the operation of the targeted system into a single SIE time series, we demonstrate that the historical system status can be represented sensitively, concisely, and comprehensively. Given an indicator as sharp as SIE, we argue that follow-up analytics built on it, using other sophisticated approaches such as pattern recognition in the temporal domain or causality analysis incorporating extra independent metrics of the system, will reveal in-depth knowledge about system status.
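The abstract does not give the definition of SIE, so the following is only an illustrative sketch of the underlying idea: reducing many event types to one entropy value per time window. It assumes plain Shannon entropy over event-type counts and leaves out the machine learning step entirely; the function names, the 60-second window, and the sample events are hypothetical and are not taken from the paper or from LogSCAN.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a discrete distribution given as counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = sum_i p_i * log2(1 / p_i), with p_i = c_i / total
    return sum((c / total) * math.log2(total / c) for c in counts)


def windowed_entropy(events, window_seconds):
    """Turn (timestamp_seconds, event_type) pairs into {window_start: entropy}.

    Assumption: each fixed-width window is summarized by the entropy of its
    event-type distribution. This is a stand-in, not the published SIE formula.
    """
    buckets = defaultdict(Counter)
    for timestamp, event_type in events:
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start][event_type] += 1
    return {
        start: shannon_entropy(counter.values())
        for start, counter in sorted(buckets.items())
    }


# Hypothetical sample: a window dominated by one event type, then a mixed one.
events = [
    (0, "mce"), (10, "mce"), (20, "mce"), (30, "mce"),
    (60, "mce"), (70, "lustre"), (80, "gpu_xid"), (90, "kernel_panic"),
]
series = windowed_entropy(events, window_seconds=60)
```

In this toy input the first window contains a single event type and the second contains four equally frequent types, so the second window scores higher. A shift of that kind in the series is what a single-valued status indicator is meant to expose.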
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1486939
- Resource Relation:
- Conference: The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18) - Dallas, Texas, United States of America - 11/11/2018-11/16/2018
- Country of Publication:
- United States
- Language:
- English