OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform

Abstract

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation, and computation. One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although a large volume of operational and maintenance information is collected and stored in real time, and may yield insights into short- and long-term system status, it is difficult to present this information in a comprehensive form. In this work, we present system information entropy (SIE), a newly developed metric that draws on traditional machine learning techniques and information theory. By compressing the multivariate, multi-dimensional event information recorded during the operation of the targeted system into a single SIE time series, we demonstrate that historical system status can be represented sensitively, concisely, and comprehensively. Given an indicator as sharp as SIE, we argue that follow-up analytics built on it, using more sophisticated approaches such as pattern recognition in the temporal domain or causality analysis that incorporates additional independent system metrics, can reveal in-depth knowledge about system status.
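
This record does not give the definition of SIE, so the following is only a minimal sketch of the general idea the abstract describes: collapsing multi-type event records into a single entropy-valued time series. It assumes plain Shannon entropy over event-type frequencies in fixed time windows, which is an illustrative stand-in and not the authors' metric. The names `shannon_entropy`, `entropy_time_series`, and `window_seconds`, and the `(timestamp, event_type)` input layout, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a Counter of event-type occurrences."""
    total = sum(counts.values())
    # Sum p * log2(1/p) over event types that actually occurred.
    return sum((c / total) * math.log2(total / c) for c in counts.values() if c > 0)


def entropy_time_series(events, window_seconds):
    """Reduce (timestamp_seconds, event_type) pairs to one entropy value per window.

    ASSUMPTION: fixed-width windows and raw event-type frequencies; the
    published SIE metric may be defined quite differently.
    """
    buckets = defaultdict(Counter)
    for timestamp, event_type in events:
        buckets[int(timestamp // window_seconds)][event_type] += 1
    # Return (window_start_seconds, entropy) pairs in chronological order.
    return [
        (index * window_seconds, shannon_entropy(counts))
        for index, counts in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # Hypothetical event stream: two event types in the first minute,
    # a single repeated type in the second.
    sample = [(0, "mce"), (10, "lustre"), (70, "mce"), (80, "mce")]
    for start, value in entropy_time_series(sample, window_seconds=60):
        print(start, value)
```

A window dominated by one event type yields entropy near zero, while a window with many equally frequent types yields a high value, which is the sense in which a single series can summarize a mixed event stream.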

Authors:
Hui, Yawei [1]; Park, Byung H. [1]; Engelmann, Christian [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1486939
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, Texas, United States of America, 11/11/2018 to 11/16/2018
Country of Publication:
United States
Language:
English

Citation Formats

Hui, Yawei, Park, Byung H., and Engelmann, Christian. A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform. United States: N. p., 2018. Web. doi:10.1109/FTXS.2018.00007.
Hui, Yawei, Park, Byung H., & Engelmann, Christian. A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform. United States. doi:10.1109/FTXS.2018.00007.
Hui, Yawei, Park, Byung H., and Engelmann, Christian. Thu . "A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform". United States. doi:10.1109/FTXS.2018.00007. https://www.osti.gov/servlets/purl/1486939.
@article{osti_1486939,
title = {A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform},
author = {Hui, Yawei and Park, Byung H. and Engelmann, Christian},
doi = {10.1109/FTXS.2018.00007},
place = {United States},
year = {2018},
month = {11}
}
