A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Park, Byung H.; Hui, Yawei; Boehm, Swen; Ashraf, Rizwan A.; Layton, Christopher; Engelmann, Christian

doi:10.1109/CLUSTER.2018.00073

Conference · September 1, 2018 · 2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER)

DOI: https://doi.org/10.1109/CLUSTER.2018.00073 · OSTI ID: 1567487

Reliability, availability, and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated from multiple logging systems and sensors that cover many components of the system. The analysis of these data for finding persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail their workflows using three years of log data collected from ORNL's Titan supercomputer.

OSTI does not have a digital full text copy available. For more information, please see document availability, search WorldCat, or search Google Scholar.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science; USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1567487

Journal Information: 2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER); ISSN 1552-5244

Country of Publication: United States

Language: English

Similar Records

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Conference · November 1, 2018 · OSTI ID: 1570137

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Conference · September 1, 2017 · OSTI ID: 1460236

Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Book · September 17, 2016 · OSTI ID: 1393595

Related Subjects

Computer Science
Engineering
