Title: GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility

Abstract

In this paper, we describe the GUIDE framework used to collect, federate, and analyze log data from the Oak Ridge Leadership Computing Facility (OLCF), and how we use that data to derive insights into facility operations. We collect system logs and extract monitoring data at every level of the various OLCF subsystems, and have developed a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.

Authors:
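Vazhkudai, Sudharshan S. [1]; Miller, Ross [1]; Tiwari, Devesh [2]; Zimmer, Christopher [1]; Wang, Feiyi [1]; Oral, Sarp [1]; Gunasekaran, Raghul [1]; Steinert, Deryl [1]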
  1. Oak Ridge National Laboratory
  2. Northeastern University
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567468
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: SC '17 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Country of Publication:
United States
Language:
English
Subject:
Computer Science

Citation Formats

Vazhkudai, Sudharshan S., Miller, Ross, Tiwari, Devesh, Zimmer, Christopher, Wang, Feiyi, Oral, Sarp, Gunasekaran, Raghul, and Steinert, Deryl. GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. United States: N. p., 2017. Web. doi:10.1145/3126908.3126946.
Vazhkudai, Sudharshan S., Miller, Ross, Tiwari, Devesh, Zimmer, Christopher, Wang, Feiyi, Oral, Sarp, Gunasekaran, Raghul, & Steinert, Deryl. GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. United States. doi:10.1145/3126908.3126946.
Vazhkudai, Sudharshan S., Miller, Ross, Tiwari, Devesh, Zimmer, Christopher, Wang, Feiyi, Oral, Sarp, Gunasekaran, Raghul, and Steinert, Deryl. Sun . "GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility". United States. doi:10.1145/3126908.3126946.
@article{osti_1567468,
title = {GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility},
author = {Vazhkudai, Sudharshan S. and Miller, Ross and Tiwari, Devesh and Zimmer, Christopher and Wang, Feiyi and Oral, Sarp and Gunasekaran, Raghul and Steinert, Deryl},
abstractNote = {In this paper, we describe the GUIDE framework used to collect, federate, and analyze log data from the Oak Ridge Leadership Computing Facility (OLCF), and how we use that data to derive insights into facility operations. We collect system logs and extract monitoring data at every level of the various OLCF subsystems, and have developed a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.},
doi = {10.1145/3126908.3126946},
place = {United States},
year = {2017},
month = {1}
}
