OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Real-Time System Log Monitoring/Analytics Framework

Conference
OSTI ID: 1056901

Analyzing system logs provides useful insights for identifying system and application anomalies and helps make better use of system resources. Nevertheless, it is simply not practical to scan through the raw log messages of large-scale systems on a regular basis. First, the sheer volume of unstructured log messages hurts readability; second, correlating the log messages with system events is a daunting task. These factors limit the use of large-scale system logs primarily to generating alerts on known system events and to post-mortem diagnosis of previously unknown system events that impacted the system's performance. In this paper, we describe a log monitoring framework that enables prompt analysis of system events in real time. Our web-based framework provides a summarized, real-time view of console, netwatch, consumer, and apsched logs. The logs are parsed and processed to generate views by application, by message type, by individual compute node or group of compute nodes, and by section of the compute platform. In addition, from past application runs we build a statistical profile of user and application characteristics with respect to known system events, recoverable and non-recoverable error messages, and resources utilized. The web-based tool is being developed for Jaguar XT5 at the Oak Ridge Leadership Computing Facility.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Sponsoring Organization:
USDOE
DOE Contract Number:
DE-AC05-00OR22725
Resource Relation:
Conference: Cray User Group, Fairbanks, AK, USA, May 22, 2011
Country of Publication:
United States
Language:
English