Real-Time System Log Monitoring/Analytics Framework
- ORNL
Analyzing system logs provides useful insights for identifying system and application anomalies and helps make better use of system resources. Nevertheless, it is not practical to scan through raw log messages on a regular basis for large-scale systems. First, the sheer volume of unstructured log messages hurts readability; second, correlating log messages with system events is a daunting task. These factors limit the use of large-scale system logs primarily to generating alerts on known system events and to post-mortem diagnosis of previously unknown system events that impacted the system's performance. In this paper, we describe a log monitoring framework that enables prompt analysis of system events in real time. Our web-based framework provides a summarized view of console, netwatch, consumer, and apsched logs in real time. The logs are parsed and processed to generate views by application, message type, individual compute node or group of compute nodes, and section of the compute platform. In addition, from past application runs we build a statistical profile of user and application characteristics with respect to known system events, recoverable and non-recoverable error messages, and resources utilized. The web-based tool is being developed for Jaguar XT5 at the Oak Ridge Leadership Computing Facility.
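The parse-and-summarize step described in the abstract can be illustrated with a minimal Python sketch. It is not the framework's implementation: the line layout (`<timestamp> <source> <node> <message>`), the message-type rules, and the names `LINE_RE`, `MESSAGE_TYPES`, `classify`, and `summarize` are all assumptions made for illustration, since the record does not give the actual log formats.

```python
import re
from collections import Counter, defaultdict

# Hypothetical line layout (an assumption, not the real log format):
#   <timestamp> <source> <node> <message text>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<source>\S+)\s+(?P<node>\S+)\s+(?P<msg>.*)$"
)

# Hypothetical message-type rules, checked in order; first match wins.
MESSAGE_TYPES = [
    ("machine_check", re.compile(r"machine check", re.IGNORECASE)),
    ("link_error", re.compile(r"link (inactive|failed)", re.IGNORECASE)),
    ("out_of_memory", re.compile(r"out of memory", re.IGNORECASE)),
]


def classify(msg):
    """Return the name of the first matching message type, else 'other'."""
    for name, pattern in MESSAGE_TYPES:
        if pattern.search(msg):
            return name
    return "other"


def summarize(lines):
    """Count messages per log source, per message type, and per node."""
    by_source = Counter()
    by_type = Counter()
    by_node = defaultdict(Counter)  # node -> Counter of message types
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        msg_type = classify(match.group("msg"))
        by_source[match.group("source")] += 1
        by_type[msg_type] += 1
        by_node[match.group("node")][msg_type] += 1
    return {"by_source": by_source, "by_type": by_type, "by_node": by_node}
```

Calling `summarize` on an iterable of log lines returns three counters that a web front end could render as per-source, per-type, and per-node views. Grouping nodes into platform sections would need a node-to-section mapping, which is left out here because the record does not describe one.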
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- DE-AC05-00OR22725
- OSTI ID:
- 1056901
- Resource Relation:
- Conference: Cray User Group, Fairbanks, AK, USA, 2011-05-22
- Country of Publication:
- United States
- Language:
- English