Use of the ERD for administrative monitoring of Theta
Monitoring the state of an HPC cluster in a timely and accurate fashion is critical to most system administration functions. For many Cray users, the first step in monitoring is ingestion of log files. Unfortunately, log parsing is an inherently inefficient process, requiring multiple software components to read from and write to files on disk. Cray's own utilities use a message bus, the Event Router Daemon (ERD), for a wide variety of purposes. At the Argonne Leadership Computing Facility (ALCF), we have begun to use this message bus for monitoring via a client library written in Go, allowing us to read structured data directly from Cray's services and, in many instances, bypass log files entirely. In this paper, we examine the implementation and utilization of this approach on Theta, our 4,392-node XC40, as well as the overall benefits and drawbacks of using the ERD for real-time monitoring.
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: Argonne National Laboratory - Argonne Leadership Computing Facility
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1559857
- Journal Information: Concurrency and Computation (Online), Vol. 31, Issue 16; Conference: 2017 Practice and Experience in Advanced Research Computing, New Orleans, LA, US, 07/09/17 - 07/13/17; ISSN 1532-0634
- Publisher: Wiley
- Country of Publication: United States
- Language: English