Title: Use of the ERD for administrative monitoring of Theta

Conference · Concurrency and Computation (Online)
DOI: https://doi.org/10.1002/cpe.5099 · OSTI ID: 1559857

Monitoring the state of an HPC cluster in a timely and accurate fashion is critical to most system administration functions. For many Cray users, the first step in monitoring is the ingestion of log files. Unfortunately, log parsing is an inherently inefficient process, requiring multiple software components to read from and write to files on disk. Cray's own utilities use a message bus, the Event Router Daemon (ERD), for a wide variety of purposes. At the Argonne Leadership Computing Facility (ALCF), we have begun to use this message bus for monitoring via a client library written in Go, allowing us to read structured data directly from Cray's services and, in many instances, bypass log files entirely. In this paper, we examine the implementation and use of this approach on Theta, our 4,392-node XC40, as well as the overall benefits and drawbacks of using the ERD for real-time monitoring.

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
Argonne National Laboratory - Argonne Leadership Computing Facility
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1559857
Journal Information:
Concurrency and Computation (Online), Vol. 31, Issue 16; Conference: 2017 Practice and Experience in Advanced Research Computing, New Orleans, LA, US, 07/09/17 - 07/13/17; ISSN 1532-0634
Publisher:
Wiley
Country of Publication:
United States
Language:
English

Similar Records

Use of the ERD for administrative monitoring of Theta
Journal Article · Jan 29, 2019 · Concurrency and Computation. Practice and Experience · OSTI ID:1559857

Argonne Leadership Computing Facility: 2021 Operational Assessment Report
Technical Report · Jan 1, 2021 · OSTI ID:1559857

The NetLogger Toolkit V2.0
Software · Mar 28, 2003 · OSTI ID:1559857

Related Subjects