U.S. Department of Energy
Office of Scientific and Technical Information

STREAM: A Scalable Federated HPC Telemetry Platform

Conference ·
OSTI ID:1995656

Obtaining and analyzing high performance computing (HPC) telemetry in real time is a complex task that can impact algorithmic performance, operating costs, and ultimately scientific outcomes. If your organization operates multiple HPC systems, filesystems, and clusters, telemetry streams can be synthesized to ease the operational and analytics burden. To collect this telemetry, the Oak Ridge Leadership Computing Facility (OLCF) has deployed STREAM (Streaming Telemetry for Resource Events, Analytics, and Monitoring), a distributed, high-performance message bus based on Apache Kafka. STREAM collects center-wide performance information and must interface with many sources, including five HPE-deployed supercomputers, each with its own Kafka cluster managed by HPCM. OLCF supercomputers and their attached scratch filesystems currently send more than 300 million messages to over 200 topics, producing around 1.3 terabytes of telemetry data per day for STREAM. This paper describes the architectural principles that enable STREAM to be both resilient and highly performant while supporting multiple upstream Kafka clusters and other data sources. It also discusses the design challenges and decisions faced in adapting our existing system-monitoring infrastructure to support the first exascale computing platform.
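
As a rough illustration of the scale quoted in the abstract, the figures can be turned into average rates. This is a back-of-the-envelope sketch, not a measurement from the paper. It assumes that the 300 million messages are also a per-day figure (the abstract does not say so explicitly) and that a terabyte means 10^12 bytes. The function name `average_rates` is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rates(messages_per_day: float, bytes_per_day: float) -> dict:
    """Convert daily totals into average per-second and per-message figures."""
    return {
        "messages_per_second": messages_per_day / SECONDS_PER_DAY,
        "bytes_per_second": bytes_per_day / SECONDS_PER_DAY,
        "bytes_per_message": bytes_per_day / messages_per_day,
    }


# Assumed inputs: 300 million messages per day (assumption: the abstract gives
# no time unit for the message count) and 1.3 TB per day with 1 TB = 10**12 bytes.
rates = average_rates(messages_per_day=300e6, bytes_per_day=1.3e12)

for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come out to roughly 3,500 messages per second, about 15 MB per second, and a little over 4 kB per message. Because the abstract says "more than" 300 million messages, the true message rate would be higher and the true average message size smaller. Peak rates on a real system can also differ substantially from daily averages.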

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1995656
Country of Publication:
United States
Language:
English
