U.S. Department of Energy
Office of Scientific and Technical Information

Data Center Facility Monitoring with Physics Aware Approach

Conference

The U.S. Department of Energy's National Renewable Energy Laboratory (NREL) hosts one of the world's most energy-efficient HPC data centers. The facility uses component-level warm-water liquid cooling to remove heat from the data center efficiently and either capture it for reuse in the building or reject it to the atmosphere. Given the complexity of this system, building data-driven tools for holistically monitoring and operating the entire data center is a priority for ensuring maximal efficiency and resiliency. In this advanced smart facility, over one million metrics are recorded per minute using a state-of-the-art streaming data architecture and software to capture and process the state of the system in real time. Here we detail two efforts to effectively analyze, visualize, and interpret this large-volume streaming data. First, we have developed a novel, flexible system for identifying and visualizing individual metric anomalies and component performance across the data center, using automatic metadata extraction and physically motivated visualization for quick interpretation. Second, to directly connect system maintenance to data stream processing, we explore a physics-informed, multi-metric drift and anomaly detection application that detects scale buildup in heat exchangers.

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE Office of Energy Efficiency and Renewable Energy (EERE); Hewlett-Packard Enterprise
DOE Contract Number:
AC36-08GO28308
OSTI ID:
1975001
Report Number(s):
NREL/CP-2C00-84123; MainId:84896; UUID:f393e45d-ff73-4498-b249-6853441339d0; MainAdminID:69575
Resource Relation:
Conference: Presented at the ISC High Performance 2022 International Workshops, 29 May - 2 June 2022, Hamburg, Germany
Country of Publication:
United States
Language:
English
