DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components

Abstract

Elevated temperatures limit the peak performance of systems because of frequent interventions by thermal throttling. Non-uniform thermal states across system nodes also cause performance variation among seemingly equivalent nodes, leading to significant degradation of overall performance. In this paper, we present a framework for creating a lightweight thermal prediction system suitable for run-time management decisions. We pursue two avenues to explore optimized lightweight thermal predictors. First, we use feature selection algorithms to improve the performance of previously designed machine learning methods. Second, we develop alternative methods using neural network and linear regression-based methods to perform a comprehensive comparative study of prediction methods. We show that our optimized models achieve improved performance with better prediction accuracy and lower overhead as compared with the Gaussian process model proposed previously. Specifically, we present a reduced version of the Gaussian process model, a neural network-based model, and a linear regression-based model. Using the optimization methods, we are able to reduce the average prediction error of the Gaussian process model from 4.2 degrees C to 2.9 degrees C. We also show that the newly developed models using neural network and Lasso linear regression have average prediction errors of 2.9 degrees C and 3.8 degrees C, respectively. The prediction overheads are 0.22, 0.097, and 0.026 ms per prediction for the reduced Gaussian process, neural network, and Lasso linear regression models, respectively, compared with 0.57 ms per prediction for the previous Gaussian process model. We have implemented our proposed thermal prediction models on a two-node system configuration to help identify the optimal task placement. The task placement identified by the models reduces the average system temperature by up to 11.9 degrees C without any performance degradation. Furthermore, these models respectively achieve 75, 82.5, and 74.17 percent success rates in correctly pointing to those task placements with better thermal response, compared with 72.5 percent success for the original model in achieving the same objective. Lastly, we extended our analysis to a 16-node system, and we were able to train models and execute them in real time to guide task migration and achieve, on average, a 17 percent reduction in the overall system cooling power.
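
The abstract describes using a thermal predictor to pick the task placement with the better thermal response on a two-node system. The following is only a minimal sketch of that selection step, not the authors' implementation: the function `best_placement`, the `toy_predict` predictor, the node and task names, and every temperature value are hypothetical and invented for illustration. It assumes one task per node and scores each candidate placement by its mean predicted temperature.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical predictor interface: (node, task) -> predicted temperature in degrees C.
Predictor = Callable[[str, str], float]


def best_placement(
    nodes: Sequence[str], tasks: Sequence[str], predict: Predictor
) -> Tuple[Dict[str, str], float]:
    """Return the one-task-per-node placement with the lowest mean predicted temperature."""
    if len(nodes) != len(tasks) or not nodes:
        raise ValueError("this sketch assumes exactly one task per node")
    best_assignment: Dict[str, str] = {}
    best_mean = float("inf")
    # Enumerate every ordering of tasks over the nodes (fine for small systems).
    for order in permutations(tasks):
        temps = [predict(node, task) for node, task in zip(nodes, order)]
        mean_temp = sum(temps) / len(temps)
        if mean_temp < best_mean:
            best_assignment = dict(zip(nodes, order))
            best_mean = mean_temp
    return best_assignment, best_mean


# Made-up predictions standing in for a trained model (illustrative values only).
TOY_TEMPS = {
    ("node0", "compute_heavy"): 71.0,
    ("node0", "memory_heavy"): 64.0,
    ("node1", "compute_heavy"): 66.0,
    ("node1", "memory_heavy"): 62.0,
}


def toy_predict(node: str, task: str) -> float:
    """Look up the invented temperature for a (node, task) pair."""
    return TOY_TEMPS[(node, task)]


placement, mean_temp = best_placement(
    ["node0", "node1"], ["compute_heavy", "memory_heavy"], toy_predict
)
print(placement, mean_temp)
```

Exhaustive enumeration is practical only for a handful of nodes; a larger configuration such as the 16-node system mentioned in the abstract would likely need a heuristic search, which this sketch does not attempt.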

Authors:
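Zhang, Kaicheng [1]; Guliani, Akhil [1]; Ogrenci-Memik, Seda [1]; Memik, Gokhan [1]; Yoshii, Kazutomo [2]; Sankaran, Rajesh [2]; Beckman, Pete [2]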
  1. Northwestern Univ., Evanston, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
2017-07-28
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
OSTI Identifier:
1461523
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 29; Journal Issue: 2; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Thermal modeling; high performance computing systems; many-core processors; operating systems

Citation Formats

Zhang, Kaicheng, Guliani, Akhil, Ogrenci-Memik, Seda, Memik, Gokhan, Yoshii, Kazutomo, Sankaran, Rajesh, and Beckman, Pete. Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components. United States: N. p., 2017. Web. doi:10.1109/TPDS.2017.2732951.
Zhang, Kaicheng, Guliani, Akhil, Ogrenci-Memik, Seda, Memik, Gokhan, Yoshii, Kazutomo, Sankaran, Rajesh, & Beckman, Pete. Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components. United States. https://doi.org/10.1109/TPDS.2017.2732951
Zhang, Kaicheng, Guliani, Akhil, Ogrenci-Memik, Seda, Memik, Gokhan, Yoshii, Kazutomo, Sankaran, Rajesh, and Beckman, Pete. 2017. "Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components". United States. https://doi.org/10.1109/TPDS.2017.2732951. https://www.osti.gov/servlets/purl/1461523.
@article{osti_1461523,
title = {Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components},
author = {Zhang, Kaicheng and Guliani, Akhil and Ogrenci-Memik, Seda and Memik, Gokhan and Yoshii, Kazutomo and Sankaran, Rajesh and Beckman, Pete},
abstractNote = {Elevated temperatures limit the peak performance of systems because of frequent interventions by thermal throttling. Non-uniform thermal states across system nodes also cause performance variation within seemingly equivalent nodes leading to significant degradation of overall performance. In this paper we present a framework for creating a lightweight thermal prediction system suitable for run-time management decisions. We pursue two avenues to explore optimized lightweight thermal predictors. First, we use feature selection algorithms to improve the performance of previously designed machine learning methods. Second, we develop alternative methods using neural network and linear regression-based methods to perform a comprehensive comparative study of prediction methods. We show that our optimized models achieve improved performance with better prediction accuracy and lower overhead as compared with the Gaussian process model proposed previously. Specifically we present a reduced version of the Gaussian process model, a neural network-based model, and a linear regression-based model. Using the optimization methods, we are able to reduce the average prediction errors in the Gaussian process from 4.2 degrees C to 2.9 degrees C. We also show that the newly developed models using neural network and Lasso linear regression have average prediction errors of 2.9 degrees C and 3.8 degrees C respectively. The prediction overheads are 0.22, 0.097, and 0.026 ms per prediction for reduced Gaussian process, neural network, and Lasso linear regression models, respectively, compared with 0.57 ms per prediction for the previous Gaussian process model. We have implemented our proposed thermal prediction models on a two-node system configuration to help identify the optimal task placement. The task placement identified by the models reduces the average system temperature by up to 11.9 degrees C without any performance degradation. Furthermore, these models respectively achieve 75, 82.5, and 74.17 percent success rates in correctly pointing to those task placements with better thermal response, compared with 72.5 percent success for the original model in achieving the same objective. Lastly, we extended our analysis to a 16-node system and we were able to train models and execute them in real time to guide task migration and achieve on average 17 percent reduction in the overall system cooling power.},
doi = {10.1109/TPDS.2017.2732951},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 2,
volume = 29,
place = {United States},
year = {2017},
month = {7}
}

Citation Metrics:
Cited by: 37 works
Citation information provided by
Web of Science

Works referencing / citing this record:

Machine learning and artificial neural network accelerated computational discoveries in materials science
journal, November 2019

  • Hong, Yang; Hou, Bo; Jiang, Hengle
  • WIREs Computational Molecular Science, Vol. 10, Issue 3
  • DOI: 10.1002/wcms.1450

Proposing Enhanced Feature Engineering and a Selection Model for Machine Learning Processes
journal, April 2018

  • Uddin, Muhammad; Lee, Jeongkyu; Rizvi, Syed
  • Applied Sciences, Vol. 8, Issue 4
  • DOI: 10.3390/app8040646