Indicator-directed Dynamic Power Management for Iterative Workloads on GPU-Accelerated Systems
- Clemson University
- BATTELLE (PACIFIC NW LAB)
Modern high-performance and warehouse computing centers show strong interest in minimizing system power consumption while satisfying customers' quality-of-service (QoS) requirements. Dynamic voltage and frequency scaling (DVFS) is effective for achieving this goal. Nevertheless, automating the process online and making it transparent to users must address three major challenges: (1) Complexity: today's hardware components (e.g., CPUs, GPUs, memory, and network) can each be configured in several to dozens of frequency/voltage states to satisfy divergent system demands. Given their combinations and the emergence of heterogeneity, searching for the optimal configuration in the design space online can be time-consuming. (2) QoS guarantees: user-defined objectives such as power constraints and performance targets must be monitored, predicted, and ensured on a best-effort basis. (3) Adaptability: various known and unknown workloads run on these systems; workload characteristics should be quickly determined and configurations dynamically adjusted in accordance with the workloads and QoS. In this work, we focus on applications exhibiting an interesting feature, iterative or periodic behavior, which is common among conventional HPC and emerging machine learning workloads. We propose an online dynamic power-performance (ODPP) management framework that dynamically adjusts GPU DVFS configurations to meet performance and power objectives and constraints, without any code annotation or intrusion. In particular, ODPP extracts performance and power indicators for applications from their resource utilization profiles over a short episode. It then automatically constructs an accurate model that infers from the indicators how the application's performance and power vary with GPU core and memory frequencies. Aided by the model, ODPP can quickly determine the most appropriate DVFS configuration for the execution of both seen and unseen applications.
We evaluate ODPP on an NVIDIA GPU using multiple Exascale Computing Project (ECP) and deep learning applications.
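To illustrate the kind of indicator-directed selection the abstract describes, the sketch below enumerates candidate GPU core/memory frequency pairs and picks the one with the highest predicted performance under a power cap. This is not the paper's actual ODPP model: the indicator names, the toy linear model, its coefficients, and the frequency lists are all hypothetical placeholders standing in for a learned model and driver-reported DVFS states.

```python
# Illustrative sketch (not ODPP's actual model): choose a GPU
# core/memory DVFS pair that maximizes predicted performance under a
# power cap, given indicators profiled over a short episode.
from itertools import product

# Hypothetical candidate DVFS states (MHz); a real system would query
# these from the GPU driver (e.g., via NVML).
CORE_FREQS = [607, 810, 1012, 1215, 1417]
MEM_FREQS = [405, 810, 2505]

def predict(indicators, f_core, f_mem):
    """Toy linear model: performance scales with the compute indicator
    times core frequency plus the memory indicator times memory
    frequency; power grows with both frequencies (toy coefficients)."""
    perf = (indicators["compute_util"] * f_core
            + indicators["mem_util"] * f_mem)
    power = 50.0 + 0.05 * f_core + 0.02 * f_mem  # watts
    return perf, power

def best_config(indicators, power_cap):
    """Return the frequency pair with the highest predicted performance
    whose predicted power stays under the cap, or None if none fits."""
    best, best_perf = None, float("-inf")
    for f_core, f_mem in product(CORE_FREQS, MEM_FREQS):
        perf, power = predict(indicators, f_core, f_mem)
        if power <= power_cap and perf > best_perf:
            best, best_perf = (f_core, f_mem), perf
    return best

# Example: a compute-bound episode profile under a 150 W cap.
print(best_config({"compute_util": 0.9, "mem_util": 0.2}, power_cap=150.0))
```

Because the candidate space is small (core states times memory states), exhaustive enumeration per episode is cheap; the expensive part in practice is building an accurate indicator-to-performance/power model, which ODPP constructs automatically from short profiling episodes.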
- Research Organization:
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1661894
- Report Number(s):
- PNNL-SA-148280
- Resource Relation:
- Conference: The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2020), May 11-14, 2020, Melbourne, Australia
- Country of Publication:
- United States
- Language:
- English