Experiences in autotuning matrix multiplication for energy minimization on GPUs
Abstract
In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and optimization, which target runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance-optimal kernel is not the most efficient kernel in overall resource use. Copyright © 2015 John Wiley & Sons, Ltd.
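The distinction the abstract draws between the fastest kernel and the most energy-efficient one can be illustrated with a minimal Python sketch. It assumes energy-to-solution is approximated as average power times runtime and uses the standard 2n³ operation count for dense matrix multiplication. The `KernelRun` type, the variant names, and all timing and power figures are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelRun:
    """One autotuning candidate (all values here are hypothetical)."""
    name: str       # label of the kernel variant
    seconds: float  # runtime of the kernel
    watts: float    # average power draw during the run

    @property
    def joules(self) -> float:
        # Energy-to-solution approximated as average power times runtime.
        return self.seconds * self.watts


def dgemm_flops(n: int) -> float:
    # Standard operation count for an n x n x n matrix multiplication.
    return 2.0 * n ** 3


def gflops_per_second(run: KernelRun, n: int) -> float:
    return dgemm_flops(n) / run.seconds / 1e9


def gflops_per_joule(run: KernelRun, n: int) -> float:
    return dgemm_flops(n) / run.joules / 1e9


def fastest(runs):
    return min(runs, key=lambda r: r.seconds)


def most_energy_efficient(runs):
    return min(runs, key=lambda r: r.joules)


# Made-up numbers: variants a and b run equally fast but draw different
# power; variant c is slightly faster but draws the most power.
RUNS = [
    KernelRun("variant_a", seconds=1.00, watts=200.0),
    KernelRun("variant_b", seconds=1.00, watts=170.0),
    KernelRun("variant_c", seconds=0.95, watts=215.0),
]

if __name__ == "__main__":
    n = 4096
    for run in RUNS:
        print(run.name,
              f"{gflops_per_second(run, n):.1f} GFLOP/s",
              f"{gflops_per_joule(run, n):.3f} GFLOP/J")
    print("fastest:", fastest(RUNS).name)
    print("most energy-efficient:", most_energy_efficient(RUNS).name)
```

With these placeholder values the two selection criteria pick different variants, which is the kind of outcome the abstract describes; real conclusions depend on measured power and runtime.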
- Authors:
- Anzt, Hartwig; Haugen, Blake; Kurzak, Jakub; Luszczek, Piotr; Dongarra, Jack
- Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science
- Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
- Publication Date:
- 2015-05-20
- Research Org.:
- Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Intel Corporation (United States); Advanced Micro Devices, Inc. (AMD) (United States); Russian Scientific Fund (Russian Federation)
- Contributing Org.:
- Univ. of Manchester (United Kingdom)
- OSTI Identifier:
- 1361296
- Alternate Identifier(s):
- OSTI ID: 1401625
- Grant/Contract Number:
- AC05-00OR22725; SHF-1320603; N14-11-00190
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Concurrency and Computation. Practice and Experience
- Additional Journal Information:
- Journal Volume: 27; Journal Issue: 17; Journal ID: ISSN 1532-0626
- Publisher:
- Wiley
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; automatic software tuning; hardware accelerators; matrix multiplication; power; energy
Citation Formats
Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, and Dongarra, Jack. Experiences in autotuning matrix multiplication for energy minimization on GPUs. United States: N. p., 2015. Web. doi:10.1002/cpe.3516.
Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, & Dongarra, Jack. Experiences in autotuning matrix multiplication for energy minimization on GPUs. United States. https://doi.org/10.1002/cpe.3516
Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, and Dongarra, Jack. 2015. "Experiences in autotuning matrix multiplication for energy minimization on GPUs". United States. https://doi.org/10.1002/cpe.3516. https://www.osti.gov/servlets/purl/1361296.
@article{osti_1361296,
title = {Experiences in autotuning matrix multiplication for energy minimization on GPUs},
author = {Anzt, Hartwig and Haugen, Blake and Kurzak, Jakub and Luszczek, Piotr and Dongarra, Jack},
abstractNote = {In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and optimization, which target runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance-optimal kernel is not the most efficient kernel in overall resource use. Copyright © 2015 John Wiley & Sons, Ltd.},
doi = {10.1002/cpe.3516},
journal = {Concurrency and Computation. Practice and Experience},
number = 17,
volume = 27,
place = {United States},
year = {2015},
month = {May}
}