DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Experiences in autotuning matrix multiplication for energy minimization on GPUs

Abstract

In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance-optimal case ends up not being the most efficient kernel in overall resource use. Copyright © 2015 John Wiley & Sons, Ltd.

Authors:
Anzt, Hartwig [1]; Haugen, Blake [1]; Kurzak, Jakub [1]; Luszczek, Piotr [1]; Dongarra, Jack [2]
  1. Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science
  2. Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
Publication Date:
2015-05-20
Research Org.:
Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Intel Corporation (United States); Advanced Micro Devices, Inc. (AMD) (United States); Russian Scientific Fund (Russian Federation)
Contributing Org.:
Univ. of Manchester (United Kingdom)
OSTI Identifier:
1361296
Alternate Identifier(s):
OSTI ID: 1401625
Grant/Contract Number:  
AC05-00OR22725; SHF-1320603; N14-11-00190
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Volume: 27; Journal Issue: 17; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; automatic software tuning; hardware accelerators; matrix multiplication; power; energy

Citation Formats

Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, and Dongarra, Jack. Experiences in autotuning matrix multiplication for energy minimization on GPUs. United States: N. p., 2015. Web. doi:10.1002/cpe.3516.
Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, & Dongarra, Jack. Experiences in autotuning matrix multiplication for energy minimization on GPUs. United States. https://doi.org/10.1002/cpe.3516
Anzt, Hartwig, Haugen, Blake, Kurzak, Jakub, Luszczek, Piotr, and Dongarra, Jack. 2015. "Experiences in autotuning matrix multiplication for energy minimization on GPUs". United States. https://doi.org/10.1002/cpe.3516. https://www.osti.gov/servlets/purl/1361296.
@article{osti_1361296,
title = {Experiences in autotuning matrix multiplication for energy minimization on GPUs},
author = {Anzt, Hartwig and Haugen, Blake and Kurzak, Jakub and Luszczek, Piotr and Dongarra, Jack},
abstractNote = {In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance-optimal case ends up not being the most efficient kernel in overall resource use. Copyright © 2015 John Wiley & Sons, Ltd.},
doi = {10.1002/cpe.3516},
journal = {Concurrency and Computation. Practice and Experience},
number = 17,
volume = 27,
place = {United States},
year = {2015},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 10 works (citation information provided by Web of Science)

Works referenced in this record:

Auto-tuning a high-level language targeted to GPU codes
conference, May 2012


Improving power efficiency of dense linear algebra algorithms on multi-core processors via slack control
conference, July 2011

  • Alonso, Pedro; Dolz, Manuel F.; Mayo, Rafael
  • 2011 International Conference on High Performance Computing & Simulation (HPCS)
  • DOI: 10.1109/HPCSim.2011.5999861

Fast implementation of DGEMM on Fermi GPU
conference, January 2011

  • Tan, Guangming; Li, Linchuan; Triechle, Sean
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
  • DOI: 10.1145/2063384.2063431

A new energy aware performance metric
journal, July 2010

  • Bekas, Costas; Curioni, Alessandro
  • Computer Science - Research and Development, Vol. 25, Issue 3-4
  • DOI: 10.1007/s00450-010-0119-z

Autotuning Stencil-Based Computations on GPUs
conference, September 2012

  • Mametjanov, Azamat; Lowell, Daniel; Ma, Ching-Chen
  • 2012 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2012.46

Energy-efficient execution of dense linear algebra algorithms on multi-core processors
journal, May 2012


Search Space Pruning Constraints Visualization
conference, September 2014

  • Haugen, Blake; Kurzak, Jakub
  • 2014 Second IEEE Working Conference on Software Visualization (VISSOFT)
  • DOI: 10.1109/VISSOFT.2014.15

Quantifying the energy cost of data movement in scientific applications
conference, September 2013

  • Kestor, Gokcen; Gioiosa, Roberto; Kerbyson, Darren J.
  • 2013 IEEE International Symposium on Workload Characterization (IISWC)
  • DOI: 10.1109/IISWC.2013.6704670

An Improved MAGMA GEMM for Fermi Graphics Processing Units
journal, September 2010

  • Nath, Rajib; Tomov, Stanimire; Dongarra, Jack
  • The International Journal of High Performance Computing Applications, Vol. 24, Issue 4
  • DOI: 10.1177/1094342010385729

Input-aware auto-tuning for directive-based GPU programming
conference, January 2013

  • Magni, Alberto; Grewe, Dominik; Johnson, Nick
  • Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units - GPGPU-6
  • DOI: 10.1145/2458523.2458530

Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors
journal, September 2014

  • Aliaga, José I.; Anzt, Hartwig; Castillo, Maribel
  • Concurrency and Computation: Practice and Experience, Vol. 27, Issue 4
  • DOI: 10.1002/cpe.3341

The LINPACK Benchmark: past, present and future
journal, January 2003

  • Dongarra, Jack J.; Luszczek, Piotr; Petitet, Antoine
  • Concurrency and Computation: Practice and Experience, Vol. 15, Issue 9
  • DOI: 10.1002/cpe.728

Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture - GeForce GTX 680
report, April 2012


Improving the energy efficiency of sparse linear system solvers on multicore and manycore systems
journal, June 2014

  • Anzt, H.; Quintana-Ortí, E. S.
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 372, Issue 2018
  • DOI: 10.1098/rsta.2013.0279

Algorithmic Time, Energy, and Power on Candidate HPC Compute Building Blocks
conference, May 2014

  • Choi, Jee; Dukhan, Marat; Liu, Xing
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.54

PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications
journal, May 2010

  • Ge, Rong; Feng, Xizhou; Song, Shuaiwen
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 21, Issue 5
  • DOI: 10.1109/TPDS.2009.76

Energy Efficient Scheduling of Real-Time Tasks on Multicore Processors
journal, November 2008

  • Seo, Euiseong
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 19, Issue 11
  • DOI: 10.1109/TPDS.2008.104

RAPL: memory power estimation and capping
conference, January 2010

  • David, Howard; Gorbatov, Eugene; Hanebutte, Ulf R.
  • Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design - ISLPED '10
  • DOI: 10.1145/1840845.1840883

Power emulation based DVFS efficiency investigations for embedded systems
conference, September 2010

  • Genser, Andreas; Bachmann, Christian; Steger, Christian
  • 2010 International Symposium on System on Chip (SoC)
  • DOI: 10.1109/ISSOC.2010.5625559

Model-driven autotuning of sparse matrix-vector multiply on GPUs
journal, May 2010


Understanding the Energy Consumption of Dynamic Random Access Memories
conference, December 2010

  • Vogelsang, Thomas
  • 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • DOI: 10.1109/MICRO.2010.42

A survey of architectural techniques for DRAM power management
journal, January 2012


Resource-conscious scheduling for energy efficiency on multicore processors
conference, January 2010

  • Merkel, Andreas; Stoess, Jan; Bellosa, Frank
  • Proceedings of the 5th European conference on Computer systems - EuroSys '10
  • DOI: 10.1145/1755913.1755930

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
conference, November 2012

  • Matsumoto, Kazuya; Nakasato, Naohito; Sedukhin, Stanislav G.
  • 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)
  • DOI: 10.1109/SC.Companion.2012.59

Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
book, January 2010

  • Monakov, Alexander; Lokhmotov, Anton; Avetisyan, Arutyun
  • High Performance Embedded Architectures and Compilers
  • DOI: 10.1007/978-3-642-11515-8_10

Works referencing / citing this record:

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
conference, January 2017

  • Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire
  • Proceedings of the International Conference on Supercomputing - ICS '17
  • DOI: 10.1145/3079079.3079103

BOAST: A metaprogramming framework to produce portable and efficient computing kernels for HPC applications
journal, August 2017

  • Videau, Brice; Pouget, Kevin; Genovese, Luigi
  • The International Journal of High Performance Computing Applications, Vol. 32, Issue 1
  • DOI: 10.1177/1094342017718068