DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Batched matrix computations on hardware accelerators based on GPUs

Abstract

Scientific applications require solvers that work on many small problems that are independent from each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented; thus, there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present, together with their implementations, are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms, which work under drastically different assumptions about hardware design and about the efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases. The paradigm in which a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to a two-fold speedup and a three-fold improvement in energy efficiency compared with our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Finally, compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve up to a 2.5× speedup on the NVIDIA K40 GPU.
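To make the batched-execution model concrete, here is a minimal sketch of a call to the CUBLAS batched LU routine (cublasDgetrfBatched) that the study uses as its GPU baseline. The matrix size n = 32 and batch count of 1,000 are illustrative placeholders, not values from the paper, and error checking is omitted for brevity:

    /* Minimal sketch (hypothetical sizes): factorize a batch of small
     * LU problems in one CUBLAS call instead of one launch per matrix. */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 32;          /* illustrative small-matrix size */
        const int batch = 1000;    /* illustrative batch count */

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* One contiguous slab holds all matrices; Aptrs[i] points at matrix i. */
        double *d_A, **d_Aptrs;
        int *d_piv, *d_info;
        cudaMalloc((void **)&d_A, sizeof(double) * n * n * batch);
        cudaMalloc((void **)&d_Aptrs, sizeof(double *) * batch);
        cudaMalloc((void **)&d_piv, sizeof(int) * n * batch);
        cudaMalloc((void **)&d_info, sizeof(int) * batch);

        /* Build per-matrix device pointers on the host; copy them over once. */
        double **h_Aptrs = (double **)malloc(sizeof(double *) * batch);
        for (int i = 0; i < batch; ++i)
            h_Aptrs[i] = d_A + (size_t)i * n * n;
        cudaMemcpy(d_Aptrs, h_Aptrs, sizeof(double *) * batch,
                   cudaMemcpyHostToDevice);

        /* ... fill d_A with the batch of column-major n x n matrices ... */

        /* A single call performs the pivoted LU factorization of every
         * matrix in the batch entirely on the GPU. */
        cublasDgetrfBatched(handle, n, d_Aptrs, n, d_piv, d_info, batch);
        cudaDeviceSynchronize();

        free(h_Aptrs);
        cudaFree(d_A); cudaFree(d_Aptrs); cudaFree(d_piv); cudaFree(d_info);
        cublasDestroy(handle);
        return 0;
    }

Per the abstract, the paper's own approach goes a step further: each stage of the blocked factorization (panel factorization, triangular solves, trailing-matrix updates) is itself expressed as a batched BLAS routine, so the entire Cholesky, LU, or QR factorization of the whole batch stays on the GPU.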

Authors:
 Haidar, Azzam [1]; Dong, Tingxing [1]; Luszczek, Piotr [1]; Tomov, Stanimire [1]; Dongarra, Jack [2]
  1. Univ. of Tennessee, Knoxville, TN (United States)
  2. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
Publication Date:
2015-02-09
Research Org.:
Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Russian Scientific Fund (Russian Federation)
Contributing Org.:
Univ. of Manchester (United Kingdom)
OSTI Identifier:
1361289
Grant/Contract Number:  
AC05-00OR22725; ACI-1339822; N14-11-00190
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 29; Journal Issue: 2; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; batched factorization; numerical linear algebra; hardware accelerators; numerical software libraries; one-sided factorization algorithms

Citation Formats

Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, and Dongarra, Jack. Batched matrix computations on hardware accelerators based on GPUs. United States: N. p., 2015. Web. doi:10.1177/1094342014567546.
Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, & Dongarra, Jack. Batched matrix computations on hardware accelerators based on GPUs. United States. https://doi.org/10.1177/1094342014567546
Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, and Dongarra, Jack. 2015. "Batched matrix computations on hardware accelerators based on GPUs". United States. https://doi.org/10.1177/1094342014567546. https://www.osti.gov/servlets/purl/1361289.
@article{osti_1361289,
title = {Batched matrix computations on hardware accelerators based on GPUs},
author = {Haidar, Azzam and Dong, Tingxing and Luszczek, Piotr and Tomov, Stanimire and Dongarra, Jack},
doi = {10.1177/1094342014567546},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 29,
place = {United States},
year = {2015},
month = {feb}
}


Citation Metrics:
Cited by: 32 works
Citation information provided by Web of Science


Works referenced in this record:

LAPACK Users' Guide
software, January 1999


Stability of Methods for Matrix Inversion
journal, January 1992

  • Du Croz, Jeremy J.; Higham, Nicholas J.
  • IMA Journal of Numerical Analysis, Vol. 12, Issue 1
  • DOI: 10.1093/imanum/12.1.1

Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge
journal, March 2012

  • Rotem, Efraim; Naveh, Alon; Ananthakrishnan, Avinash
  • IEEE Micro, Vol. 32, Issue 2
  • DOI: 10.1109/MM.2012.12

Sparsity: Optimization Framework for Sparse Matrix Kernels
journal, February 2004

  • Im, Eun-Jin; Yelick, Katherine; Vuduc, Richard
  • The International Journal of High Performance Computing Applications, Vol. 18, Issue 1
  • DOI: 10.1177/1094342004041296

Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects
journal, July 2009


A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers
conference, May 2012

  • Anderson, Michael J.; Sheffield, David; Keutzer, Kurt
  • 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2012.11

A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU
conference, May 2014

  • Dong, Tingxing; Dobrev, Veselin; Kolev, Tzanio
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.103

Works referencing / citing this record:

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
conference, January 2017

  • Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire
  • Proceedings of the International Conference on Supercomputing - ICS '17
  • DOI: 10.1145/3079079.3079103

Hierarchical approach for deriving a reproducible unblocked LU factorization
journal, January 2019

  • Iakymchuk, Roman; Graillat, Stef; Defour, David
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 5
  • DOI: 10.1177/1094342019832968

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
journal, January 2016