DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Batched matrix computations on hardware accelerators based on GPUs

Abstract

Scientific applications require solvers that work on many small problems that are independent from each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented; thus, there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present, together with their implementations, are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms, which work under drastically different assumptions about hardware design and about the efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases. The paradigm in which a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to a two-fold speedup and a three-fold improvement in energy efficiency compared with our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Finally, compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve up to a 2.5× speedup on the NVIDIA K40 GPU.
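To make the batched-execution model concrete, here is a minimal sketch of a call to the CUBLAS batched LU routine (cublasDgetrfBatched) that the study uses as its GPU baseline. The matrix size n = 32 and batch count of 1,000 are illustrative placeholders, not values from the paper, and error checking is omitted for brevity:

    /* Minimal sketch (hypothetical sizes): factorize a batch of small
     * LU problems in one CUBLAS call instead of one launch per matrix. */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 32;          /* illustrative small-matrix size */
        const int batch = 1000;    /* illustrative batch count */

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* One contiguous slab holds all matrices; Aptrs[i] points at matrix i. */
        double *d_A, **d_Aptrs;
        int *d_piv, *d_info;
        cudaMalloc((void **)&d_A, sizeof(double) * n * n * batch);
        cudaMalloc((void **)&d_Aptrs, sizeof(double *) * batch);
        cudaMalloc((void **)&d_piv, sizeof(int) * n * batch);
        cudaMalloc((void **)&d_info, sizeof(int) * batch);

        /* Build per-matrix device pointers on the host; copy them over once. */
        double **h_Aptrs = (double **)malloc(sizeof(double *) * batch);
        for (int i = 0; i < batch; ++i)
            h_Aptrs[i] = d_A + (size_t)i * n * n;
        cudaMemcpy(d_Aptrs, h_Aptrs, sizeof(double *) * batch,
                   cudaMemcpyHostToDevice);

        /* ... fill d_A with the batch of column-major n x n matrices ... */

        /* A single call performs the pivoted LU factorization of every
         * matrix in the batch entirely on the GPU. */
        cublasDgetrfBatched(handle, n, d_Aptrs, n, d_piv, d_info, batch);
        cudaDeviceSynchronize();

        free(h_Aptrs);
        cudaFree(d_A); cudaFree(d_Aptrs); cudaFree(d_piv); cudaFree(d_info);
        cublasDestroy(handle);
        return 0;
    }

Per the abstract, the paper's own approach goes a step further: each stage of the blocked factorization (panel factorization, triangular solves, trailing-matrix updates) is itself expressed as a batched BLAS routine, so the entire Cholesky, LU, or QR factorization of the whole batch stays on the GPU.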

Authors:
 Haidar, Azzam [1]; Dong, Tingxing [1]; Luszczek, Piotr [1]; Tomov, Stanimire [1]; Dongarra, Jack [2]
  1. Univ. of Tennessee, Knoxville, TN (United States)
  2. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
Publication Date:
2015-02-09
Research Org.:
Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Russian Scientific Fund (Russian Federation)
Contributing Org.:
Univ. of Manchester (United Kingdom)
OSTI Identifier:
1361289
Grant/Contract Number:  
AC05-00OR22725; ACI-1339822; N14-11-00190
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 29; Journal Issue: 2; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; batched factorization; numerical linear algebra; hardware accelerators; numerical software libraries; one-sided factorization algorithms

Citation Formats

Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, and Dongarra, Jack. Batched matrix computations on hardware accelerators based on GPUs. United States: N. p., 2015. Web. doi:10.1177/1094342014567546.
Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, & Dongarra, Jack. Batched matrix computations on hardware accelerators based on GPUs. United States. https://doi.org/10.1177/1094342014567546
Haidar, Azzam, Dong, Tingxing, Luszczek, Piotr, Tomov, Stanimire, and Dongarra, Jack. 2015. "Batched matrix computations on hardware accelerators based on GPUs". United States. https://doi.org/10.1177/1094342014567546. https://www.osti.gov/servlets/purl/1361289.
@article{osti_1361289,
title = {Batched matrix computations on hardware accelerators based on GPUs},
author = {Haidar, Azzam and Dong, Tingxing and Luszczek, Piotr and Tomov, Stanimire and Dongarra, Jack},
doi = {10.1177/1094342014567546},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 29,
place = {United States},
year = {2015},
month = {feb}
}


Citation Metrics:
Cited by: 32 works
Citation information provided by Web of Science


Works referenced in this record:

LAPACK Users' Guide
software, January 1999


Stability of Methods for Matrix Inversion
journal, January 1992

  • Du Croz, Jeremy J.; Higham, Nicholas J.
  • IMA Journal of Numerical Analysis, Vol. 12, Issue 1
  • DOI: 10.1093/imanum/12.1.1

Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge
journal, March 2012

  • Rotem, Efraim; Naveh, Alon; Ananthakrishnan, Avinash
  • IEEE Micro, Vol. 32, Issue 2
  • DOI: 10.1109/MM.2012.12

Sparsity: Optimization Framework for Sparse Matrix Kernels
journal, February 2004

  • Im, Eun-Jin; Yelick, Katherine; Vuduc, Richard
  • The International Journal of High Performance Computing Applications, Vol. 18, Issue 1
  • DOI: 10.1177/1094342004041296

Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects
journal, July 2009


A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers
conference, May 2012

  • Anderson, Michael J.; Sheffield, David; Keutzer, Kurt
  • 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2012.11

A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU
conference, May 2014

  • Dong, Tingxing; Dobrev, Veselin; Kolev, Tzanio
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.103

Works referencing / citing this record:

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
conference, January 2017

  • Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire
  • Proceedings of the International Conference on Supercomputing - ICS '17
  • DOI: 10.1145/3079079.3079103

Hierarchical approach for deriving a reproducible unblocked LU factorization
journal, January 2019

  • Iakymchuk, Roman; Graillat, Stef; Defour, David
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 5
  • DOI: 10.1177/1094342019832968

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
journal, January 2016