Towards Batched Linear Solvers on Accelerated Hardware Platforms
Abstract
As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations, but it differs from a straightforward approach in which each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.
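The batched idea in the abstract can be illustrated with a minimal sketch. This is not the paper's CUDA code; it is a pure-Python stand-in showing the pattern a batched BLAS routine expresses: the same small unblocked factorization (here a Cholesky) applied independently to every matrix in a batch. All names below are illustrative.

```python
def cholesky(a):
    """Unblocked Cholesky of one small SPD matrix (list of rows).
    Returns the lower-triangular factor L with A = L * L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = s ** 0.5
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def batched_cholesky(batch):
    # Each problem is independent; in the paper's design these would be
    # dispatched concurrently on the GPU rather than looped over serially.
    return [cholesky(a) for a in batch]

# Factor a batch of small SPD matrices in one call.
factors = batched_cholesky([[[4.0, 2.0], [2.0, 3.0]],
                            [[9.0, 0.0], [0.0, 1.0]]])
```

The point of the batched formulation is that the outer loop over problems, not the inner factorization, is the axis of parallelism, which is what makes it profitable for many small matrices.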
 Authors:
 Haidar, Azzam; Dong, Tingzing Tim; Tomov, Stanimire; Dongarra, Jack J.
 University of Tennessee (UT); Oak Ridge National Lab. (ORNL)
 Publication Date:
 2015
 Research Org.:
 Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
 Sponsoring Org.:
 USDOE
 OSTI Identifier:
 1261494
 DOE Contract Number:
 AC05-00OR22725
 Resource Type:
 Book
 Country of Publication:
 United States
 Language:
 English
Citation Formats
Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, and Dongarra, Jack J. Towards Batched Linear Solvers on Accelerated Hardware Platforms. United States: N. p., 2015.
Web. doi:10.1145/2688500.2688534.
Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, & Dongarra, Jack J. Towards Batched Linear Solvers on Accelerated Hardware Platforms. United States. doi:10.1145/2688500.2688534.
Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, and Dongarra, Jack J. 2015.
"Towards Batched Linear Solvers on Accelerated Hardware Platforms". United States.
doi:10.1145/2688500.2688534.
@article{osti_1261494,
title = {Towards Batched Linear Solvers on Accelerated Hardware Platforms},
author = {Haidar, Azzam and Dong, Tingzing Tim and Tomov, Stanimire and Dongarra, Jack J},
abstractNote = {As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations, but it differs from a straightforward approach in which each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.},
doi = {10.1145/2688500.2688534},
place = {United States},
year = 2015,
month = 1
}

This chapter presents the implementation of a batched CUDA solver based on LU factorization for small linear systems. Such a solver may be used in applications such as reactive-flow transport models, which apply the Newton-Raphson technique to linearize and iteratively solve the sets of nonlinear equations that represent the reactions at tens of thousands to millions of physical locations. The implementation exploits somewhat counterintuitive GPGPU programming techniques: it assigns the solution of a matrix (representing a system) to a single CUDA thread, does not exploit shared memory, and employs dynamic memory allocation on the GPU. These techniques enable our …
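The thread-per-matrix design described above can be pictured as the work one CUDA thread performs on its own small system. The routine below is a hypothetical pure-Python stand-in for that per-thread work (the chapter's actual solver is CUDA): LU factorization with partial pivoting fused with the forward and backward triangular solves.

```python
def lu_solve(a, b):
    """Solve A x = b for one small dense system via LU with partial
    pivoting. In a thread-per-matrix batched design, each thread would
    run the equivalent of this routine on its own system."""
    n = len(a)
    a = [row[:] for row in a]   # work on copies; A is overwritten by L\U
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining pivot to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal, updating b in step
        # (forward substitution fused with the factorization).
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Backward substitution on the upper-triangular factor.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x
```

Because every "thread" owns a whole system, there is no inter-thread synchronization, which is the appeal of this mapping for very small matrices, at the cost of poor memory coalescing for larger ones.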

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient and high-performance solvers is to design them to work on many small and independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is …
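As a rough illustration of the Householder QR this abstract refers to, the sketch below (an illustrative pure-Python stand-in, not the batched GPU kernels) factors one small matrix by applying a reflector per column and accumulating their product.

```python
def householder_qr(a):
    """QR of a small m-by-n matrix (list of rows) via Householder
    reflectors. Returns (q, r) with A = Q R and Q orthogonal."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    # qt accumulates the product of reflectors H_k ... H_1 = Q^T.
    qt = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Householder vector v annihilating column k below the diagonal.
        x = [r[i][k] for i in range(k, m)]
        norm = sum(t * t for t in x) ** 0.5
        if norm == 0.0:
            continue
        v = x[:]
        v[0] += norm if x[0] >= 0.0 else -norm  # sign choice avoids cancellation
        vv = sum(t * t for t in v)
        # Apply H = I - 2 v v^T / (v^T v) from the left to R and to Q^T.
        for mat, cols in ((r, n), (qt, m)):
            for j in range(cols):
                s = sum(v[i] * mat[k + i][j] for i in range(m - k))
                for i in range(m - k):
                    mat[k + i][j] -= 2.0 * s * v[i] / vv
    # Q is the transpose of the accumulated Q^T.
    q = [[qt[j][i] for j in range(m)] for i in range(m)]
    return q, r
```

In a batched setting, this whole routine runs once per matrix in the batch; the block Householder formulation in the title groups several reflectors so the update can be cast as small matrix-matrix products.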
Towards polyalgorithmic linear system solvers for nonlinear elliptic problems
The authors investigate the performance of several preconditioned conjugate-gradient-like algorithms and a standard stationary iterative method (block-line successive over-relaxation (SOR)) on linear systems of equations that arise from a nonlinear elliptic flame-sheet problem simulation. The nonlinearity forces a pseudo-transient continuation process that makes the problem parabolic and thus compacts the spectrum of the Jacobian matrix, so that simple relaxation methods are viable in the initial stages of the solution process. However, because of the transition from parabolic to elliptic character as the time step is increased in pursuit of the steady-state solution, the performance of the candidate linear solvers …
Power/Performance Tradeoffs of Small Batched LU Based Solvers on GPUs
In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphics Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelism and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp-level (one matrix, one Warp) parallelism and shared memory. The second works at Threadblock-level parallelism (one matrix, one Threadblock), still exploiting shared memory but managing matrices up to 76x76. The third is Thread-level …
Batched matrix computations on hardware accelerators based on GPUs
Scientific applications require solvers that work on many small-size problems that are independent from each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common, one-sided …
Cited by 8