Towards Batched Linear Solvers on Accelerated Hardware Platforms

Haidar, Azzam; Dong, Tingzing Tim; Tomov, Stanimire; Dongarra, Jack J

doi:10.1145/2688500.2688534

Book · Wed Dec 31 23:00:00 EST 2014

DOI:https://doi.org/10.1145/2688500.2688534· OSTI ID:1261494

Haidar, Azzam ^[1]; Dong, Tingzing Tim ^[1]; Tomov, Stanimire ^[1]; Dongarra, Jack J ^[2]

[1] University of Tennessee (UT)
[2] ORNL

As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) needed to factorize a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. This is similar in functionality to the LAPACK and hybrid MAGMA algorithms for large-matrix factorizations, but it differs from a straightforward approach in which each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations that achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.
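The idea of expressing a factorization as a sequence of batched routines can be illustrated with a minimal pure-Python sketch of an unblocked Cholesky factorization. This is not the authors' implementation: the function names (`batched_sqrt_diag`, `batched_scale_column`, `batched_rank1_update`, `batched_cholesky`) are hypothetical, the batch is assumed to hold equally sized symmetric positive definite matrices stored as nested lists, and each "kernel" is an ordinary CPU loop standing in for what would be a single GPU kernel launch over the whole batch.

```python
import math


def batched_sqrt_diag(batch, k):
    """Step 1 (pivot step): A[k][k] <- sqrt(A[k][k]) for every matrix."""
    for A in batch:
        if A[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        A[k][k] = math.sqrt(A[k][k])


def batched_scale_column(batch, k):
    """Step 2 (triangular-solve-like): scale column k below the diagonal."""
    for A in batch:
        pivot = A[k][k]
        for i in range(k + 1, len(A)):
            A[i][k] /= pivot


def batched_rank1_update(batch, k):
    """Step 3 (rank-1 update): update the trailing lower triangle."""
    for A in batch:
        n = len(A)
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]


def batched_cholesky(batch):
    """Factor every matrix in `batch` in place and return the lower factors.

    The outer loop runs over factorization steps, and each step touches the
    entire batch. This is the "sequence of batched routines" structure, as
    opposed to finishing one matrix before starting the next.
    """
    n = len(batch[0])
    if any(len(A) != n for A in batch):
        raise ValueError("this sketch assumes all matrices have the same size")
    for k in range(n):
        batched_sqrt_diag(batch, k)
        batched_scale_column(batch, k)
        batched_rank1_update(batch, k)
    # Clear the strict upper triangle so each result is exactly L.
    for A in batch:
        for i in range(n):
            for j in range(i + 1, n):
                A[i][j] = 0.0
    return batch


if __name__ == "__main__":
    matrices = [
        [[4.0, 2.0], [2.0, 3.0]],
        [[9.0, 3.0], [3.0, 5.0]],
    ]
    for L in batched_cholesky(matrices):
        print(L)
```

The loop order is the design point: every step is applied to all matrices before the next step begins, so on a GPU each step could map to one kernel launch with enough parallel work to occupy the device even when individual matrices are small.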

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1261494

Country of Publication: United States

Language: English

Similar Records

Batched matrix computations on hardware accelerators based on GPUs

Journal Article · Sun Feb 08 23:00:00 EST 2015 · International Journal of High Performance Computing Applications · OSTI ID:1361289

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

Book · Wed Dec 31 23:00:00 EST 2014 · OSTI ID:1261481

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Journal Article · Fri Jul 01 00:00:00 EDT 2016 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1565512
