OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Towards Batched Linear Solvers on Accelerated Hardware Platforms

Abstract

As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) as needed to process a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with the profiling and tracing tools, guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.

Authors:
 Haidar, Azzam [1];  Dong, Tingzing Tim [1];  Tomov, Stanimire [1];  Dongarra, Jack J. [2]
  1. University of Tennessee (UT)
  2. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1261494
DOE Contract Number:
AC05-00OR22725
Resource Type:
Book
Country of Publication:
United States
Language:
English

Citation Formats

Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, and Dongarra, Jack J. Towards Batched Linear Solvers on Accelerated Hardware Platforms. United States: N. p., 2015. Web. doi:10.1145/2688500.2688534.
Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, & Dongarra, Jack J. Towards Batched Linear Solvers on Accelerated Hardware Platforms. United States. doi:10.1145/2688500.2688534.
Haidar, Azzam, Dong, Tingzing Tim, Tomov, Stanimire, and Dongarra, Jack J. 2015. "Towards Batched Linear Solvers on Accelerated Hardware Platforms". United States. doi:10.1145/2688500.2688534.
@article{osti_1261494,
title = {Towards Batched Linear Solvers on Accelerated Hardware Platforms},
author = {Haidar, Azzam and Dong, Tingzing Tim and Tomov, Stanimire and Dongarra, Jack J},
abstractNote = {As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) as needed to process a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with the profiling and tracing tools, guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.},
doi = {10.1145/2688500.2688534},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2015},
month = jan
}

Book:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this book.
