A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

Book · January 1, 2015

OSTI ID:1261481

Dong, Tingzing Tim [1]; Tomov, Stanimire Z [2]; Luszczek, Piotr R [2]; Dongarra, Jack J [2]

[1] University of Tennessee (UT)
[2] ORNL

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to hybrid CPU-GPU algorithms, which rely heavily on the multicore CPU for specific parts of the workload. For a system to benefit fully from the GPU's significantly higher energy efficiency, however, avoiding the multicore CPU must be a primary design goal, so that the system can rely more heavily on the more efficient GPU. This also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement over our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5x speedup on the K40 GPU.
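The idea of a batched factorization expressed as a sequence of batch-wide operations can be illustrated with a minimal sketch. This is not the authors' GPU implementation: it runs on the CPU with NumPy, uses unblocked Householder reflectors rather than the block Householder form, and the function name `batched_householder_qr` is a hypothetical name chosen for illustration. Each step in the loop is applied to every matrix in the batch at once, which loosely mirrors issuing one batched BLAS call per step.

```python
import numpy as np

def batched_householder_qr(A):
    """Factor a batch of small dense matrices so that A[b] = Q[b] @ R[b].

    A has shape (batch, m, n) with m >= n. Illustrative CPU sketch only.
    """
    A = np.array(A, dtype=np.float64)          # work on a copy of the input
    batch, m, n = A.shape
    Q = np.tile(np.eye(m), (batch, 1, 1))      # one identity matrix per problem
    for k in range(min(m - 1, n)):
        x = A[:, k:, k]                        # k-th column below the diagonal
        norm_x = np.linalg.norm(x, axis=1)
        # Pick the sign that avoids cancellation when forming the reflector.
        sign = np.where(x[:, 0] >= 0, 1.0, -1.0)
        v = x.copy()
        v[:, 0] += sign * norm_x
        vnorm = np.linalg.norm(v, axis=1, keepdims=True)
        # A zero column yields v = 0, so the reflector degenerates to identity.
        v = v / np.where(vnorm == 0, 1.0, vnorm)
        # Apply H = I - 2 v v^T to the trailing submatrix of every problem:
        # a batched matrix-vector product followed by a batched rank-1 update.
        w = np.einsum('bi,bij->bj', v, A[:, k:, k:])
        A[:, k:, k:] -= 2.0 * np.einsum('bi,bj->bij', v, w)
        # Accumulate Q <- Q H; only columns k..m-1 of Q change.
        u = np.einsum('bij,bj->bi', Q[:, :, k:], v)
        Q[:, :, k:] -= 2.0 * np.einsum('bi,bj->bij', u, v)
    R = np.triu(A)                             # clear round-off below the diagonal
    return Q, R

# Usage: factor 1000 independent 16 x 8 problems in one call.
rng = np.random.default_rng(0)
batch_A = rng.standard_normal((1000, 16, 8))
Q, R = batched_householder_qr(batch_A)
```

The loop runs over columns, not over problems, so the number of Python-level steps is independent of the batch size. A GPU-resident version would replace each `einsum` with a batched kernel and keep all data on the device.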

OSTI does not have a digital full text copy available.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1261481

Country of Publication: United States

Language: English

Similar Records

Batched matrix computations on hardware accelerators based on GPUs

Journal Article · February 9, 2015 · International Journal of High Performance Computing Applications · OSTI ID:1261481

Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; +2 more

Towards Batched Linear Solvers on Accelerated Hardware Platforms

Book · January 1, 2015 · OSTI ID:1261481

Haidar, Azzam; Dong, Tingzing Tim; Tomov, Stanimire; +1 more

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

Journal Article · October 1, 2014 · Concurrency and Computation: Practice and Experience · OSTI ID:1261481

Song, Fengguang; Dongarra, Jack
