OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Resource Type: Journal Article (Publisher's Accepted Manuscript)
Journal Name: Journal of Parallel and Distributed Computing
Journal Volume: 75; Journal Issue: C; Journal ID: ISSN 0743-7315
Country of Publication: United States

Citation Formats

Jhurani, Chetan, and Mullowney, Paul. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. United States: N. p., 2015. Web. doi:10.1016/j.jpdc.2014.09.003.
Jhurani, Chetan, & Mullowney, Paul. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. United States. doi:10.1016/j.jpdc.2014.09.003.
Jhurani, Chetan, and Mullowney, Paul. 2015. "A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices". United States. doi:10.1016/j.jpdc.2014.09.003.
@article{jhurani_mullowney_2015,
  title   = {A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices},
  author  = {Jhurani, Chetan and Mullowney, Paul},
  journal = {Journal of Parallel and Distributed Computing},
  volume  = {75},
  number  = {C},
  doi     = {10.1016/j.jpdc.2014.09.003},
  place   = {United States},
  year    = {2015},
  month   = {1}
}

Journal Article: Free Publicly Available Full Text (Publisher's Version of Record at doi:10.1016/j.jpdc.2014.09.003)

Citation Metrics:
Cited by: 4 works
Citation information provided by
Web of Science

Similar Records:
  • Wang-Landau sampling is implemented on the Graphics Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Performance on three different GPU cards, including a card based on the new-generation Fermi architecture, is compared with that on a Central Processing Unit (CPU). The parameters for massively parallel Wang-Landau sampling are tuned to achieve fast convergence. For simulations of water cluster systems, we obtain an average speedup of over 50 times for a given workload.
  • This paper presents parallelization strategies for the radial basis function-finite difference (RBF-FD) method. As a generalized finite-differencing scheme, the RBF-FD method works without underlying meshes to structure the nodes. It offers high-order accuracy and scales as O(N) per time step, with N the total number of nodes. To our knowledge, this is the first implementation of the RBF-FD method to leverage GPU accelerators for the solution of PDEs, and the first to span both multiple CPUs and multiple GPUs. OpenCL kernels target the GPUs, and inter-processor communication and synchronization are managed by the Message Passing Interface (MPI). We verify our implementation of the RBF-FD method with two hyperbolic PDEs on the sphere, and demonstrate up to 9x speedup on a commodity GPU with unoptimized kernel implementations. On a high-performance cluster, the method achieves up to 7x speedup for the maximum problem size of 27,556 nodes.
  • The ultrafast decay dynamics of 4-(N,N-dimethylamino)benzonitrile (DMABN) following photoexcitation was studied with the ab initio multiple spawning (AIMS) method, combined with GPU-accelerated linear-response time-dependent density functional theory (LR-TDDFT). We validate the LR-TDDFT method for this case and then present a detailed analysis of the first ≈200 fs of DMABN excited-state dynamics. Almost complete nonadiabatic population transfer from S2 (the initially populated bright state) to S1 takes place in less than 50 fs, without significant torsion of the dimethylamino (DMA) group. Significant torsion of the DMA group is only observed after the nuclear wavepacket reaches S1 and acquires locally excited electronic character. Our results show that torsion of the DMA group is not a prerequisite for nonadiabatic transitions in DMABN, although such motion is indeed relevant on the lowest excited state (S1).
  • Commodity clusters augmented with application accelerators are evolving into competitive high-performance computing systems. The Graphics Processing Unit (GPU), with its very high arithmetic density and performance-per-price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementations of the Basic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, we study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of memory copies between the host and the GPU device. The performance of the single- and double-precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results are compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
  • Kepler is the newest GPU architecture from NVIDIA, and the GTX 680 is the first commercially available graphics card based on that architecture. Matrix multiplication is a canonical computational kernel, and often the main target of initial optimization efforts for a new chip. This article presents preliminary results of automatically tuning matrix multiplication kernels for the Kepler architecture using the GTX 680 card.
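The Wang-Landau record above names the algorithm without showing its update rule. A minimal sketch, assuming a toy system (the sum of two six-sided dice, not the paper's water clusters): a move is accepted with probability min(1, g(E_old)/g(E_new)), the visited energy's ln g is raised by ln f, and ln f is halved whenever the visit histogram is roughly flat.

```python
# Hedged sketch of Wang-Landau sampling on a toy two-dice system; the
# batch size, flatness threshold, and stopping value of ln f are
# illustrative choices, not values from the cited work.
import math
import random

random.seed(0)
energies = range(2, 13)               # possible sums of two dice
ln_g = {E: 0.0 for E in energies}     # running estimate of ln g(E)
hist = {E: 0 for E in energies}       # visit histogram for flatness test
state = [1, 1]                        # current die faces
ln_f = 1.0                            # modification factor ln f

while ln_f > 1e-4:
    for _ in range(20000):
        die = random.randrange(2)
        new_face = random.randint(1, 6)
        E_old = sum(state)
        E_new = E_old - state[die] + new_face
        # Wang-Landau acceptance: biases the walk toward rare energies
        if ln_g[E_new] <= ln_g[E_old] or \
           random.random() < math.exp(ln_g[E_old] - ln_g[E_new]):
            state[die] = new_face
        E = sum(state)
        ln_g[E] += ln_f
        hist[E] += 1
    # crude flatness check: refine ln f once every energy is well visited
    if min(hist.values()) > 0.8 * (sum(hist.values()) / len(hist)):
        ln_f /= 2.0
        hist = {E: 0 for E in energies}
```

The converged ln_g values are defined only up to an additive constant, so only ratios such as g(7)/g(2) (exactly 6 for two dice) are meaningful; the GPU versions discussed above parallelize many such walkers at once.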
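The RBF-FD record above can be made concrete with a small sketch: derivative weights at a node come from solving one small dense system A w = b per stencil, where A_jk = phi(|x_j - x_k|) and b holds the target operator applied to phi at the stencil center. The Gaussian basis, 1-D five-point stencil, spacing h, and shape parameter eps below are all illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of computing RBF-FD weights for d^2/dx^2 on a 1-D stencil
# with a Gaussian basis phi(r) = exp(-(eps*r)^2); stencil and eps are
# arbitrary illustrative choices.
import numpy as np

def rbf_fd_weights_d2(nodes, center, eps):
    """Weights w such that w @ f(nodes) approximates f''(center)."""
    r = nodes[:, None] - nodes[None, :]
    A = np.exp(-(eps * r) ** 2)               # RBF interpolation matrix
    d = center - nodes
    # d^2/dx^2 of exp(-(eps*(x - x_j))^2), evaluated at x = center
    b = (4 * eps**4 * d**2 - 2 * eps**2) * np.exp(-(eps * d) ** 2)
    return np.linalg.solve(A, b)

h = 0.1
nodes = h * np.arange(-2, 3, dtype=float)     # 5-point stencil at 0
w = rbf_fd_weights_d2(nodes, 0.0, eps=0.5)
approx = w @ nodes**2                         # apply to f(x) = x^2, f'' = 2
```

Because each stencil yields an independent small solve and an independent sparse row, the method parallelizes naturally across nodes, which is what the GPU/MPI implementation above exploits.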
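The batched-GEMM idea behind the title article, and the GEMM measurements in the CUBLAS benchmarking record above, can be sketched in NumPy: stacking many small matrices lets one call over a 3-D array replace a loop of tiny GEMMs, which is the same pattern batched GPU interfaces expose. The sizes and batch count below are arbitrary.

```python
# Hedged sketch: many independent small products C_i = A_i @ B_i computed
# once per pair versus as a single batched call over stacked operands.
import numpy as np

rng = np.random.default_rng(0)
batch, m, k, n = 1000, 8, 8, 8                  # 1000 small 8x8 problems
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# One GEMM per pair: correct, but per-call overhead dominates on a GPU
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

# Batched form: matmul broadcasts over the leading (batch) axis
C_batched = np.matmul(A, B)
```

On a GPU the per-call launch cost is far larger than an 8x8 multiply, which is why a single batched interface over all the small matrices is the performance-relevant design.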
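The auto-tuning loop described in the Kepler record above can be sketched in miniature: enumerate kernel variants, time each on the target, and keep the fastest. Here the "variants" are tile sizes for a blocked matrix multiply in NumPy; a real tuner for the GTX 680 would instead sweep launch geometry, unrolling, and shared-memory usage on the GPU.

```python
# Hedged sketch of empirical auto-tuning: brute-force search over tile
# sizes for a blocked matmul; candidate tiles and problem size are
# illustrative choices.
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Blocked C = A @ B; `tile` must divide the matrix dimension."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(0)
n = 128
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

best_tile, best_time = None, float("inf")
for tile in (16, 32, 64, 128):        # candidate kernel variants
    t0 = time.perf_counter()
    C = blocked_matmul(A, B, tile)
    dt = time.perf_counter() - t0
    if dt < best_time:
        best_tile, best_time = tile, dt
```

Every variant computes the same result, so the tuner only trades time, not correctness; that invariant is what makes exhaustive empirical search safe for a new architecture.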