On the performance and energy efficiency of sparse linear algebra on GPUs

Anzt, Hartwig; Tomov, Stanimire; Dongarra, Jack

doi:10.1177/1094342016672081


Journal Article · October 5, 2016 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342016672081 · OSTI ID: 1437692

Anzt, Hartwig ^[1]; Tomov, Stanimire ^[1]; Dongarra, Jack ^[2]

[1] University of Tennessee, Knoxville, USA
[2] University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; University of Manchester, UK

In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
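To illustrate why a block-wise SpMM can exceed the memory-bound limits of an SpMV, the following is a minimal sketch in plain Python, not the paper's GPU kernel and not the cuSPARSE, MAGMA, or MKL implementation. It assumes the CSR storage format, and the function names (`spmv_csr`, `spmm_csr`, `flops_per_matrix_byte`) are hypothetical. The byte-count model is also an assumption: it takes 8-byte values and 4-byte column indices and ignores vector traffic and caching, so it indicates a trend rather than a measured result.

```python
# Minimal sketch (assumed CSR layout, hypothetical names): sparse matrix times
# one vector (SpMV) versus sparse matrix times a block of k vectors (SpMM).

def spmv_csr(row_ptr, col_idx, vals, x):
    """Return y = A @ x for A in CSR format; x is a list of length n_cols."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y


def spmm_csr(row_ptr, col_idx, vals, X):
    """Return Y = A @ X for A in CSR format; X is a list of rows of length k.

    Each nonzero of A is loaded once and reused for all k columns of X.
    """
    n_rows = len(row_ptr) - 1
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        row_out = Y[i]
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[p]
            x_row = X[col_idx[p]]
            for j in range(k):
                row_out[j] += a * x_row[j]
    return Y


def flops_per_matrix_byte(k, value_bytes=8, index_bytes=4):
    """Rough flops per byte of matrix data read (assumed model, see lead-in)."""
    return 2.0 * k / (value_bytes + index_bytes)


# 3x3 example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]

x = [1.0, 2.0, 3.0]
X = [[1.0, 1.0], [2.0, 0.0], [3.0, -1.0]]  # column 0 of X equals x

y = spmv_csr(row_ptr, col_idx, vals, x)
Y = spmm_csr(row_ptr, col_idx, vals, X)
```

In this simplified model, each matrix nonzero contributes 2 flops per vector, so processing k vectors together raises the work done per byte of matrix data by a factor of k. On real hardware the gain would be smaller, since the vector blocks also have to be moved.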


Sponsoring Organization: USDOE

Grant/Contract Number: SC0010042

OSTI ID: 1437692

Journal Information: International Journal of High Performance Computing Applications, Vol. 31, Issue 5; ISSN 1094-3420

Publisher: SAGE Publications

Country of Publication: United States

Language: English


