A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

Song, Fengguang; Dongarra, Jack

doi:10.1002/cpe.3403

Journal Article · October 1, 2014 · Concurrency and Computation: Practice and Experience

DOI: https://doi.org/10.1002/cpe.3403 · OSTI ID: 1361295

Song, Fengguang [1]; Dongarra, Jack [2]

[1] Indiana Univ.-Purdue Univ., Indianapolis, IN (United States)
[2] Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms that maximize the degree of parallelism, minimize the communication volume, and accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed on our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to resolve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we attain scalable performance for the double-precision Cholesky and QR factorizations. Finally, our approach demonstrates performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor libraries (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
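To make the idea of scheduling a task graph from data dependencies concrete, the following is a minimal single-process sketch in Python. It is not the authors' heterogeneous tile algorithm or their distributed runtime: it assumes the standard right-looking tile Cholesky task pattern (POTRF, TRSM, SYRK, GEMM), derives dependencies from which tiles each task reads and writes, and releases tasks in dataflow order from a ready queue. The function names `cholesky_tasks`, `build_dependencies`, and `dataflow_order` are hypothetical, and no numerical kernels or CPU/GPU placement are modeled.

```python
from collections import defaultdict, deque


def cholesky_tasks(nt):
    """List (task, tiles_read, tile_written) for a tile Cholesky on an nt x nt tile grid.

    Each task updates its written tile in place; only the lower triangle is used.
    """
    tasks = []
    for k in range(nt):
        tasks.append((("POTRF", k), [], (k, k)))          # factor diagonal tile
        for i in range(k + 1, nt):
            tasks.append((("TRSM", i, k), [(k, k)], (i, k)))  # solve panel tile
        for i in range(k + 1, nt):
            tasks.append((("SYRK", i, k), [(i, k)], (i, i)))  # update diagonal tile
            for j in range(k + 1, i):
                # update off-diagonal tile of the trailing submatrix
                tasks.append((("GEMM", i, j, k), [(i, k), (j, k)], (i, j)))
    return tasks


def build_dependencies(tasks):
    """Map each task to the set of tasks it must wait for.

    A task waits for the last writer of every tile it reads or writes.
    Assumption: write-after-read hazards are not tracked, because in this
    task pattern a tile is only read by other tasks after its final write.
    """
    last_writer = {}
    deps = {}
    for name, reads, write in tasks:
        deps[name] = {last_writer[t] for t in list(reads) + [write] if t in last_writer}
        last_writer[write] = name
    return deps


def dataflow_order(deps):
    """Return one valid execution order by releasing tasks as their inputs complete."""
    remaining = {task: len(d) for task, d in deps.items()}
    successors = defaultdict(list)
    for task, d in deps.items():
        for parent in d:
            successors[parent].append(task)
    ready = deque(task for task, n in remaining.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a real runtime would run the kernel here
        for child in successors[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    deps = build_dependencies(cholesky_tasks(4))
    for task in dataflow_order(deps):
        print(task)
```

In this sketch every task in the ready queue at the same moment is independent of the others, which is the parallelism a dynamic scheduler can exploit; a distributed runtime would additionally decide which processing unit owns each tile and move tiles between nodes.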

Research Organization: Indiana Univ.-Purdue Univ., Indianapolis, IN (United States); Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

Contributing Organization: Univ. of Manchester (United Kingdom)

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1361295

Journal Information: Concurrency and Computation: Practice and Experience, Vol. 27, Issue 14; ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 6 works

Citation information provided by Web of Science

Cited By (2)

Zheng, Weijian; Song, Fengguang; Lin, Lan. Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling. Parallel Processing Letters, Vol. 28, Issue 01 (journal, March 2018). https://doi.org/10.1142/s0129626418500044
Bastem, Burak; Unat, Didem. Tiling-Based Programming Model for Structured Grids on GPU Clusters. HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (conference, January 2020). https://doi.org/10.1145/3368474.3368485

Similar Records

Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

Technical Report · June 1, 2011 · OSTI ID:1361295

Song, Fengguang; Tomov, Stanimire; Dongarra, Jack

Batched matrix computations on hardware accelerators based on GPUs

Journal Article · February 9, 2015 · International Journal of High Performance Computing Applications · OSTI ID:1361295

Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; +2 more

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Technical Report · November 29, 2019 · OSTI ID:1361295

Shen, Xipeng

Related Subjects

97 MATHEMATICS AND COMPUTING
dense linear algebra
heterogeneous HPC systems
distributed dataflow scheduling
runtime systems
