Title: A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.3403 · OSTI ID: 1361295
 [1];  [2]
  1. Indiana Univ.-Purdue Univ., Indianapolis, IN (United States)
  2. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)

To fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems when solving dense linear algebra problems, we design a class of heterogeneous tile algorithms that maximize the degree of parallelism, minimize the communication volume, and accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms run on our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to resolve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we attain scalable performance for the double-precision Cholesky and QR factorizations. Finally, our approach delivers performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor libraries (e.g., Intel MKL) and open source libraries (e.g., StarPU) in three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
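To make the idea of dynamically scheduling a task graph concrete, the following is a minimal sketch and not the paper's implementation. It assumes the conventional (homogeneous) tile Cholesky factorization rather than the heterogeneous tile algorithms described above, and it uses a single centralized ready queue, whereas the abstract describes a decentralized runtime. The task names follow the LAPACK/BLAS kernels POTRF, TRSM, SYRK, and GEMM; the function names `cholesky_tasks` and `schedule` are hypothetical.

```python
from collections import deque


def cholesky_tasks(nt):
    """Map each task of an nt x nt tile Cholesky to the set of tasks it waits on.

    Assumption: the conventional right-looking tile algorithm on the lower
    triangle. Tasks are tuples such as ("GEMM", i, j, k) for "update tile
    (i, j) at step k".
    """
    deps = {}
    for k in range(nt):
        # Factor diagonal tile (k, k); it must have received its last update.
        deps[("POTRF", k)] = {("SYRK", k, k - 1)} if k > 0 else set()
        for i in range(k + 1, nt):
            # Triangular solve on tile (i, k) against the factored diagonal tile.
            d = {("POTRF", k)}
            if k > 0:
                d.add(("GEMM", i, k, k - 1))
            deps[("TRSM", i, k)] = d
        for i in range(k + 1, nt):
            # Symmetric update of diagonal tile (i, i) with tile (i, k).
            d = {("TRSM", i, k)}
            if k > 0:
                d.add(("SYRK", i, k - 1))
            deps[("SYRK", i, k)] = d
            for j in range(k + 1, i):
                # Update of off-diagonal tile (i, j) with tiles (i, k) and (j, k).
                d = {("TRSM", i, k), ("TRSM", j, k)}
                if k > 0:
                    d.add(("GEMM", i, j, k - 1))
                deps[("GEMM", i, j, k)] = d
    return deps


def schedule(deps):
    """Release each task as soon as all of its predecessors have finished."""
    remaining = {task: len(d) for task, d in deps.items()}
    children = {task: [] for task in deps}
    for task, d in deps.items():
        for parent in d:
            children[parent].append(task)
    ready = deque(task for task, n in remaining.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()  # a real runtime would hand this to a CPU or GPU
        order.append(task)
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    for task in schedule(cholesky_tasks(3)):
        print(task)
```

In this sketch every task in the ready queue at the same time is independent of the others, which is the parallelism a dynamic scheduler can exploit; how tasks are assigned to CPUs and GPUs and how tiles move between nodes is left out entirely.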

Research Organization:
Indiana Univ.-Purdue Univ., Indianapolis, IN (United States); Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
Contributing Organization:
Univ. of Manchester (United Kingdom)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1361295
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 27, Issue 14; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 6 works
Citation information provided by
Web of Science
