A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
 Authors:
 Song, Fengguang ^{[1]}; Dongarra, Jack ^{[2]}
 Indiana Univ.-Purdue Univ., Indianapolis, IN (United States)
 Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
 Publication Date:
 Grant/Contract Number:
 AC05-00OR22725
 Type:
 Accepted Manuscript
 Journal Name:
 Concurrency and Computation. Practice and Experience
 Additional Journal Information:
 Journal Volume: 27; Journal Issue: 14; Journal ID: ISSN 1532-0626
 Publisher:
 Wiley
 Research Org:
 Indiana Univ.-Purdue Univ., Indianapolis, IN (United States); Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
 Sponsoring Org:
 USDOE
 Contributing Orgs:
 Univ. of Manchester (United Kingdom)
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING; dense linear algebra; heterogeneous HPC systems; distributed dataflow scheduling; runtime systems
 OSTI Identifier:
 1361295
Song, Fengguang, and Dongarra, Jack. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems. United States: N. p., Web. doi:10.1002/cpe.3403.
Song, Fengguang, & Dongarra, Jack. A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems. United States. doi:10.1002/cpe.3403.
Song, Fengguang, and Dongarra, Jack. 2014. "A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems". United States. doi:10.1002/cpe.3403. https://www.osti.gov/servlets/purl/1361295.
@article{osti_1361295,
title = {A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems},
author = {Song, Fengguang and Dongarra, Jack},
abstractNote = {Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.},
doi = {10.1002/cpe.3403},
journal = {Concurrency and Computation. Practice and Experience},
number = 14,
volume = 27,
place = {United States},
year = {2014},
month = {10}
}