A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
Abstract
We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with $$n$$ vertices, our algorithm reduces communication volume asymptotically in $$n$$ by a factor of $$\mathscr{O}\big(\sqrt{\log n}\big)$$ and latency by a factor of $$\mathscr{O}(\log n)$$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $$\mathscr{O}\big(n^{1/3}\big)$$. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
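To make the idea of a three-dimensional process grid concrete, the following is a minimal sketch of how a linear process rank might be mapped onto a `px × py × pz` grid, and how the ranks forming one 2D layer could be listed. The function names, the x-fastest rank ordering, and the round-robin assignment of elimination-tree subtrees to layers are all assumptions made for illustration; they are not taken from the paper or from the SuperLU_DIST API.

```python
from typing import List, NamedTuple


class GridCoord(NamedTuple):
    x: int  # row index within a 2D layer
    y: int  # column index within a 2D layer
    z: int  # layer index along the third grid dimension


def rank_to_coord(rank: int, px: int, py: int, pz: int) -> GridCoord:
    """Map a linear rank to (x, y, z); x varies fastest (assumed layout)."""
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank lies outside the process grid")
    z, rem = divmod(rank, px * py)  # which 2D layer, and offset inside it
    y, x = divmod(rem, px)          # column and row inside that layer
    return GridCoord(x, y, z)


def layer_ranks(z: int, px: int, py: int) -> List[int]:
    """Return the linear ranks that make up 2D layer z."""
    start = z * px * py
    return list(range(start, start + px * py))


def layer_of_subtree(subtree: int, pz: int) -> int:
    """Hypothetical round-robin assignment of independent subtrees to layers."""
    return subtree % pz


if __name__ == "__main__":
    # A 2 x 3 x 4 grid holds 24 processes arranged as four 2D layers of six.
    print(rank_to_coord(23, 2, 3, 4))
    print(layer_ranks(1, 2, 3))
    print(layer_of_subtree(5, 4))
```

In this sketch each z-layer is an ordinary 2D process grid, so independent subtrees of the elimination tree can be handled by different layers at the same time; any data shared between layers would have to be replicated, which is the memory-for-communication trade the abstract describes.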
- Authors:
- Sao, Piyush; Li, Xiaoye Sherry; Vuduc, Richard
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Georgia Inst. of Technology, Atlanta, GA (United States)
- Publication Date:
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1559632
- Alternate Identifier(s):
- OSTI ID: 1547464
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Journal Article: Accepted Manuscript
- Journal Name:
- Journal of Parallel and Distributed Computing
- Additional Journal Information:
- Journal Volume: 131; Journal Issue: 9; Journal ID: ISSN 0743-7315
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States: N. p., 2019. Web. doi:10.1016/j.jpdc.2019.03.004.
Sao, Piyush, Li, Xiaoye Sherry, & Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States. https://doi.org/10.1016/j.jpdc.2019.03.004
Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. 2019. "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems". United States. https://doi.org/10.1016/j.jpdc.2019.03.004. https://www.osti.gov/servlets/purl/1559632.
@article{osti_1559632,
title = {A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems},
author = {Sao, Piyush and Li, Xiaoye Sherry and Vuduc, Richard},
abstractNote = {We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with $n$ vertices, our algorithm reduces communication volume asymptotically in $n$ by a factor of $\mathscr{O}\big(\sqrt{\log n}\big)$ and latency by a factor of $\mathscr{O}(\log n)$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $\mathscr{O}\big(n^{1/3}\big)$. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.},
doi = {10.1016/j.jpdc.2019.03.004},
url = {https://www.osti.gov/biblio/1559632},
journal = {Journal of Parallel and Distributed Computing},
issn = {0743-7315},
number = 9,
volume = 131,
place = {United States},
year = {2019},
month = {8}
}
Works referenced in this record:
Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations
journal, January 2016
- Agullo, Emmanuel; Amestoy, Patrick R.; Buttari, Alfredo
- SIAM Journal on Scientific Computing, Vol. 38, Issue 3
Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems
journal, September 2016
- Agullo, Emmanuel; Buttari, Alfredo; Guermouche, Abdou
- ACM Transactions on Mathematical Software, Vol. 43, Issue 2
PT-Scotch: A tool for efficient parallel graph ordering
journal, July 2008
- Chevalier, C.; Pellegrini, F.
- Parallel Computing, Vol. 34, Issue 6-8
A Supernodal Approach to Sparse Partial Pivoting
journal, January 1999
- Demmel, James W.; Eisenstat, Stanley C.; Gilbert, John R.
- SIAM Journal on Matrix Analysis and Applications, Vol. 20, Issue 3
Communication Avoiding Rank Revealing QR Factorization with Column Pivoting
journal, January 2015
- Demmel, James W.; Grigori, Laura; Gu, Ming
- SIAM Journal on Matrix Analysis and Applications, Vol. 36, Issue 1
Parallel Scheduling of Task Trees with Limited Memory
journal, July 2015
- Eyraud-Dubois, Lionel; Marchal, Loris; Sinnen, Oliver
- ACM Transactions on Parallel Computing, Vol. 2, Issue 2
A separator theorem for graphs of bounded genus
journal, September 1984
- Gilbert, John R.; Hutchinson, Joan P.; Tarjan, Robert Endre
- Journal of Algorithms, Vol. 5, Issue 3
Graph Grammar based Multi-thread Multi-frontal Direct Solver with Galois Scheduler
journal, January 2014
- Goik, Damian; Jopek, Konrad; Paszyński, Maciej
- Procedia Computer Science, Vol. 29
CALU: A Communication Optimal LU Factorization Algorithm
journal, October 2011
- Grigori, Laura; Demmel, James W.; Xiang, Hua
- SIAM Journal on Matrix Analysis and Applications, Vol. 32, Issue 4
Highly scalable parallel algorithms for sparse matrix factorization
journal, May 1997
- Gupta, A.; Karypis, G.; Kumar, V.
- IEEE Transactions on Parallel and Distributed Systems, Vol. 8, Issue 5
Parallel Algorithms for Sparse Linear Systems
journal, September 1991
- Heath, Michael T.; Ng, Esmond; Peyton, Barry W.
- SIAM Review, Vol. 33, Issue 3
Limiting Communication in Parallel Sparse Cholesky Factorization
journal, September 1991
- Hulbert, Laurie; Zmijewski, Earl
- SIAM Journal on Scientific and Statistical Computing, Vol. 12, Issue 5
Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers
journal, March 2002
- Irony, Dror; Toledo, Sivan
- Parallel Processing Letters, Vol. 12, Issue 01
Communication lower bounds for distributed-memory matrix multiplication
journal, September 2004
- Irony, Dror; Toledo, Sivan; Tiskin, Alexander
- Journal of Parallel and Distributed Computing, Vol. 64, Issue 9
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
journal, January 2013
- Khabou, Amal; Demmel, James W.; Grigori, Laura
- SIAM Journal on Matrix Analysis and Applications, Vol. 34, Issue 3
A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling
journal, October 2014
- Kim, Kyungjoo; Eijkhout, Victor
- ACM Transactions on Mathematical Software, Vol. 41, Issue 1
An overview of SuperLU: Algorithms, implementation, and user interface
journal, September 2005
- Li, Xiaoye S.
- ACM Transactions on Mathematical Software, Vol. 31, Issue 3
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
journal, June 2003
- Li, Xiaoye S.; Demmel, James W.
- ACM Transactions on Mathematical Software, Vol. 29, Issue 2
A Separator Theorem for Planar Graphs
journal, April 1979
- Lipton, Richard J.; Tarjan, Robert Endre
- SIAM Journal on Applied Mathematics, Vol. 36, Issue 2
SymPy: symbolic computing in Python
journal, January 2017
- Meurer, Aaron; Smith, Christopher P.; Paprocki, Mateusz
- PeerJ Computer Science, Vol. 3
A CPU–GPU hybrid approach for the unsymmetric multifrontal method
journal, December 2011
- Yu, Chenhan D.; Wang, Weichung; Pierce, Dan’l
- Parallel Computing, Vol. 37, Issue 12
Works referencing / citing this record:
Preparing sparse solvers for exascale computing
journal, January 2020
- Anzt, Hartwig; Boman, Erik; Falgout, Rob
- Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166