A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Sao, Piyush; Li, Xiaoye Sherry; Vuduc, Richard

doi:10.1016/j.jpdc.2019.03.004

Journal Article · August 19, 2019 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2019.03.004 · OSTI ID: 1559632

Sao, Piyush ^[1]; Li, Xiaoye Sherry ^[2]; Vuduc, Richard ^[3]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[3] Georgia Inst. of Technology, Atlanta, GA (United States)

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with $$n$$ vertices, our algorithm reduces communication volume asymptotically in $$n$$ by a factor of $$\mathscr{O}\big(\sqrt{\log n}\big)$$ and latency by a factor of $$\mathscr{O}(\log n)$$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $$\mathscr{O}\big(n^{1/3}\big)$$. In all cases, the increase in memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
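As a rough illustration of the three-dimensional process grid mentioned in the abstract, the following Python sketch arranges linear ranks into a px × py × pz grid, groups them into pz two-dimensional layers, and spreads independent elimination-tree subtrees over those layers. It is a sketch under stated assumptions and not the SuperLU_DIST implementation: the function names, the rank ordering (x fastest, then y, then z), and the round-robin subtree assignment are all hypothetical choices made here for illustration.

```python
from typing import Dict, List, Tuple


def rank_to_coords(rank: int, px: int, py: int, pz: int) -> Tuple[int, int, int]:
    """Map a linear rank to (x, y, z) in a px x py x pz process grid.

    Assumed layout (not taken from the paper): x varies fastest, then y,
    then z, so each z-layer is a contiguous block of px * py ranks.
    """
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank lies outside the process grid")
    z, rest = divmod(rank, px * py)
    y, x = divmod(rest, px)
    return x, y, z


def layers(px: int, py: int, pz: int) -> Dict[int, List[int]]:
    """Group ranks by z coordinate; each group acts as one 2D process grid."""
    groups: Dict[int, List[int]] = {z: [] for z in range(pz)}
    for rank in range(px * py * pz):
        _, _, z = rank_to_coords(rank, px, py, pz)
        groups[z].append(rank)
    return groups


def assign_subtrees(num_subtrees: int, pz: int) -> Dict[int, List[int]]:
    """Assign independent elimination-tree subtrees to layers round-robin.

    Hypothetical policy: subtree i goes to layer i % pz, so different
    layers can work on different subtrees at the same time.
    """
    assignment: Dict[int, List[int]] = {z: [] for z in range(pz)}
    for subtree in range(num_subtrees):
        assignment[subtree % pz].append(subtree)
    return assignment


if __name__ == "__main__":
    # A 2 x 2 x 2 grid of 8 processes forms two 2D layers of 4 ranks each.
    print(layers(2, 2, 2))
    print(assign_subtrees(5, 2))
```

In this toy layout, replicating shared data across the pz layers is what would cost the extra memory, while each layer only communicates within its own px × py grid for the subtrees it owns.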

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1559632

Alternate ID(s): OSTI ID: 1547464

Journal Information: Journal Name: Journal of Parallel and Distributed Computing; Journal Issue: 9; Vol. 131; ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
