A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Sao, Piyush; Li, Xiaoye Sherry; Vuduc, Richard

doi:10.1016/j.jpdc.2019.03.004

Abstract

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with $$n$$ vertices, our algorithm reduces communication volume asymptotically in $$n$$ by a factor of $$\mathscr{O}\Big(\sqrt{\log n}\Big)$$ and latency by a factor of $$\mathscr{O}(\log n)$$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by $$\mathscr{O}\Big(n^{1/3}\Big)$$ times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
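The memory-for-communication trade described in the abstract can be illustrated with a small sketch. It is not taken from the paper or from SuperLU_DIST. It assumes a process grid of size px × py × pz viewed as pz independent 2D layers, a pz that is a power of two, and a top portion of the elimination tree that is a complete binary tree of depth log2(pz). Under those assumptions, a tree node at a given level is replicated on a contiguous group of layers, so deeper subtrees can be factored by separate layers in parallel while shallow ancestors cost extra copies. The names `grid_coords`, `replicating_layers`, and `copies_per_level` are hypothetical.

```python
def grid_coords(rank, px, py, pz):
    """Map a linear rank to (x, y, z) coordinates in a px * py * pz grid."""
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank out of range for this grid")
    z, rem = divmod(rank, px * py)   # z selects the 2D layer
    y, x = divmod(rem, px)           # (x, y) is the position inside that layer
    return x, y, z


def replicating_layers(level, index, pz):
    """Layers (z indices) that hold a copy of the tree node (level, index).

    Assumption: level 0 is the root, and level l has 2**l nodes.
    """
    if pz < 1 or pz & (pz - 1):
        raise ValueError("pz is assumed to be a power of two")
    depth = pz.bit_length() - 1      # log2(pz)
    if not 0 <= level <= depth:
        raise ValueError("level outside the replicated part of the tree")
    if not 0 <= index < (1 << level):
        raise ValueError("index out of range for this level")
    width = pz >> level              # number of layers sharing this node
    return range(index * width, (index + 1) * width)


def copies_per_level(pz):
    """Number of copies kept of each node, listed from the root downward."""
    depth = pz.bit_length() - 1
    return [pz >> level for level in range(depth + 1)]


if __name__ == "__main__":
    print(grid_coords(5, px=2, py=2, pz=4))
    print(list(replicating_layers(level=1, index=1, pz=4)))
    print(copies_per_level(4))
```

In this toy model only the nodes near the root are copied many times, which is one way to picture why replication can stay bounded while independent subtrees need no communication between layers.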

Authors:

Sao, Piyush ^[1]; Li, Xiaoye Sherry ^[2]; Vuduc, Richard ^[3]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[3] Georgia Inst. of Technology, Atlanta, GA (United States)

Publication Date: 2019-08-19

Research Org.: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1559632

Alternate Identifier(s): OSTI ID: 1547464

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: Journal of Parallel and Distributed Computing

Additional Journal Information: Journal Volume: 131; Journal Issue: 9; Journal ID: ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States: N. p., 2019. Web. doi:10.1016/j.jpdc.2019.03.004.

Sao, Piyush, Li, Xiaoye Sherry, & Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States. https://doi.org/10.1016/j.jpdc.2019.03.004

Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. 2019. "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems". United States. https://doi.org/10.1016/j.jpdc.2019.03.004. https://www.osti.gov/servlets/purl/1559632.
                    
@article{osti_1559632,
  title        = {A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems},
  author       = {Sao, Piyush and Li, Xiaoye Sherry and Vuduc, Richard},
  abstractNote = {We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with $n$ vertices, our algorithm reduces communication volume asymptotically in $n$ by a factor of $\mathscr{O}\Big(\sqrt{\log n}\Big)$ and latency by a factor of $\mathscr{O}(\log n)$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by $\mathscr{O}\Big(n^{1/3}\Big)$ times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.},
  doi          = {10.1016/j.jpdc.2019.03.004},
  journal      = {Journal of Parallel and Distributed Computing},
  number       = 9,
  volume       = 131,
  place        = {United States},
  year         = {2019},
  month        = aug
}

Citation Metrics:

Cited by: 2 works (citation information provided by Web of Science)

Works referenced in this record:

Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations
journal, January 2016

Agullo, Emmanuel; Amestoy, Patrick R.; Buttari, Alfredo
SIAM Journal on Scientific Computing, Vol. 38, Issue 3
DOI: 10.1137/130938505

Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems
journal, September 2016

Agullo, Emmanuel; Buttari, Alfredo; Guermouche, Abdou
ACM Transactions on Mathematical Software, Vol. 43, Issue 2
DOI: 10.1145/2898348

PT-Scotch: A tool for efficient parallel graph ordering
journal, July 2008

Chevalier, C.; Pellegrini, F.
Parallel Computing, Vol. 34, Issue 6-8
DOI: 10.1016/j.parco.2007.12.001

Communication Avoiding Rank Revealing QR Factorization with Column Pivoting
journal, January 2015

Demmel, James W.; Grigori, Laura; Gu, Ming
SIAM Journal on Matrix Analysis and Applications, Vol. 36, Issue 1
DOI: 10.1137/13092157X

Parallel Scheduling of Task Trees with Limited Memory
journal, July 2015

Eyraud-Dubois, Lionel; Marchal, Loris; Sinnen, Oliver
ACM Transactions on Parallel Computing, Vol. 2, Issue 2
DOI: 10.1145/2779052

A separator theorem for graphs of bounded genus
journal, September 1984

Gilbert, John R.; Hutchinson, Joan P.; Tarjan, Robert Endre
Journal of Algorithms, Vol. 5, Issue 3
DOI: 10.1016/0196-6774(84)90019-1

Graph Grammar based Multi-thread Multi-frontal Direct Solver with Galois Scheduler
journal, January 2014

Goik, Damian; Jopek, Konrad; Paszyński, Maciej
Procedia Computer Science, Vol. 29
DOI: 10.1016/j.procs.2014.05.086

CALU: A Communication Optimal LU Factorization Algorithm
journal, October 2011

Grigori, Laura; Demmel, James W.; Xiang, Hua
SIAM Journal on Matrix Analysis and Applications, Vol. 32, Issue 4
DOI: 10.1137/100788926

Highly scalable parallel algorithms for sparse matrix factorization
journal, May 1997

Gupta, A.; Karypis, G.; Kumar, V.
IEEE Transactions on Parallel and Distributed Systems, Vol. 8, Issue 5
DOI: 10.1109/71.598277

Parallel Algorithms for Sparse Linear Systems
journal, September 1991

Heath, Michael T.; Ng, Esmond; Peyton, Barry W.
SIAM Review, Vol. 33, Issue 3
DOI: 10.1137/1033099

Limiting Communication in Parallel Sparse Cholesky Factorization
journal, September 1991

Hulbert, Laurie; Zmijewski, Earl
SIAM Journal on Scientific and Statistical Computing, Vol. 12, Issue 5
DOI: 10.1137/0912063

Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers
journal, March 2002

Irony, Dror; Toledo, Sivan
Parallel Processing Letters, Vol. 12, Issue 01
DOI: 10.1142/S0129626402000847

Communication lower bounds for distributed-memory matrix multiplication
journal, September 2004

Irony, Dror; Toledo, Sivan; Tiskin, Alexander
Journal of Parallel and Distributed Computing, Vol. 64, Issue 9
DOI: 10.1016/j.jpdc.2004.03.021

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
journal, January 2013

Khabou, Amal; Demmel, James W.; Grigori, Laura
SIAM Journal on Matrix Analysis and Applications, Vol. 34, Issue 3
DOI: 10.1137/120863691

A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling
journal, October 2014

Kim, Kyungjoo; Eijkhout, Victor
ACM Transactions on Mathematical Software, Vol. 41, Issue 1
DOI: 10.1145/2629641

An overview of SuperLU: Algorithms, implementation, and user interface
journal, September 2005

Li, Xiaoye S.
ACM Transactions on Mathematical Software, Vol. 31, Issue 3
DOI: 10.1145/1089014.1089017

SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
journal, June 2003

Li, Xiaoye S.; Demmel, James W.
ACM Transactions on Mathematical Software, Vol. 29, Issue 2
DOI: 10.1145/779359.779361

A Separator Theorem for Planar Graphs
journal, April 1979

Lipton, Richard J.; Tarjan, Robert Endre
SIAM Journal on Applied Mathematics, Vol. 36, Issue 2
DOI: 10.1137/0136016

SymPy: symbolic computing in Python
journal, January 2017

Meurer, Aaron; Smith, Christopher P.; Paprocki, Mateusz
PeerJ Computer Science, Vol. 3
DOI: 10.7717/peerj-cs.103

A CPU–GPU hybrid approach for the unsymmetric multifrontal method
journal, December 2011

Yu, Chenhan D.; Wang, Weichung; Pierce, Dan’l
Parallel Computing, Vol. 37, Issue 12
DOI: 10.1016/j.parco.2011.09.002

Works referencing / citing this record:

Preparing sparse solvers for exascale computing
journal, January 2020

Anzt, Hartwig; Boman, Erik; Falgout, Rob
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
DOI: 10.1098/rsta.2019.0053

Similar Records in DOE PAGES and OSTI.GOV collections:

A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices

Conference: Sao, Piyush; Li, Xiaoye Sherry; Vuduc, Richard

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O(√(log n)) and latency by a factor of O(log n). For non-planar …
https://doi.org/10.1109/IPDPS.2018.00100

A communication-avoiding 3D sparse triangular solver

Conference: Sao, Piyush; Kannan, Ramakrishnan; Li, Xiaoye Sherry; ...

We present a novel distributed memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n^(1/4)) …
https://doi.org/10.1145/3330345.3330357

Highly scalable distributed-memory sparse triangular solution algorithms.

Conference: Liu, Yang; Jacquelin, Mathias; Ghysels, Pieter; ...

This paper presents a highly efficient distributed-memory parallel sparse triangular solver. The triangular solution phase is often performed following factorization phase in the sparse linear solvers and has become increasingly computationally expensive for direct solvers with many right hand sides (RHSs) or preconditioned iterative solvers. However, the low arithmetic intensity and sequential nature of the triangular solve algorithm pose performance challenges for its large-scale distributed-memory parallelization. In this work, we propose several strategies to enhance scalability of an algorithm with 2D block cyclic process layout. First, an asynchronous binary-tree-based communication scheme implemented via non-blocking MPI functions is leveraged to broadcast …
https://doi.org/10.1137/1.9781611975215.9

Brief Announcement: Communication Optimal Sparse LU Factorization for Planar Matrices

Conference: Sao, Piyush; Li, Xiaoye Sherry

We introduce a new parallel algorithm for solving sparse LU factorization of planar matrices, which commonly arise in the finite element method for 2D PDEs. Existing scalable methods, such as the multifrontal approach with subtree-to-subcube mapping by Gupta et al. [1] and right-looking with 3D mapping by Sao et al. [2] fail to achieve optimal communication costs for these matrices. Our new algorithm combines 3D mapping and subtree-to-subcube mapping to minimize communication costs while allowing trade-offs between extra memory and reduced communication. We demonstrate that our proposed algorithm attains the communication lower bound up to a factor of O(log log …
https://doi.org/10.1145/3558481.3591315

GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

Journal Article: Clay, M. P.; Buaria, D.; Yeung, P. K.; ... - Computer Physics Communications

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313–328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single …
https://doi.org/10.1016/j.cpc.2018.02.020
