DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Abstract

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed-memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically, 3D grids and meshes). For a planar graph with $$n$$ vertices, our algorithm reduces communication volume asymptotically in $$n$$ by a factor of $$\mathscr{O}\big(\sqrt{\log n}\big)$$ and latency by a factor of $$\mathscr{O}(\log n)$$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $$\mathscr{O}\big(n^{1/3}\big)$$. In all cases, the extra memory needed to achieve these gains is only a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups of up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm to heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups of up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
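To make the idea of a three-dimensional process grid concrete, here is a minimal Python sketch. It is not taken from SuperLU_DIST or from the paper; the function names `rank_to_coords` and `layer_groups` are hypothetical, and the layout rests on stated assumptions: the grid is treated as `pz` stacked 2D layers of `px * py` processes, `pz` is a power of two, and the top of the elimination tree is a complete binary tree whose shared ancestor nodes are replicated across the layers below them. That replication is where the extra memory would come from under these assumptions.

```python
from typing import List, Tuple


def rank_to_coords(rank: int, px: int, py: int, pz: int) -> Tuple[int, int, int]:
    """Map a flat rank to (x, y, z); z selects one of the pz 2D layers."""
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank is outside the px * py * pz grid")
    z, within_layer = divmod(rank, px * py)
    y, x = divmod(within_layer, px)
    return x, y, z


def layer_groups(level: int, pz: int) -> List[List[int]]:
    """Return, for each tree node at `level` (0 = root), the layers sharing it.

    Assumption (illustrative only): level `level` of the elimination tree has
    2**level nodes, and each node is replicated on pz / 2**level consecutive
    layers. At the deepest level every layer owns an independent subtree.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    nodes = 2 ** level
    if nodes > pz or pz % nodes != 0:
        raise ValueError("pz must be divisible by 2**level")
    size = pz // nodes
    return [list(range(g * size, (g + 1) * size)) for g in range(nodes)]


if __name__ == "__main__":
    px, py, pz = 2, 2, 4
    print(rank_to_coords(5, px, py, pz))
    for level in range(3):
        print(level, layer_groups(level, pz))
```

With `pz = 4`, level 2 assigns one independent subtree to each layer, level 1 pairs the layers as `[0, 1]` and `[2, 3]`, and the root at level 0 is shared by all four layers. In this sketch, independent subtrees need no communication between layers, which is one way to picture how elimination tree parallelism can reduce per-process communication.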

Authors:
Sao, Piyush [1]; Li, Xiaoye Sherry [2]; Vuduc, Richard [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  3. Georgia Inst. of Technology, Atlanta, GA (United States)
Publication Date:
2019-08-19
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1559632
Alternate Identifier(s):
OSTI ID: 1547464
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Parallel and Distributed Computing
Additional Journal Information:
Journal Volume: 131; Journal Issue: 9; Journal ID: ISSN 0743-7315
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States: N. p., 2019. Web. doi:10.1016/j.jpdc.2019.03.004.
Sao, Piyush, Li, Xiaoye Sherry, & Vuduc, Richard. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. United States. https://doi.org/10.1016/j.jpdc.2019.03.004
Sao, Piyush, Li, Xiaoye Sherry, and Vuduc, Richard. 2019. "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems". United States. https://doi.org/10.1016/j.jpdc.2019.03.004. https://www.osti.gov/servlets/purl/1559632.
@article{osti_1559632,
title = {A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems},
author = {Sao, Piyush and Li, Xiaoye Sherry and Vuduc, Richard},
abstractNote = {We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed-memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically, 3D grids and meshes). For a planar graph with $n$ vertices, our algorithm reduces communication volume asymptotically in $n$ by a factor of $\mathscr{O}\big(\sqrt{\log n}\big)$ and latency by a factor of $\mathscr{O}(\log n)$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $\mathscr{O}\big(n^{1/3}\big)$. In all cases, the extra memory needed to achieve these gains is only a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups of up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm to heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups of up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.},
doi = {10.1016/j.jpdc.2019.03.004},
journal = {Journal of Parallel and Distributed Computing},
number = 9,
volume = 131,
place = {United States},
year = {2019},
month = {8}
}

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science
