U.S. Department of Energy
Office of Scientific and Technical Information

Reducing Communication in Graph Neural Network Training

Conference · SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Author affiliation (all three authors): Univ. of California, Berkeley, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. In this paper, we introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.
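As a rough illustration of the simplest case named in the abstract, the sketch below simulates a 1D block-row sparse-dense matrix multiplication in a single Python process. It is not the authors' implementation and does not use torch.distributed: the function names (`block_ranges`, `spmm_1d`), the adjacency-list storage, and the all-gather communication pattern are assumptions chosen for illustration. It counts how many dense values each simulated rank would have to receive.

```python
# Sketch: 1D block-row sparse-dense multiply, simulated in one process.
# All names here are hypothetical; nothing is taken from the paper's code.

def block_ranges(n, p):
    """Split range(n) into p contiguous, nearly equal blocks."""
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for rank in range(p):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def spmm_1d(adj_rows, feats, p):
    """Compute adj @ feats with both matrices split by block rows over p ranks.

    adj_rows[i] is a list of (column, value) pairs for row i of the sparse matrix.
    feats is a dense matrix stored as a list of rows.
    Returns (result, words_received_per_rank).
    """
    n = len(adj_rows)
    f = len(feats[0])
    ranges = block_ranges(n, p)
    # Each simulated rank owns one block of dense feature rows.
    local_feats = [feats[lo:hi] for lo, hi in ranges]
    result = [None] * n
    words_received = []
    for rank, (lo, hi) in enumerate(ranges):
        # Simulated all-gather: the rank assembles the full dense matrix,
        # receiving every row it does not already own.
        gathered = [row for block in local_feats for row in block]
        words_received.append((n - (hi - lo)) * f)
        # Local multiply: owned sparse rows times the gathered dense matrix.
        for i in range(lo, hi):
            out = [0.0] * f
            for col, val in adj_rows[i]:
                for k in range(f):
                    out[k] += val * gathered[col][k]
            result[i] = out
    return result, words_received

# Tiny example: a 4-vertex path graph with self loops, 2 features per vertex.
adj_rows = [
    [(0, 1.0), (1, 1.0)],
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(1, 1.0), (2, 1.0), (3, 1.0)],
    [(2, 1.0), (3, 1.0)],
]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
result, words_received = spmm_1d(adj_rows, feats, p=2)
```

In this simulated layout every rank receives nearly the whole dense matrix regardless of how many ranks there are, which is one way to see why the abstract also considers 1.5D, 2D, and 3D partitionings.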
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
DOE Contract Number:
AC02-05CH11231; AC05-00OR22725
OSTI ID:
1647608
Conference Information:
Journal Name: SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Country of Publication:
United States
Language:
English
