OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters

Abstract

We present an optimized Floyd-Warshall (FW) algorithm that computes all-pairs shortest paths (APSP) on GPU-accelerated clusters. Owing to its structural similarity to matrix multiplication, the Floyd-Warshall algorithm is well suited to highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: high communication overhead and limited GPU memory. To reduce communication costs, we redesign the parallel algorithm to (a) expose more parallelism, (b) aggressively overlap communication and computation via pipelined, asynchronous scheduling of operations, and (c) use tailored MPI collectives. To cope with limited GPU memory, we employ an offload model in which the data resides on the host and is transferred to the GPU on demand. The proposed optimizations are supported by detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOP/s on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.
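The abstract notes the structural similarity between Floyd-Warshall and matrix multiplication. Below is a minimal single-node C sketch of the blocked (min, +) formulation that underlies such GPU implementations: one tile kernel is applied in three phases per pivot step, and the phase-3 bulk is exactly a min-plus analogue of tiled matrix multiplication. The tile size, toy graph, and all names are illustrative assumptions; this is not the authors' distributed multi-GPU code, which additionally partitions tiles across GPUs and overlaps broadcasts of the pivot tile row and column with the phase-3 work.

/* Blocked Floyd-Warshall over the (min, +) semiring.
 * Illustrative sketch only; sizes and names are assumptions. */
#include <stdio.h>

#define N 8           /* number of vertices (toy size) */
#define B 4           /* tile width; N must be divisible by B */

static float d[N][N]; /* distance matrix, row-major, 1e30f = "infinity" */

/* FW-style min-plus update of tile C using tiles A and B:
 *   c[i][j] = min(c[i][j], a[i][k] + b[k][j]) for k = 0..B-1.
 * The k loop is outermost so in-place updates (when C aliases A or B,
 * as in phases 1 and 2) see the freshest values; when all three tiles
 * are distinct (phase 3), this is exactly a tiled min-plus GEMM. */
static void fw_tile(float *c, const float *a, const float *b) {
    for (int k = 0; k < B; ++k)
        for (int i = 0; i < B; ++i)
            for (int j = 0; j < B; ++j) {
                float via = a[i * N + k] + b[k * N + j];
                if (via < c[i * N + j]) c[i * N + j] = via;
            }
}

/* Pointer to the top-left element of tile (I, J) inside d. */
#define TILE(I, J) (&d[(I) * B][(J) * B])

int main(void) {
    /* Toy graph: 0 on the diagonal, "infinity" elsewhere, a few edges. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            d[i][j] = (i == j) ? 0.0f : 1e30f;
    d[0][3] = 2; d[3][5] = 1; d[5][7] = 4; d[0][1] = 9; d[1][7] = 10;

    int T = N / B;                        /* tiles per dimension */
    for (int K = 0; K < T; ++K) {
        /* Phase 1: pivot (diagonal) tile updates itself. */
        fw_tile(TILE(K, K), TILE(K, K), TILE(K, K));
        /* Phase 2: pivot row and pivot column tiles. */
        for (int J = 0; J < T; ++J) if (J != K)
            fw_tile(TILE(K, J), TILE(K, K), TILE(K, J));
        for (int I = 0; I < T; ++I) if (I != K)
            fw_tile(TILE(I, K), TILE(I, K), TILE(K, K));
        /* Phase 3: all remaining tiles; this GEMM-like bulk is what a
         * multi-GPU version distributes and overlaps with the
         * broadcasts of the K-th tile row and column. */
        for (int I = 0; I < T; ++I) if (I != K)
            for (int J = 0; J < T; ++J) if (J != K)
                fw_tile(TILE(I, J), TILE(I, K), TILE(K, J));
    }
    printf("d(0,7) = %g\n", d[0][7]);     /* expect 7 via 0 -> 3 -> 5 -> 7 */
    return 0;
}

Compiled with a plain "cc fw.c && ./a.out", the sketch prints d(0,7) = 7, the length of the path 0 -> 3 -> 5 -> 7 in the toy graph; the shorter two-hop alternative through vertex 1 costs 19 and is correctly discarded.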

Authors:
Sao, Piyush [1]; Lu, Hao [1]; Kannan, Ramakrishnan [1]; Thakkar, Vijay [2]; Vuduc, Richard [3]; Potok, Thomas [1]
  1. ORNL
  2. Georgia Institute of Technology
  3. Georgia Institute of Technology, Atlanta
Publication Date:
June 2020
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1814306
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21), Stockholm, Sweden, June 21-24, 2021
Country of Publication:
United States
Language:
English

Citation Formats

Sao, Piyush, Lu, Hao, Kannan, Ramakrishnan, Thakkar, Vijay, Vuduc, Richard, and Potok, Thomas. Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters. United States: N. p., 2020. Web.
Sao, Piyush, Lu, Hao, Kannan, Ramakrishnan, Thakkar, Vijay, Vuduc, Richard, & Potok, Thomas. Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters. United States.
Sao, Piyush, Lu, Hao, Kannan, Ramakrishnan, Thakkar, Vijay, Vuduc, Richard, and Potok, Thomas. 2020. "Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters". United States. https://www.osti.gov/servlets/purl/1814306.
@inproceedings{osti_1814306,
title = {Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters},
author = {Sao, Piyush and Lu, Hao and Kannan, Ramakrishnan and Thakkar, Vijay and Vuduc, Richard and Potok, Thomas},
abstractNote = {We present an optimized Floyd-Warshall (FW) algorithm that computes all-pairs shortest paths (APSP) on GPU-accelerated clusters. Owing to its structural similarity to matrix multiplication, the Floyd-Warshall algorithm is well suited to highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: high communication overhead and limited GPU memory. To reduce communication costs, we redesign the parallel algorithm to (a) expose more parallelism, (b) aggressively overlap communication and computation via pipelined, asynchronous scheduling of operations, and (c) use tailored MPI collectives. To cope with limited GPU memory, we employ an offload model in which the data resides on the host and is transferred to the GPU on demand. The proposed optimizations are supported by detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOP/s on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.},
url = {https://www.osti.gov/biblio/1814306},
place = {United States},
year = {2020},
month = {6}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
