OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Optimizing High Performance Markov Clustering for Pre-Exascale Architectures

Journal Article · Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
 [1];  [2];  [2];  [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Indiana Univ., Bloomington, IN (United States)

HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL do not scale to exascale architectures: their communication costs dominate the runtime at large concurrencies, and they cannot take advantage of the increasingly popular accelerators. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPUs by performing the expensive expansion phase of the MCL algorithm on the GPU. Additionally, we propose a CPU-GPU joint distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. Furthermore, we develop a new merging algorithm for the incremental processing of partial results produced by the GPUs, which improves the overlap efficiency and the peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster, clustering a network with 70 million proteins and 68 billion connections in just under 15 minutes on 1024 nodes of ORNL's Summit supercomputer.
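The MCL iteration the abstract refers to alternates an expansion step (a sparse matrix-matrix product, the SpGEMM kernel HipMCL parallelizes) with an inflation-and-pruning step on a column-stochastic matrix. A minimal pure-Python sketch follows; the dict-of-dicts sparse format, the inflation parameter r = 2, and the pruning threshold are illustrative choices for a toy graph, not HipMCL's actual data structures or defaults.

```python
# Sketch of the serial MCL iteration: expansion (SpGEMM) + inflation + pruning.
# Columns are stored as {column: {row: value}} dicts and kept column-stochastic.

def normalize(cols):
    # Scale each column so its entries sum to 1 (column-stochastic).
    return {j: {i: v / s for i, v in col.items()}
            for j, col in cols.items()
            for s in [sum(col.values())]}

def expand(cols):
    # Expansion: M * M via a column-wise SpGEMM. Column j of the product is a
    # sparse linear combination of the columns of M selected by column j.
    out = {}
    for j, col in cols.items():
        acc = {}
        for k, vkj in col.items():
            for i, vik in cols[k].items():
                acc[i] = acc.get(i, 0.0) + vik * vkj
        out[j] = acc
    return out

def inflate(cols, r=2.0, prune=1e-4):
    # Inflation: raise entries to the power r, drop tiny entries, renormalize.
    powered = {j: {i: v ** r for i, v in col.items() if v ** r > prune}
               for j, col in cols.items()}
    return normalize(powered)

def mcl(adj, r=2.0, iters=20):
    m = normalize(adj)
    for _ in range(iters):
        m = inflate(expand(m), r)
    # In the limit, columns of nodes in one cluster converge to (nearly) the
    # same vector; label each node by the smallest heavy entry of its column.
    clusters = {}
    for j, col in m.items():
        top = max(col.values())
        label = min(i for i, v in col.items() if v >= 0.99 * top)
        clusters.setdefault(label, set()).add(j)
    return sorted(sorted(c) for c in clusters.values())

# Toy network: two triangles joined by a single bridge edge, plus self-loops.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {i: {i: 1.0} for i in range(6)}
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0
print(mcl(adj))
```

Inflation amplifies strong transition probabilities and starves weak ones, so the bridge edge dies off and the two triangles emerge as separate clusters. The naive dense-accumulator `expand` above is exactly the quadratic-cost hotspot that motivates the GPU offload and pipelined Sparse SUMMA described in the abstract.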

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231; AC05-00OR22725
OSTI ID:
1650092
Journal Information:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vol. 2020; Conference: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA (United States), 18-22 May 2020
Publisher:
IEEE
Country of Publication:
United States
Language:
English


Similar Records

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
Journal Article · January 2018 · Nucleic Acids Research

GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs
Journal Article · April 2022 · IEEE Transactions on Parallel and Distributed Systems

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)
Technical Report · November 2019
