DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Optimizing High Performance Markov Clustering for Pre-Exascale Architectures

Abstract

HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication (SpGEMM) kernel. The existing parallel algorithms in HipMCL do not scale to exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of the accelerators that are increasingly common. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPU support by performing the expensive expansion phase of the MCL algorithm on GPUs. Additionally, we propose a joint CPU-GPU distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. Furthermore, we develop a new merging algorithm for incrementally processing the partial results produced by the GPUs, which improves overlap efficiency and reduces peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster, clustering a network with 70 million proteins and 68 billion connections in just under 15 minutes on 1024 nodes of ORNL's Summit supercomputer.
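The computational pattern behind these contributions is the MCL iteration itself, in which the expansion step is exactly the SpGEMM kernel the paper optimizes. The sketch below is a minimal single-node illustration in Python/SciPy, not HipMCL's distributed CPU-GPU implementation; the function names, inflation value, and pruning threshold are illustrative assumptions, not taken from the paper.

# Minimal single-node sketch of the Markov Cluster (MCL) iteration,
# intended only to show why the expansion step (a sparse matrix-sparse
# matrix product, SpGEMM) dominates each iteration. Not HipMCL's
# distributed CPU-GPU code; parameters here are illustrative.
import numpy as np
import scipy.sparse as sp

def normalize_columns(A):
    # Make each column stochastic (column sums equal 1).
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    return (A @ sp.diags(1.0 / col_sums)).tocsr()

def mcl_sketch(adj, inflation=2.0, prune=1e-4, max_iter=100):
    A = sp.csr_matrix(adj, dtype=np.float64)
    A = (A + sp.eye(A.shape[0], format="csr")).tocsr()  # add self-loops
    A = normalize_columns(A)
    for _ in range(max_iter):
        prev = A.copy()
        A = (A @ A).tocsr()             # expansion: the SpGEMM kernel
        A = A.power(inflation)          # inflation: element-wise power
        A.data[A.data < prune] = 0.0    # pruning keeps iterates sparse
        A.eliminate_zeros()
        A = normalize_columns(A)
        if abs(A - prev).max() < 1e-8:  # fixed point reached
            break
    # After convergence, nonzero rows act as attractors; the nonzero
    # column indices of each attractor row form one cluster.
    return [A.getrow(i).indices.tolist()
            for i in range(A.shape[0]) if A.getrow(i).nnz > 0]

# Example: two triangles joined by a single bridge edge.
edges = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    edges[i, j] = edges[j, i] = 1.0
print(mcl_sketch(edges))  # e.g. [[0, 1, 2], [3, 4, 5]]

In HipMCL, the A @ A expansion runs as a distributed SpGEMM (Sparse SUMMA); the paper's pipelined variant overlaps GPU computation on local blocks with communication, and its probabilistic estimator predicts the memory each multiply will need.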

Authors:
 Selvitopi, Oguz [1]; Hussain, Md Taufique [2]; Azad, Ariful [2]; Buluc, Aydin [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Indiana Univ., Bloomington, IN (United States)
Publication Date:
May 1, 2020
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1650092
Grant/Contract Number:  
AC02-05CH11231; AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Additional Journal Information:
Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2020; Conference: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA (United States), 18-22 May 2020
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, and Buluc, Aydin. Optimizing High Performance Markov Clustering for Pre-Exascale Architectures. United States: N. p., 2020. Web. doi:10.1109/ipdps47924.2020.00022.
Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, & Buluc, Aydin. Optimizing High Performance Markov Clustering for Pre-Exascale Architectures. United States. https://doi.org/10.1109/ipdps47924.2020.00022
Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, and Buluc, Aydin. 2020. "Optimizing High Performance Markov Clustering for Pre-Exascale Architectures". United States. https://doi.org/10.1109/ipdps47924.2020.00022. https://www.osti.gov/servlets/purl/1650092.
@article{osti_1650092,
title = {Optimizing High Performance Markov Clustering for Pre-Exascale Architectures},
author = {Selvitopi, Oguz and Hussain, Md Taufique and Azad, Ariful and Buluc, Aydin},
abstractNote = {HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication (SpGEMM) kernel. The existing parallel algorithms in HipMCL do not scale to exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of the accelerators that are increasingly common. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPU support by performing the expensive expansion phase of the MCL algorithm on GPUs. Additionally, we propose a joint CPU-GPU distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. Furthermore, we develop a new merging algorithm for incrementally processing the partial results produced by the GPUs, which improves overlap efficiency and reduces peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster, clustering a network with 70 million proteins and 68 billion connections in just under 15 minutes on 1024 nodes of ORNL's Summit supercomputer.},
doi = {10.1109/ipdps47924.2020.00022},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = {2020},
place = {United States},
year = {2020},
month = {May}
}

Works referenced in this record:

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018

  • Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
  • Nucleic Acids Research, Vol. 46, Issue 6
  • DOI: 10.1093/nar/gkx1313

Brief Announcement: Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication
conference, January 2015

  • Ballard, Grey; Druinsky, Alex; Knight, Nicholas
  • Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures - SPAA '15
  • DOI: 10.1145/2755573.2755613

IMG/M: integrated genome and metagenome comparative data analysis system
journal, October 2016

  • Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken
  • Nucleic Acids Research, Vol. 45, Issue D1
  • DOI: 10.1093/nar/gkw929

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication
journal, April 2018

  • Akbudak, Kadir; Selvitopi, Oguz; Aykanat, Cevdet
  • ACM Transactions on Parallel Computing, Vol. 4, Issue 3
  • DOI: 10.1145/3155292

ViennaCL – Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016

  • Rupp, Karl; Tillet, Philippe; Rudolf, Florian
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026419

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
conference, May 2014

  • Liu, Weifeng; Vinter, Brian
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.47

Performance-portable sparse matrix-matrix multiplication for many-core architectures
conference, May 2017

  • Deveci, Mehmet; Trott, Christian; Rajamanickam, Sivasankaran
  • 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2017.8

Memory-Efficient Sparse Matrix-Matrix Multiplication by Row Merging on Many-Core Architectures
journal, January 2018

  • Gremse, Felix; Küpper, Kerstin; Naumann, Uwe
  • SIAM Journal on Scientific Computing, Vol. 40, Issue 4
  • DOI: 10.1137/17M1121378

Optimizing Sparse Matrix-Matrix Multiplication for the GPU
journal, October 2015

  • Dalton, Steven; Olson, Luke; Bell, Nathan
  • ACM Transactions on Mathematical Software, Vol. 41, Issue 4
  • DOI: 10.1145/2699470

Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
conference, June 2016

  • Anh, Pham Nguyen Quang; Fan, Rui; Wen, Yonggang
  • Proceedings of the 2016 International Conference on Supercomputing
  • DOI: 10.1145/2925426.2926273

On improving performance of sparse matrix-matrix multiplication on GPUs
conference, June 2017

  • Kunchum, Rakshith; Chaudhry, Ankur; Sukumaran-Rajam, Aravind
  • Proceedings of the International Conference on Supercomputing
  • DOI: 10.1145/3079079.3079106

The Combinatorial BLAS: design, implementation, and applications
journal, May 2011

  • Buluç, Aydın; Gilbert, John R.
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
  • DOI: 10.1177/1094342011403516

A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning
journal, January 2013

  • Spielman, Daniel A.; Teng, Shang-Hua
  • SIAM Journal on Computing, Vol. 42, Issue 1
  • DOI: 10.1137/080744888

Communication optimal parallel multiplication of sparse random matrices
conference, January 2013

  • Ballard, Grey; Buluc, Aydin; Demmel, James
  • Proceedings of the 25th ACM symposium on Parallelism in algorithms and architectures - SPAA '13
  • DOI: 10.1145/2486159.2486196

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
journal, January 2012

  • Buluç, Aydin; Gilbert, John R.
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 4
  • DOI: 10.1137/110848244

A fast implementation of MLR-MCL algorithm on multi-core processors
conference, December 2014

  • Niu, Qingpeng; Lai, Pai-Wei; Faisal, S. M.
  • 2014 21st International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HiPC.2014.7116888

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
journal, January 2016

  • Azad, Ariful; Ballard, Grey; Buluç, Aydin
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 6
  • DOI: 10.1137/15M104253X

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

  • Gustavson, Fred G.
  • ACM Transactions on Mathematical Software, Vol. 4, Issue 3
  • DOI: 10.1145/355791.355796

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures
conference, January 2018

  • Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful
  • Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18
  • DOI: 10.1145/3229710.3229720

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
journal, January 2012

  • Bell, Nathan; Dalton, Steven; Olson, Luke N.
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 4
  • DOI: 10.1137/110838844

Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992

  • Gilbert, John R.; Moler, Cleve; Schreiber, Robert
  • SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
  • DOI: 10.1137/0613024