DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Optimizing High Performance Markov Clustering for Pre-Exascale Architectures

Abstract

HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication (SpGEMM) kernel. The existing parallel algorithms in HipMCL do not scale to exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of the accelerators that are increasingly common. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPU support by performing the expensive expansion phase of the MCL algorithm on GPUs. Additionally, we propose a joint CPU-GPU distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. Furthermore, we develop a new merging algorithm for incrementally processing the partial results produced by the GPUs, which improves overlap efficiency and reduces peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster, clustering a network with 70 million proteins and 68 billion connections in just under 15 minutes on 1024 nodes of ORNL's Summit supercomputer.
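The computational pattern behind these contributions is the MCL iteration itself, in which the expansion step is exactly the SpGEMM kernel the paper optimizes. The sketch below is a minimal single-node illustration in Python/SciPy, not HipMCL's distributed CPU-GPU implementation; the function names, inflation value, and pruning threshold are illustrative assumptions, not taken from the paper.

# Minimal single-node sketch of the Markov Cluster (MCL) iteration,
# intended only to show why the expansion step (a sparse matrix-sparse
# matrix product, SpGEMM) dominates each iteration. Not HipMCL's
# distributed CPU-GPU code; parameters here are illustrative.
import numpy as np
import scipy.sparse as sp

def normalize_columns(A):
    # Make each column stochastic (column sums equal 1).
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    return (A @ sp.diags(1.0 / col_sums)).tocsr()

def mcl_sketch(adj, inflation=2.0, prune=1e-4, max_iter=100):
    A = sp.csr_matrix(adj, dtype=np.float64)
    A = (A + sp.eye(A.shape[0], format="csr")).tocsr()  # add self-loops
    A = normalize_columns(A)
    for _ in range(max_iter):
        prev = A.copy()
        A = (A @ A).tocsr()             # expansion: the SpGEMM kernel
        A = A.power(inflation)          # inflation: element-wise power
        A.data[A.data < prune] = 0.0    # pruning keeps iterates sparse
        A.eliminate_zeros()
        A = normalize_columns(A)
        if abs(A - prev).max() < 1e-8:  # fixed point reached
            break
    # After convergence, nonzero rows act as attractors; the nonzero
    # column indices of each attractor row form one cluster.
    return [A.getrow(i).indices.tolist()
            for i in range(A.shape[0]) if A.getrow(i).nnz > 0]

# Example: two triangles joined by a single bridge edge.
edges = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    edges[i, j] = edges[j, i] = 1.0
print(mcl_sketch(edges))  # e.g. [[0, 1, 2], [3, 4, 5]]

In HipMCL, the A @ A expansion runs as a distributed SpGEMM (Sparse SUMMA); the paper's pipelined variant overlaps GPU computation on local blocks with communication, and its probabilistic estimator predicts the memory each multiply will need.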

Authors:
 Selvitopi, Oguz [1]; Hussain, Md Taufique [2]; Azad, Ariful [2]; Buluc, Aydin [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Indiana Univ., Bloomington, IN (United States)
Publication Date:
May 1, 2020
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1650092
Grant/Contract Number:  
AC02-05CH11231; AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Additional Journal Information:
Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2020; Conference: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA (United States), 18-22 May 2020
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, and Buluc, Aydin. Optimizing High Performance Markov Clustering for Pre-Exascale Architectures. United States: N. p., 2020. Web. doi:10.1109/ipdps47924.2020.00022.
Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, & Buluc, Aydin. Optimizing High Performance Markov Clustering for Pre-Exascale Architectures. United States. https://doi.org/10.1109/ipdps47924.2020.00022
Selvitopi, Oguz, Hussain, Md Taufique, Azad, Ariful, and Buluc, Aydin. 2020. "Optimizing High Performance Markov Clustering for Pre-Exascale Architectures". United States. https://doi.org/10.1109/ipdps47924.2020.00022. https://www.osti.gov/servlets/purl/1650092.
@article{osti_1650092,
title = {Optimizing High Performance Markov Clustering for Pre-Exascale Architectures},
author = {Selvitopi, Oguz and Hussain, Md Taufique and Azad, Ariful and Buluc, Aydin},
abstractNote = {HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication (SpGEMM) kernel. The existing parallel algorithms in HipMCL do not scale to exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of the accelerators that are increasingly common. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPU support by performing the expensive expansion phase of the MCL algorithm on GPUs. Additionally, we propose a joint CPU-GPU distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. Furthermore, we develop a new merging algorithm for incrementally processing the partial results produced by the GPUs, which improves overlap efficiency and reduces peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster, clustering a network with 70 million proteins and 68 billion connections in just under 15 minutes on 1024 nodes of ORNL's Summit supercomputer.},
doi = {10.1109/ipdps47924.2020.00022},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = {2020},
place = {United States},
year = {2020},
month = {May}
}

Works referenced in this record:

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018

  • Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
  • Nucleic Acids Research, Vol. 46, Issue 6
  • DOI: 10.1093/nar/gkx1313

Brief Announcement: Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication
conference, January 2015

  • Ballard, Grey; Druinsky, Alex; Knight, Nicholas
  • Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures - SPAA '15
  • DOI: 10.1145/2755573.2755613

IMG/M: integrated genome and metagenome comparative data analysis system
journal, October 2016

  • Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken
  • Nucleic Acids Research, Vol. 45, Issue D1
  • DOI: 10.1093/nar/gkw929

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication
journal, April 2018

  • Akbudak, Kadir; Selvitopi, Oguz; Aykanat, Cevdet
  • ACM Transactions on Parallel Computing, Vol. 4, Issue 3
  • DOI: 10.1145/3155292

ViennaCL – Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016

  • Rupp, Karl; Tillet, Philippe; Rudolf, Florian
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026419

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
conference, May 2014

  • Liu, Weifeng; Vinter, Brian
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.47

Performance-portable sparse matrix-matrix multiplication for many-core architectures
conference, May 2017

  • Deveci, Mehmet; Trott, Christian; Rajamanickam, Sivasankaran
  • 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2017.8

Memory-Efficient Sparse Matrix-Matrix Multiplication by Row Merging on Many-Core Architectures
journal, January 2018

  • Gremse, Felix; Küpper, Kerstin; Naumann, Uwe
  • SIAM Journal on Scientific Computing, Vol. 40, Issue 4
  • DOI: 10.1137/17M1121378

Optimizing Sparse Matrix-Matrix Multiplication for the GPU
journal, October 2015

  • Dalton, Steven; Olson, Luke; Bell, Nathan
  • ACM Transactions on Mathematical Software, Vol. 41, Issue 4
  • DOI: 10.1145/2699470

Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
conference, June 2016

  • Anh, Pham Nguyen Quang; Fan, Rui; Wen, Yonggang
  • Proceedings of the 2016 International Conference on Supercomputing
  • DOI: 10.1145/2925426.2926273

On improving performance of sparse matrix-matrix multiplication on GPUs
conference, June 2017

  • Kunchum, Rakshith; Chaudhry, Ankur; Sukumaran-Rajam, Aravind
  • Proceedings of the International Conference on Supercomputing
  • DOI: 10.1145/3079079.3079106

The Combinatorial BLAS: design, implementation, and applications
journal, May 2011

  • Buluç, Aydın; Gilbert, John R.
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
  • DOI: 10.1177/1094342011403516

A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning
journal, January 2013

  • Spielman, Daniel A.; Teng, Shang-Hua
  • SIAM Journal on Computing, Vol. 42, Issue 1
  • DOI: 10.1137/080744888

Communication optimal parallel multiplication of sparse random matrices
conference, January 2013

  • Ballard, Grey; Buluc, Aydin; Demmel, James
  • Proceedings of the 25th ACM symposium on Parallelism in algorithms and architectures - SPAA '13
  • DOI: 10.1145/2486159.2486196

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
journal, January 2012

  • Buluç, Aydin; Gilbert, John R.
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 4
  • DOI: 10.1137/110848244

A fast implementation of MLR-MCL algorithm on multi-core processors
conference, December 2014

  • Niu, Qingpeng; Lai, Pai-Wei; Faisal, S. M.
  • 2014 21st International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HiPC.2014.7116888

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
journal, January 2016

  • Azad, Ariful; Ballard, Grey; Buluç, Aydin
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 6
  • DOI: 10.1137/15M104253X

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

  • Gustavson, Fred G.
  • ACM Transactions on Mathematical Software, Vol. 4, Issue 3
  • DOI: 10.1145/355791.355796

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures
conference, January 2018

  • Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful
  • Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18
  • DOI: 10.1145/3229710.3229720

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
journal, January 2012

  • Bell, Nathan; Dalton, Steven; Olson, Luke N.
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 4
  • DOI: 10.1137/110838844

Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992

  • Gilbert, John R.; Moler, Cleve; Schreiber, Robert
  • SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
  • DOI: 10.1137/0613024