HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
Abstract
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks and protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability on large datasets remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
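For readers unfamiliar with MCL, the following is a minimal, single-process sketch of the basic iteration (expansion by matrix squaring, then inflation by element-wise powering and column renormalization). It is an illustration only and does not reflect HipMCL's distributed-memory MPI/OpenMP implementation: the dense list-of-lists matrix, the function names, the default inflation value of 2.0, the convergence tolerance, and the way clusters are read from non-zero rows are all assumptions made for this example.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (a longer random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise each entry to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50, tol=1e-9):
    """Toy MCL on a dense adjacency matrix (hypothetical helper, not HipMCL)."""
    n = len(adjacency)
    # Add self-loops, as is customary for MCL, then make columns stochastic.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row that still has non-zero entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two disjoint triangles: nodes 0-2 and nodes 3-5.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(graph))  # prints [[0, 1, 2], [3, 4, 5]]
```

The dense expansion step costs cubic time in the number of nodes, which gives a rough sense of why the running time and memory demands mentioned in the abstract become a bottleneck at the scale of millions of nodes.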
- Authors:
- Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; Kyrpides, Nikos C.; Buluc, Aydin
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
- USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
- Centre for Research & Technology Hellas, Thessalonica (Greece). Biological Computation & Process Lab. Chemical Process & Energy Resources Inst.
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division; Univ. of California, Berkeley, CA (United States). Dept. of Electrical Engineering and Computer Sciences
- Publication Date:
- January 2018
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1439241
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Nucleic Acids Research
- Additional Journal Information:
- Journal Volume: 46; Journal Issue: 6; Journal ID: ISSN 0305-1048
- Publisher:
- Oxford University Press
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; computational methods; genomics
Citation Formats
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States: N. p., 2018.
Web. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., & Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. 2018. "HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks". United States. doi:10.1093/nar/gkx1313. https://www.osti.gov/servlets/purl/1439241.
@article{osti_1439241,
title = {HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks},
author = {Azad, Ariful and Pavlopoulos, Georgios A. and Ouzounis, Christos A. and Kyrpides, Nikos C. and Buluc, Aydin},
doi = {10.1093/nar/gkx1313},
journal = {Nucleic Acids Research},
number = 6,
volume = 46,
place = {United States},
year = {2018},
month = {1}
}
Works referencing / citing this record:
Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication
journal, January 2019
- Liu, Junhong; He, Xin; Liu, Weifeng
- International Journal of Parallel Programming, Vol. 47, Issue 3
RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets
journal, July 2019
- de Lima Nichio, Bruno Thiago; de Oliveira, Aryel Marlus Repula; de Pierri, Camilla Reginatto
- BMC Bioinformatics, Vol. 20, Issue 1
Self-analysis of repeat proteins reveals evolutionarily conserved patterns
journal, May 2020
- Merski, Matthew; Młynarczyk, Krzysztof; Ludwiczak, Jan
- BMC Bioinformatics, Vol. 21, Issue 1
A novel parallel Markov clustering method in biological interaction network analysis under multi-GPU computing environment
journal, February 2020
- Fu, You; Zhou, Wei
- The Journal of Supercomputing, Vol. 76, Issue 10
The parallelism motifs of genomic data analysis
journal, January 2020
- Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
- Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166