OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

Abstract

Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.

Authors:
Azad, Ariful [1]; Pavlopoulos, Georgios A. [2]; Ouzounis, Christos A. [3]; Kyrpides, Nikos C. [2]; Buluc, Aydin [4]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
  2. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
  3. Centre for Research & Technology Hellas, Thessalonica (Greece). Biological Computation & Process Lab. Chemical Process & Energy Resources Inst.
  4. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division; Univ. of California, Berkeley, CA (United States). Dept. of Electrical Engineering and Computer Sciences
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1439241
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Nucleic Acids Research
Additional Journal Information:
Journal Volume: 46; Journal Issue: 6; Journal ID: ISSN 0305-1048
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; computational methods; genomics

Citation Formats

Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States: N. p., 2018. Web. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., & Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. 2018. "HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks". United States. doi:10.1093/nar/gkx1313. https://www.osti.gov/servlets/purl/1439241.
@article{osti_1439241,
title = {HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks},
author = {Azad, Ariful and Pavlopoulos, Georgios A. and Ouzounis, Christos A. and Kyrpides, Nikos C. and Buluc, Aydin},
doi = {10.1093/nar/gkx1313},
journal = {Nucleic Acids Research},
issn = {0305-1048},
number = 6,
volume = 46,
place = {United States},
year = {2018},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 5 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: An example of expansion and pruning of b (= 2) columns of a column-stochastic matrix A. Non-zero entries are shown with filled circles. Here, A$_b$ is a submatrix of A consisting of all N rows and the b (= 2) columns that are currently being expanded. The product A × A$_b$ is computed and pruned to obtain the final result for these b columns. Parts of matrices that are active in the current expansion are shown in darker shades. For comparison, MCL sets b to 1. HipMCL dynamically selects a large value for b from the range [1, N] such that the expanded columns of A$^2$ do not overflow memory. When these columns are expanded and pruned, the computation moves to the next set of b columns.
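
The batched expansion and pruning described in the Figure 1 caption can be illustrated with a small single-process sketch. This is not HipMCL's implementation, which is distributed and built on MPI and OpenMP. The function names, the dictionary-per-column storage, the fixed batch size `b`, and the simple threshold pruning with renormalization are all assumptions made for illustration. The caption says HipMCL selects b dynamically, and neither it nor the abstract specifies the pruning rule.

```python
from typing import Dict, Iterable, List

# Assumed storage: cols[j] maps row index -> value for column j of A.
SparseCols = List[Dict[int, float]]


def expand_columns(cols: SparseCols, batch: Iterable[int]) -> Dict[int, Dict[int, float]]:
    """Compute only the columns in `batch` of the product A * A."""
    out: Dict[int, Dict[int, float]] = {}
    for j in batch:
        acc: Dict[int, float] = {}
        # Column j of A*A is a sum of columns of A weighted by the entries of A[:, j].
        for k, a_kj in cols[j].items():
            for i, a_ik in cols[k].items():
                acc[i] = acc.get(i, 0.0) + a_ik * a_kj
        out[j] = acc
    return out


def prune_column(col: Dict[int, float], threshold: float) -> Dict[int, float]:
    """Drop entries below `threshold` (assumed rule), then rescale to sum to 1."""
    kept = {i: v for i, v in col.items() if v >= threshold}
    total = sum(kept.values())
    return {i: v / total for i, v in kept.items()} if total > 0 else {}


def expand_and_prune(cols: SparseCols, b: int, threshold: float) -> SparseCols:
    """Expand and prune A*A in batches of b columns at a time."""
    n = len(cols)
    result: SparseCols = [dict() for _ in range(n)]
    for start in range(0, n, b):
        batch = range(start, min(start + b, n))
        # Only the unpruned columns of this batch are held in memory at once.
        expanded = expand_columns(cols, batch)
        for j, col in expanded.items():
            result[j] = prune_column(col, threshold)
    return result


# A tiny 3 x 3 column-stochastic matrix (each column sums to 1).
A = [{0: 0.6, 1: 0.4}, {0: 0.3, 1: 0.6, 2: 0.1}, {1: 0.2, 2: 0.8}]
pruned = expand_and_prune(A, b=2, threshold=0.05)
```

In this sketch the batch size changes only how many unpruned columns are held at once, not the result. This mirrors the trade-off in the caption: a larger b processes more columns per step, at the cost of more memory for the intermediate product.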


Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.