OSTI.GOV
U.S. Department of Energy
Office of Scientific and Technical Information

Title: HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

Abstract

Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
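The abstract names the MCL algorithm without spelling out its iteration. As a rough illustration only, the sketch below shows the textbook serial loop that a parallel implementation would have to distribute: expansion (squaring a column-stochastic matrix), inflation (an elementwise power followed by column normalization), and pruning of tiny entries. This is not HipMCL's code. The function name `mcl`, its parameters, and their default values are hypothetical, and the dense NumPy matrix is an assumption made for brevity; it would not scale to the network sizes quoted above.

```python
import numpy as np


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-6, prune_below=1e-8):
    """Dense, single-process sketch of Markov clustering (hypothetical helper)."""
    A = np.asarray(adjacency, dtype=float)
    # Add self-loops so that every column has nonzero mass, then make the
    # matrix column-stochastic (each column sums to 1).
    M = A + np.eye(A.shape[0])
    M /= M.sum(axis=0, keepdims=True)

    for _ in range(max_iter):
        prev = M
        M = M @ M                    # expansion: take two random-walk steps
        M = np.power(M, inflation)   # inflation: strengthen strong transitions
        # Pruning: drop negligible entries. The threshold is an assumed value
        # chosen to be small enough that no column of a small graph empties.
        M[M < prune_below] = 0.0
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:
            break

    # Each nonzero row belongs to an attractor; its nonzero columns are the
    # nodes drawn to it. Identical rows collapse into a single cluster.
    clusters = {frozenset(np.flatnonzero(row > 0).tolist()) for row in M if row.any()}
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy graph (assumed for illustration): two triangles joined by one edge.
    triangle = np.ones((3, 3)) - np.eye(3)
    graph = np.block([[triangle, np.zeros((3, 3))],
                      [np.zeros((3, 3)), triangle]])
    graph[2, 3] = graph[3, 2] = 1.0
    print(mcl(graph))
```

The expansion step is a matrix-matrix product, so on large sparse networks it would plausibly dominate both running time and memory, which is consistent with the bottleneck the abstract describes.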

Authors:
Azad, Ariful [1]; Pavlopoulos, Georgios A. [2]; Ouzounis, Christos A. [3]; Kyrpides, Nikos C. [2]; Buluc, Aydin [4]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
  2. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
  3. Centre for Research & Technology Hellas, Thessalonica (Greece). Biological Computation & Process Lab. Chemical Process & Energy Resources Inst.
  4. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division; Univ. of California, Berkeley, CA (United States). Dept. of Electrical Engineering and Computer Sciences
Publication Date:
January 5, 2018
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1439241
Grant/Contract Number:
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Nucleic Acids Research
Additional Journal Information:
Journal Volume: 46; Journal Issue: 6; Journal ID: ISSN 0305-1048
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; computational methods; genomics

Citation Formats

Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States: N. p., 2018. Web. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., & Buluc, Aydin. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. United States. doi:10.1093/nar/gkx1313.
Azad, Ariful, Pavlopoulos, Georgios A., Ouzounis, Christos A., Kyrpides, Nikos C., and Buluc, Aydin. 2018. "HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks". United States. doi:10.1093/nar/gkx1313.
@article{osti_1439241,
title = {HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks},
author = {Azad, Ariful and Pavlopoulos, Georgios A. and Ouzounis, Christos A. and Kyrpides, Nikos C. and Buluc, Aydin},
doi = {10.1093/nar/gkx1313},
journal = {Nucleic Acids Research},
number = 6,
volume = 46,
place = {United States},
year = {2018},
month = jan
}

Journal Article:
Free Publicly Available Full Text
This content will become publicly available on January 5, 2019
Publisher's Version of Record