CoreCruncher: Fast and Robust Construction of Core Genomes in Large Prokaryotic Data Sets

Harris, Connor D.; Torrance, Ellis L.; Raymann, Kasie; Bobay, Louis-Marie

doi:10.1093/molbev/msaa224

Journal Article · September 4, 2020 · Molecular Biology and Evolution (Online)

DOI: https://doi.org/10.1093/molbev/msaa224 · OSTI ID: 1816371

Harris, Connor D. [1]; Torrance, Ellis L. [1]; Raymann, Kasie [1]; Bobay, Louis-Marie [1]

[1] University of North Carolina, Greensboro, NC (United States)

The core genome represents the set of genes shared by all, or nearly all, strains of a given population or species of prokaryotes. Inferring the core genome is integral to many genomic analyses; however, most methods rely on the comparison of all pairs of genomes, a step that is becoming increasingly difficult given the massive accumulation of genomic data. Here, we present CoreCruncher, a program that robustly and rapidly constructs core genomes across hundreds or thousands of genomes. CoreCruncher does not compute all pairwise genome comparisons and uses a heuristic based on the distributions of identity scores to classify sequences as orthologs or paralogs/xenologs. Although it is much faster than current methods, our results indicate that our approach is more conservative than other tools and less sensitive to the presence of paralogs and xenologs. CoreCruncher is freely available from https://github.com/lbobay/CoreCruncher. CoreCruncher is written in Python 3.7 and can also run on Python 2.7 without modification. It requires the Python library NumPy and either USEARCH or BLAST. Certain options require the programs MUSCLE or MAFFT.
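
The abstract does not spell out the heuristic, so the following is only an illustrative sketch of what classifying hits from a distribution of identity scores could look like, not CoreCruncher's actual implementation. It assumes a single pivot genome whose genes have been searched against every other genome, giving a genes-by-genomes matrix of best-hit percent identities. The function names, the use of the per-genome median, and the `max_drop` and `min_fraction` thresholds are all hypothetical choices made for this example.

```python
import numpy as np


def classify_hits(identity, max_drop=10.0):
    """Flag best hits that look like orthologs (hypothetical rule).

    identity: genes x genomes array of best-hit percent identity
              against a pivot genome, with NaN where no hit was found.
    max_drop: assumed tolerance, in identity points, below the genome's
              typical identity before a hit is treated as a likely
              paralog or xenolog.
    """
    identity = np.asarray(identity, dtype=float)
    # Typical identity of each genome to the pivot: median over its hits.
    typical = np.nanmedian(identity, axis=0)
    # Comparisons involving NaN evaluate to False, so missing hits
    # are never counted as orthologs.
    with np.errstate(invalid="ignore"):
        return identity >= (typical - max_drop)


def core_genes(identity, min_fraction=0.9, max_drop=10.0):
    """Return indices of pivot genes retained in at least
    min_fraction of the compared genomes (assumed core definition)."""
    keep = classify_hits(identity, max_drop)
    return np.flatnonzero(keep.mean(axis=1) >= min_fraction)


if __name__ == "__main__":
    scores = [
        [98.0, 97.0, 99.0],
        [97.0, 96.0, 98.0],
        [99.0, 70.0, 97.0],    # 70% is far below that genome's median
        [96.0, 95.0, np.nan],  # no hit in the third genome
    ]
    print(core_genes(scores))
```

Because every genome is compared only against the pivot, the number of searches in this sketch grows linearly with the number of genomes rather than quadratically, which is the general idea behind avoiding all pairwise comparisons.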

Research Organization: University of North Carolina, Greensboro, NC (United States)

Sponsoring Organization: USDOE; National Science Foundation (NSF); National Institute of General Medical Sciences (NIGMS)

Grant/Contract Number: DEB-1831730; R01GM132137; DEB-11930776

OSTI ID: 1816371

Journal Information: Molecular Biology and Evolution (Online), Vol. 38, Issue 2; ISSN 1537-1719

Publisher: Oxford University Press

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
59 BASIC BIOLOGICAL SCIENCES
core genome
prokaryotes
orthology