PATtyFams: Protein families for the microbial genomes in the PATRIC database
Abstract
Building accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation, and metabolic modeling. For several years we have maintained protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools available through the PATRIC website. However, because the number of genomes is growing so quickly, traditional approaches for generating protein families are becoming prohibitively expensive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using the Markov Cluster algorithm (MCL). In conclusion, this new approach for generating protein families is rapid and scalable, and has properties that are consistent with alignment-based methods.
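The two-stage idea in the abstract (group proteins by assigned function, then split each group into families with MCL) can be illustrated with a toy sketch. This is not the PATtyFams implementation: the function labels are supplied by hand instead of coming from RAST, the similarity score is a Jaccard index over shared 3-mers chosen only for illustration, and every name (`build_families`, `mcl`, `similarity`, the example proteins) is hypothetical.

```python
from collections import defaultdict

def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard index of k-mer sets (an assumed stand-in for a real score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def normalize_columns(m):
    """Scale each column in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m

def mcl(sim, inflation=2.0, iterations=50, eps=1e-6):
    """Minimal Markov clustering over a square similarity matrix."""
    n = len(sim)
    # Add self-loops on the diagonal, then make the matrix column-stochastic.
    m = [[1.0 if i == j else sim[i][j] for j in range(n)] for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][x] * m[x][j] for x in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalize columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row with nonzero entries is an attractor; its support is a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

def build_families(proteins):
    """proteins: iterable of (protein_id, function_label, sequence)."""
    by_function = defaultdict(list)
    for pid, function, seq in proteins:
        by_function[function].append((pid, seq))
    families = []
    for function, members in sorted(by_function.items()):
        seqs = [seq for _, seq in members]
        sim = [[similarity(a, b) for b in seqs] for a in seqs]
        for cluster in mcl(sim):
            families.append((function, [members[i][0] for i in cluster]))
    return families

# Hypothetical toy input: one function with two divergent groups, one singleton.
example = [
    ("p1", "DNA polymerase I", "MKVLAAGT"),
    ("p2", "DNA polymerase I", "MKVLAAGS"),
    ("p3", "DNA polymerase I", "QWERTYPC"),
    ("p4", "DNA polymerase I", "QWERTYPD"),
    ("p5", "Recombinase A", "ACDEFGHI"),
]
families = build_families(example)
```

In this toy input the first stage puts p1 through p4 into one function-based group, and the clustering stage then separates the two sequence-dissimilar pairs into distinct families. Grouping by function first keeps each similarity matrix small, which is the property the abstract credits for scalability.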
- Authors:
- Univ. of Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
- Argonne National Lab. (ANL), Argonne, IL (United States); Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
- Univ. of Illinois at Urbana-Champaign, Urbana, IL (United States)
- Virginia Bioinformatics Institute, Virginia Tech University, Blacksburg, VA (United States)
- Publication Date:
- Research Org.:
- Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1248167
- Grant/Contract Number:
- AC02-06CH11357; NNA13AA91A
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Frontiers in Microbiology
- Additional Journal Information:
- Journal Volume: 7; Journal ID: ISSN 1664-302X
- Publisher:
- Frontiers Research Foundation
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; genome annotation; comparative genomics; metabolic modeling; FIGfams; RAST
Citation Formats
Davis, James J., Gerdes, Svetlana, Olsen, Gary J., Olson, Robert, Pusch, Gordon D., Shukla, Maulik, Vonstein, Veronika, Wattam, Alice R., and Yoo, Hyunseung. PATtyFams: Protein families for the microbial genomes in the PATRIC database. United States: N. p., 2016. Web. doi:10.3389/fmicb.2016.00118.
Davis, James J., Gerdes, Svetlana, Olsen, Gary J., Olson, Robert, Pusch, Gordon D., Shukla, Maulik, Vonstein, Veronika, Wattam, Alice R., and Yoo, Hyunseung. "PATtyFams: Protein families for the microbial genomes in the PATRIC database". United States. doi:10.3389/fmicb.2016.00118. https://www.osti.gov/servlets/purl/1248167.
@article{osti_1248167,
title = {PATtyFams: Protein families for the microbial genomes in the PATRIC database},
author = {Davis, James J. and Gerdes, Svetlana and Olsen, Gary J. and Olson, Robert and Pusch, Gordon D. and Shukla, Maulik and Vonstein, Veronika and Wattam, Alice R. and Yoo, Hyunseung},
doi = {10.3389/fmicb.2016.00118},
journal = {Frontiers in Microbiology},
volume = 7,
place = {United States},
year = {2016},
month = {2}
}