DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: PATtyFams: Protein families for the microbial genomes in the PATRIC database

Abstract

The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation, and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL). In conclusion, this new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.
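The two-stage idea in the abstract (group proteins by assigned function, then split each group with MCL) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `mcl_clusters` and `build_families`, the protein IDs, the function labels, the similarity scores, and the inflation and iteration settings are hypothetical and are not taken from the PATtyFams pipeline. The MCL routine is a small dense textbook version (expansion, inflation, column normalization), not the production MCL program.

```python
from collections import defaultdict


def mcl_clusters(sim, inflation=2.0, iterations=50, eps=1e-6):
    """Toy dense Markov Cluster (MCL) on a symmetric similarity matrix.

    Returns a sorted list of clusters, each a list of row indices.
    """
    n = len(sim)
    # Add self-loops so that every column has a nonzero sum.
    m = [[float(sim[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def normalize(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = normalize([[v ** inflation for v in row] for row in m])

    # Read clusters as connected components of the remaining nonzero flow.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def build_families(functions, similarity):
    """Group proteins by function label, then split each group with MCL.

    functions:  {protein_id: function label}   (hypothetical input format)
    similarity: {(id_a, id_b): score}          (hypothetical input format)
    """
    by_function = defaultdict(list)
    for protein_id, label in functions.items():
        by_function[label].append(protein_id)

    families = []
    for label, members in sorted(by_function.items()):
        members.sort()
        sim = [[0.0 if a == b
                else similarity.get((a, b), similarity.get((b, a), 0.0))
                for b in members] for a in members]
        for cluster in mcl_clusters(sim):
            families.append((label, [members[i] for i in cluster]))
    return families


# Made-up example: four proteins share a label but form two similarity
# blocks joined only by one weak link; a fifth protein has its own label.
functions = {"p1": "function A", "p2": "function A", "p3": "function A",
             "p4": "function A", "p5": "function B"}
similarity = {("p1", "p2"): 1.0, ("p3", "p4"): 1.0, ("p2", "p3"): 0.05}
families = build_families(functions, similarity)
```

On this toy input the sketch should place p1 and p2 in one "function A" family, p3 and p4 in a second "function A" family, and p5 alone in a "function B" family. Restricting MCL to one function group at a time keeps each similarity matrix small, which is one plausible reading of how function assignments can "rapidly guide" family formation.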

Authors:
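 Davis, James J. [1]; Gerdes, Svetlana [2]; Olsen, Gary J. [3]; Olson, Robert [1]; Pusch, Gordon D. [2]; Shukla, Maulik [1]; Vonstein, Veronika [2]; Wattam, Alice R. [4]; Yoo, Hyunseung [1]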
  1. Univ. of Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States); Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
  3. Univ. of Illinois at Urbana-Champaign, Urbana, IL (United States)
  4. Virginia Bioinformatics Institute, Virginia Tech University, Blacksburg, VA (United States)
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1248167
Grant/Contract Number:  
AC02-06CH11357; NNA13AA91A
Resource Type:
Accepted Manuscript
Journal Name:
Frontiers in Microbiology
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 1664-302X
Publisher:
Frontiers Research Foundation
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; genome annotation; comparative genomics; metabolic modeling; FIGfams; RAST

Citation Formats

Davis, James J., Gerdes, Svetlana, Olsen, Gary J., Olson, Robert, Pusch, Gordon D., Shukla, Maulik, Vonstein, Veronika, Wattam, Alice R., and Yoo, Hyunseung. PATtyFams: Protein families for the microbial genomes in the PATRIC database. United States: N. p., 2016. Web. doi:10.3389/fmicb.2016.00118.
Davis, James J., Gerdes, Svetlana, Olsen, Gary J., Olson, Robert, Pusch, Gordon D., Shukla, Maulik, Vonstein, Veronika, Wattam, Alice R., & Yoo, Hyunseung. PATtyFams: Protein families for the microbial genomes in the PATRIC database. United States. doi:10.3389/fmicb.2016.00118.
Davis, James J., Gerdes, Svetlana, Olsen, Gary J., Olson, Robert, Pusch, Gordon D., Shukla, Maulik, Vonstein, Veronika, Wattam, Alice R., and Yoo, Hyunseung. 2016. "PATtyFams: Protein families for the microbial genomes in the PATRIC database". United States. doi:10.3389/fmicb.2016.00118. https://www.osti.gov/servlets/purl/1248167.
@article{osti_1248167,
title = {PATtyFams: Protein families for the microbial genomes in the PATRIC database},
author = {Davis, James J. and Gerdes, Svetlana and Olsen, Gary J. and Olson, Robert and Pusch, Gordon D. and Shukla, Maulik and Vonstein, Veronika and Wattam, Alice R. and Yoo, Hyunseung},
abstractNote = {The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation, and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL). In conclusion, this new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.},
doi = {10.3389/fmicb.2016.00118},
journal = {Frontiers in Microbiology},
volume = 7,
place = {United States},
year = {2016},
month = {2}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science

Works referencing / citing this record:

Trait-based analysis of the human skin microbiome
journal, July 2019