U.S. Department of Energy
Office of Scientific and Technical Information

FastBLAST: Homology Relationships for Millions of Proteins

Journal Article · PLoS ONE
 [1];  [2];  [3]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Physical Biosciences Division; Virtual Inst. for Microbial Stress and Survival, Berkeley, CA (United States); DOE/OSTI
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Physical Biosciences Division; Virtual Inst. for Microbial Stress and Survival, Berkeley, CA (United States)
  3. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Physical Biosciences Division; Virtual Inst. for Microbial Stress and Survival, Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States). Dept. of Bioengineering

Background: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.

Methodology/Principal Findings: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.

Conclusions/Significance: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.
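The abstract's central idea, that existing family alignments narrow the set of sequence pairs worth comparing, can be illustrated with a toy sketch. This is not the FastBLAST implementation: the function names, the `MEMBERSHIPS` table, and the protein and family identifiers are all hypothetical, and the sketch only selects candidate pairs from shared family membership. It does not cluster sequences, discover new families, or compute alignments.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: protein id -> families it aligns to.
# In practice such assignments would come from tools like PSI-BLAST or HMMer.
MEMBERSHIPS = {
    "P1": {"famA"},
    "P2": {"famA", "famB"},
    "P3": {"famB"},
    "P4": {"famC"},
    "P5": set(),  # no known family
}


def build_family_index(memberships):
    """Invert the membership table: family id -> set of member proteins."""
    index = defaultdict(set)
    for protein, families in memberships.items():
        for family in families:
            index[family].add(protein)
    return index


def candidate_homologs(query, memberships, index):
    """Return proteins that share at least one family with the query.

    Only these candidates would be passed on to pairwise alignment.
    """
    candidates = set()
    for family in memberships.get(query, ()):
        candidates |= index[family]
    candidates.discard(query)
    return candidates


def candidate_pairs(index):
    """Collect every unordered pair of proteins that co-occur in a family."""
    pairs = set()
    for members in index.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def all_vs_all_pair_count(n):
    """Number of unordered pairs an exhaustive all-versus-all search considers."""
    return n * (n - 1) // 2
```

On this toy table, the family index yields two candidate pairs, whereas an exhaustive comparison of five proteins would consider ten. Proteins with no family assignment, such as the hypothetical `P5`, get no candidates here; the abstract indicates that FastBLAST's first stage addresses this by identifying additional families.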

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1627360
Journal Information:
Journal Name: PLoS ONE; Journal Volume: 3; Journal Issue: 10; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
