U.S. Department of Energy
Office of Scientific and Technical Information

Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Journal Article · Microorganisms
 [1];  [2];  [3];  [3];  [4]
  1. San Diego State Univ., CA (United States). Computational Sciences Research Center; OSTI
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States). Global Security Computing Applications
  3. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States). Biological Sciences Research Division
  4. San Diego State Univ., CA (United States). Computational Sciences Research Center; Flinders Univ., Adelaide, SA (Australia). College of Science and Engineering

A key step in prokaryotic gene finding is determining which open reading frames (ORFs) encode a protein and which occur by chance alone. Many methods exist to differentiate the two; the most prevalent relies on homology to a database of known genes. This approach has many pitfalls, most notably that it can only find genes similar to those already seen. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within the set of all possible ORFs. Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP) from them, clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS had the highest accuracy (0.94) and the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
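
The workflow in the abstract (amino acid frequencies, EDP, clustering, lowest-variation cluster) can be sketched roughly as follows. This is an illustrative sketch and not the GOODORFS implementation. It assumes the EDP is defined as s_i = -(p_i log p_i) / H, where p_i is the frequency of amino acid i and H is the Shannon entropy of the frequencies; it replaces a library KMeans with a small hand-written Lloyd's algorithm so the example needs only the standard library; and it measures "variation" as the mean squared distance to the cluster centre. The function names, the cluster count `k`, and the `min_size` guard are all hypothetical choices made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def edp(protein):
    """Entropy density profile of a translated ORF (assumed definition)."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    # Per-residue entropy contribution: -p * log(p), with 0 * log(0) taken as 0.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:  # only one kind of amino acid present
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; initial centres are evenly spaced input points."""
    step = max(1, len(points) // k)
    centres = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                  for p in points]
        # Recompute each centre; an empty cluster keeps its old centre.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centres.append(mean(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def total_variance(points):
    """Mean squared distance to the centre (assumed measure of 'variation')."""
    centre = mean(points)
    return sum(sq_dist(p, centre) for p in points) / len(points)


def select_coding_cluster(proteins, k=3, min_size=2):
    """Return indices of the ORFs in the lowest-variation cluster.

    Expects at least k proteins. Clusters smaller than min_size are skipped
    (hypothetical guard: a single-member cluster trivially has zero variance).
    """
    profiles = [edp(p) for p in proteins]
    labels = kmeans(profiles, k)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    candidates = [c for c in clusters.values() if len(c) >= min_size]
    if not candidates:
        candidates = list(clusters.values())
    return min(candidates,
               key=lambda idxs: total_variance([profiles[i] for i in idxs]))
```

Calling `select_coding_cluster(translated_orfs)` on a list of translated ORF sequences returns the positions of the ORFs that this sketch would treat as coding. In practice a library clustering routine and a tuned number of clusters would likely replace the hand-written pieces.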

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1815658
Alternate ID(s):
OSTI ID: 2008164
Journal Information:
Journal Name: Microorganisms; Journal Volume: 9; Journal Issue: 1; ISSN 2076-2607; ISSN MICRKN
Publisher:
MDPI
Country of Publication:
United States
Language:
English
