Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

McNair, Katelyn; Ecale Zhou, Carol L.; Souza, Brian; Malfatti, Stephanie; Edwards, Robert A.

doi:10.3390/microorganisms9010129

Journal Article · Thu Jan 07 23:00:00 EST 2021 · Microorganisms

DOI: https://doi.org/10.3390/microorganisms9010129 · OSTI ID: 1815658

McNair, Katelyn ^[1]; Ecale Zhou, Carol L. ^[2]; Souza, Brian ^[3]; Malfatti, Stephanie ^[3]; Edwards, Robert A. ^[4]

[1] San Diego State Univ., CA (United States). Computational Sciences Research Center; OSTI
[2] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States). Global Security Computing Applications
[3] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States). Biological Sciences Research Division
[4] San Diego State Univ., CA (United States). Computational Sciences Research Center; Flinders Univ., Adelaide, SA (Australia). College of Science and Engineering

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using homology to a database of known genes. This method presents many pitfalls, most notably that it can only find genes resembling those that have been seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model. To increase the accuracy of such tools, we present GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP) from them, clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
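
The workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GOODORFS implementation. It assumes the EDP definition used by MED, s_i = -(1/H) p_i log p_i, where p_i are the amino acid frequencies and H is their Shannon entropy. It also assumes that "variation" means the mean squared distance of cluster members to their centroid, which the abstract does not specify. Cluster labels are taken as given (they could come from any k-means implementation), and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(protein):
    """Return the 20 amino acid frequencies of a translated ORF."""
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def entropy_density_profile(freqs):
    """Assumed EDP: s_i = -(1/H) * p_i * log(p_i), H = Shannon entropy."""
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type present: the profile is undefined, use zeros.
        return [0.0] * len(freqs)
    return [t / entropy for t in terms]


def lowest_variation_cluster(profiles, labels):
    """Return the label whose members lie closest, on average, to their centroid.

    "Variation" here is an assumption: mean squared Euclidean distance
    of the cluster members to the cluster centroid.
    """
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)

    def spread(members):
        centroid = [sum(col) / len(members) for col in zip(*members)]
        return sum(
            sum((x - c) ** 2 for x, c in zip(m, centroid)) for m in members
        ) / len(members)

    return min(groups, key=lambda label: spread(groups[label]))
```

In use, each candidate ORF would be translated, passed through `amino_acid_frequencies` and `entropy_density_profile`, and the resulting 20-dimensional vectors clustered; the ORFs in the cluster returned by `lowest_variation_cluster` would then be treated as the putative coding set.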

Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

Grant/Contract Number: AC52-07NA27344

OSTI ID: 1815658

Alternate ID(s): OSTI ID: 2008164

Journal Information: Microorganisms, Vol. 9, Issue 1; ISSN 2076-2607; ISSN MICRKN

Publisher: MDPI

Country of Publication: United States

Language: English

References (12)

Goodorfs data · McNair, Katelyn · figshare · https://doi.org/10.6084/m9.figshare.13542962.v1 · dataset · January 2021
GENMARK: Parallel gene recognition for both DNA strands · Borodovsky, Mark; McIninch, James · Computers & Chemistry, Vol. 17, Issue 2 · https://doi.org/10.1016/0097-8485(93)85004-V · journal · June 1993
Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene · Fiers, W.; Contreras, R.; Duerinck, F. · Nature, Vol. 260, Issue 5551 · https://doi.org/10.1038/260500a0 · journal · April 1976
Array programming with NumPy · Harris, Charles R.; Millman, K. Jarrod; van der Walt, Stéfan J. · Nature, Vol. 585, Issue 7825 · https://doi.org/10.1038/s41586-020-2649-2 · journal · September 2020
PHANOTATE: a novel approach to gene identification in phage genomes · McNair, Katelyn; Zhou, Carol; Dinsdale, Elizabeth A. · Bioinformatics, Vol. 35, Issue 22 · https://doi.org/10.1093/bioinformatics/btz265 · journal · April 2019
Microbial gene identification using interpolated Markov models · Salzberg, S. L.; Delcher, A. L.; Kasif, S. · Nucleic Acids Research, Vol. 26, Issue 2 · https://doi.org/10.1093/nar/26.2.544 · journal · January 1998
CRITICA: coding region identification tool invoking comparative analysis · Badger, J. H.; Olsen, G. J. · Molecular Biology and Evolution, Vol. 16, Issue 4 · https://doi.org/10.1093/oxfordjournals.molbev.a026133 · journal · April 1999
Analyses of four new Caulobacter Phicbkviruses indicate independent lineages · Wilson, Kiesha; Ely, Bert · Journal of General Virology, Vol. 100, Issue 2 · https://doi.org/10.1099/jgv.0.001218 · journal · February 2019
Matplotlib: A 2D Graphics Environment · Hunter, John D. · Computing in Science & Engineering, Vol. 9, Issue 3 · https://doi.org/10.1109/MCSE.2007.55 · journal · January 2007
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd · Fleischmann, R.; Adams, M.; White, O. · Science, Vol. 269, Issue 5223 · https://doi.org/10.1126/science.7542800 · journal · July 1995
Prodigal: prokaryotic gene recognition and translation initiation site identification · Hyatt, Doug; Chen, Gwo-Liang; LoCascio, Philip F. · BMC Bioinformatics, Vol. 11, Issue 1 · https://doi.org/10.1186/1471-2105-11-119 · journal · March 2010
MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes · Zhu, Huaiqiu; Hu, Gang-Qing; Yang, Yi-Fan · BMC Bioinformatics, Vol. 8, Issue 1 · https://doi.org/10.1186/1471-2105-8-97 · journal · March 2007

Cited By (1)

MultiPhATE2: code for functional annotation and comparison of phage genomes · Ecale Zhou, Carol L.; Kimbrel, Jeffrey; Edwards, Robert · G3 Genes|Genomes|Genetics, Vol. 11, Issue 5 · https://doi.org/10.1093/g3journal/jkab074 · journal · March 2021

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
annotation
clustering
gene
genome
machine learning
phage
prediction
