
Title: Galba: genome annotation with miniprot and AUGUSTUS

Journal Article · BMC Bioinformatics
 [1];  [2];  [3];  [4];  [5];  [6];  [6];  [6];  [6];  [6]
  1. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
  2. Dana-Farber Cancer Institute, Boston, MA (United States); Harvard Medical School, Boston, MA (United States)
  3. University of Otago, Dunedin (New Zealand)
  4. University of Göttingen (Germany)
  5. University of Passau (Germany)
  6. University of Greifswald (Germany)

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE; US National Institutes of Health (NIH); German Research Foundation
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2470966
Journal Information:
BMC Bioinformatics, Vol. 24, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
