OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: GOTTCHA, Version 1

Software · OSTI ID: 1232297

One major challenge in shotgun metagenomics is accurately identifying the organisms present in a community from the classification of short sequence reads. Although microbial community profiling methods have emerged that attempt to rapidly classify the millions of reads output by contemporary sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here we present the application of a novel, gene-independent, signature-based metagenomic taxonomic profiling tool with a significantly lower FDR, which is also capable of classifying never-before-seen genomes into the appropriate parent taxa.

The algorithm is based upon three primary computational phases: (I) genomic decomposition into bit vectors, (II) bit vector intersections to identify shared regions, and (III) bit vector subtractions to remove shared regions and reveal unique, signature regions. In the Decomposition phase, genomic data is first masked to highlight only the valid (non-ambiguous) regions and then decomposed into overlapping 24-mers. The k-mers are sorted along with their start positions, de-replicated, and then prefixed to minimize data duplication. The prefixes are indexed, and a data structure mirroring that of the k-mers is created for the start positions.

The Intersection phase is the most computationally intensive because it makes an all-vs-all comparison, so the number of comparisons is first reduced by four methods: (a) Prefix restriction, (b) Overlap detection, (c) Overlap restriction, and (d) Result recording. In Prefix restriction, only k-mers with the same prefix are compared. Within each prefix group, the k-mer suffix sets are screened for a potential overlap, since only overlapping sets can yield a non-empty intersection.
If such an overlap exists, the intersecting region is first narrowed by a binary search of the smaller set's boundary suffixes in the larger set, which defines the limits of the zipper-based intersection process. Rather than recording the actual k-mers of the intersection, another data structure of identical "shape" is created that consists only of bit vectors, so that only a 1 or 0 is stored at the location of each k-mer suffix found in the intersection. This considerably reduces the amount of data generated and stored.

During the Subtraction phase, the relevant intersection bitmasks are first combined by union into a single bitmask, which is then applied over the original genome to reveal only those regions of the genome that are unique. These regions are then exported to disk in FASTA format and used to determine the constituents of an unknown metagenomic community.
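
The three phases described above can be illustrated with a minimal Python sketch. It is not the GOTTCHA implementation: the function names, the prefix length, the FASTA header layout, and the use of plain Python lists in place of packed bit vectors are all assumptions made for illustration. The sketch also narrows both suffix lists with a binary search (the abstract narrows only the larger one), and it assumes that "applying the bitmask" means masking every base covered by a shared k-mer.

```python
from bisect import bisect_left, bisect_right

VALID = set("ACGT")  # assumption: uppercase, non-ambiguous bases only


def decompose(genome, k=24, prefix_len=4):
    """Decomposition: map prefix -> (sorted unique suffixes, start positions)."""
    groups = {}
    for start in range(len(genome) - k + 1):
        kmer = genome[start:start + k]
        if not set(kmer) <= VALID:  # skip k-mers touching ambiguous bases
            continue
        by_suffix = groups.setdefault(kmer[:prefix_len], {})
        by_suffix.setdefault(kmer[prefix_len:], []).append(start)
    index = {}
    for prefix, by_suffix in groups.items():
        suffixes = sorted(by_suffix)  # de-replicated and sorted
        # Parallel structure of the same shape holding the start positions.
        index[prefix] = (suffixes, [by_suffix[s] for s in suffixes])
    return index


def intersect_bits(own, other):
    """Intersection: one bit per suffix in `own`, set if it occurs in `other`."""
    bits = [0] * len(own)
    # Overlap detection: disjoint suffix ranges cannot intersect.
    if not own or not other or own[-1] < other[0] or other[-1] < own[0]:
        return bits
    # Overlap restriction: binary-search the window bounded by the other list.
    i, i_end = bisect_left(own, other[0]), bisect_right(own, other[-1])
    j, j_end = bisect_left(other, own[0]), bisect_right(other, own[-1])
    while i < i_end and j < j_end:  # zipper walk over the two sorted windows
        if own[i] == other[j]:
            bits[i] = 1  # result recording: store a bit, not the k-mer
            i += 1
            j += 1
        elif own[i] < other[j]:
            i += 1
        else:
            j += 1
    return bits


def signature_regions(genome, others, k=24, prefix_len=4):
    """Subtraction: return (start, sequence) runs not shared with `others`."""
    own_index = decompose(genome, k, prefix_len)
    other_indexes = [decompose(g, k, prefix_len) for g in others]
    shared = [False] * len(genome)
    for prefix, (suffixes, starts) in own_index.items():
        union = [0] * len(suffixes)
        for other in other_indexes:
            if prefix in other:  # prefix restriction
                bits = intersect_bits(suffixes, other[prefix][0])
                union = [u | b for u, b in zip(union, bits)]
        for bit, positions in zip(union, starts):
            if bit:
                for p in positions:  # mask every base of a shared k-mer
                    shared[p:p + k] = [True] * k
    regions, run_start = [], None
    for pos in range(len(genome) + 1):
        keep = pos < len(genome) and genome[pos] in VALID and not shared[pos]
        if keep and run_start is None:
            run_start = pos
        elif not keep and run_start is not None:
            regions.append((run_start, genome[run_start:pos]))
            run_start = None
    return regions


def to_fasta(name, regions):
    """Format regions as FASTA text (the header layout is hypothetical)."""
    return "".join(
        f">{name}|{start}..{start + len(seq)}\n{seq}\n" for start, seq in regions
    )
```

With a toy k of 4 and a prefix length of 2, `signature_regions("AAAACCCCGGGG", ["TTCCGGGGTT"], k=4, prefix_len=2)` keeps only the leading `AAAACC`, because the 4-mers `CCGG`, `CGGG`, and `GGGG` also occur in the second sequence. A real run at k = 24 over many genomes would need compact bit vectors and on-disk indexes, which this sketch does not attempt.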

Short Name / Acronym:
GOTTCHA V. 1; 003589WKSTN00
Site Accession Number:
C14077
Version:
00
Programming Language(s):
Medium: X; OS: LINUX; Compatibility: Multiplatform
Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC52-06NA25396
OSTI ID:
1232297
Country of Origin:
United States

