
Title: GOTTCHA Database, Version 1

One major challenge in the field of shotgun metagenomics is the accurate identification of the organisms present within a community, based on classification of short sequence reads. Though microbial community profiling methods have emerged to rapidly classify the millions of reads output by contemporary sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling tool with significantly lower FDR, which is also capable of classifying never-before-seen genomes into the appropriate parent taxa.

The algorithm is based upon three primary computational phases: (I) genomic decomposition into bit vectors, (II) bit vector intersections to identify shared regions, and (III) bit vector subtractions to remove shared regions and reveal unique, signature regions.

In the Decomposition phase, genomic data is first masked to highlight only the valid (non-ambiguous) regions and then decomposed into overlapping 24-mers. The k-mers are sorted along with their start positions, de-replicated, and then prefixed, to minimize data duplication. The prefixes are indexed, and an identical data structure is created for the start positions to mimic that of the k-mer data structure.

The Intersection phase is the most computationally intensive phase, because it is an all-vs-all comparison. The number of comparisons is first reduced by four methods: (a) Prefix restriction, (b) Overlap detection, (c) Overlap restriction, and (d) Result recording. In Prefix restriction, only k-mers of the same prefix are compared. Within that group, the k-mer suffixes are screened for potential overlap that would result in a non-empty set intersection.
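The Decomposition phase described above can be illustrated with a minimal Python sketch. It is not the GOTTCHA implementation: the function name `decompose`, the `prefix_len` parameter, and the use of plain dictionaries and lists in place of compact indexed arrays are assumptions made for illustration. The record does not state the prefix length, so the default of 4 is arbitrary.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")  # anything else (e.g. N) is treated as ambiguous


def decompose(seq, k=24, prefix_len=4):
    """Split `seq` into overlapping k-mers grouped by prefix.

    Returns {prefix: (suffixes, starts)} where `suffixes` is a sorted,
    de-replicated list of k-mer suffixes and `starts` is a parallel list
    holding every start position of the corresponding k-mer.
    `prefix_len` is a hypothetical parameter, not taken from the source.
    """
    seq = seq.upper()
    positions = defaultdict(list)  # k-mer -> all start positions
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Masking step: skip any window that touches an ambiguous base.
        if all(base in VALID_BASES for base in kmer):
            positions[kmer].append(i)

    index = {}
    # Sorting whole k-mers keeps the suffixes sorted within each prefix.
    for kmer in sorted(positions):
        prefix, suffix = kmer[:prefix_len], kmer[prefix_len:]
        suffixes, starts = index.setdefault(prefix, ([], []))
        suffixes.append(suffix)
        starts.append(positions[kmer])
    return index
```

Because each prefix is stored once rather than repeated for every k-mer, and the start positions sit in a structure of the same shape as the suffixes, a later comparison only needs to look at suffix lists that share a prefix.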
Where the suffix ranges of two sets do overlap, the intersecting region is first narrowed by a binary search of the boundary suffixes of the smaller set within the larger set, which defines the limits of the zipper-based intersection process. Rather than recording the actual k-mers of the intersection, another data structure of identical "shape" is created that consists only of bit vectors, so that a 1 or 0 is stored at the location of each k-mer suffix according to whether it was found in the intersection. This considerably reduces the amount of data generated and stored.

During the Subtraction phase, the relevant intersection bitmasks are first combined by union into a single bitmask, which is then applied over the original genome to reveal only those regions of the genome that are unique. These regions are then exported to disk in FASTA format and used to determine the constituents of an unknown metagenomic community.

The database provided is the result of the algorithm described.
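The Intersection and Subtraction phases can be sketched in the same spirit. The code below is an illustrative approximation, not the tool's own code: the names `intersect_mask`, `union_masks`, and `unique_regions` are hypothetical, Python lists stand in for packed bit vectors, and the binary search bounds both lists rather than only searching the smaller set into the larger one. It also assumes the genome contains no ambiguous bases and that at least one mask is supplied.

```python
from bisect import bisect_left, bisect_right


def intersect_mask(a, b):
    """Bit vector shaped like sorted list `a`: 1 where a[i] also occurs in sorted `b`."""
    mask = [0] * len(a)
    # Overlap detection: disjoint ranges cannot intersect.
    if not a or not b or a[-1] < b[0] or b[-1] < a[0]:
        return mask
    # Overlap restriction: binary search the boundary elements.
    i, i_end = bisect_left(a, b[0]), bisect_right(a, b[-1])
    j, j_end = bisect_left(b, a[0]), bisect_right(b, a[-1])
    # Zipper-style merge over the restricted windows.
    while i < i_end and j < j_end:
        if a[i] == b[j]:
            mask[i] = 1  # result recording: store a bit, not the k-mer
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return mask


def union_masks(masks):
    """Combine equally shaped bitmasks with a bitwise OR (needs at least one mask)."""
    return [int(any(bits)) for bits in zip(*masks)]


def unique_regions(genome, starts, shared, k):
    """Blank out every position covered by a shared k-mer; return what is left.

    `starts[i]` lists the start positions of the i-th sorted k-mer and
    `shared[i]` is its bit in the unioned mask.
    Returns (offset, sequence) pairs for each uncovered stretch.
    """
    covered = [False] * len(genome)
    for bit, positions in zip(shared, starts):
        if bit:
            for p in positions:
                for q in range(p, p + k):
                    covered[q] = True
    regions, run_start = [], None
    # The trailing True sentinel closes a run that reaches the genome end.
    for pos, is_covered in enumerate(covered + [True]):
        if not is_covered and run_start is None:
            run_start = pos
        elif is_covered and run_start is not None:
            regions.append((run_start, genome[run_start:pos]))
            run_start = None
    return regions
```

Each `(offset, sequence)` pair returned by `unique_regions` corresponds to one candidate signature region, which could then be written out as a FASTA record.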
Publication Date:
OSTI Identifier:
Report Number(s):
R&D Project: LA-CC-14-040; C14078
DOE Contract Number:
Resource Type:
Software Revision:
Software Package Number:
Software Package Contents:
Open Source Software package available from Los Alamos National Laboratory at the following URL:
Software CPU:
Open Source:
Source Code Available:
Research Org:
Los Alamos National Laboratory
Sponsoring Org:
Country of Publication:
United States
