Title: VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets

Abstract

High-throughput Sanger sequencing requires DNA to be inserted into bacterial vectors for biological amplification. Adapter or linker oligonucleotides may also be attached to target DNA fragments to facilitate insertion into the vector. These vector and adapter sequences are sequenced concomitantly with the target, or insert, sequence and represent contamination that must be removed from the dataset prior to analysis. Such contamination can be removed by screening the dataset against the known sequences of the vector and adapter used to generate the data. However, for public or collaborator datasets, information about these contaminant sequences is often incorrect or absent, resulting in incomplete screening. We have developed VecFinder, software that identifies the vector and adapter sequences from the read sequences alone and subsequently removes them. This removes the dependence on library creators to provide the vector and adapter sequences used for the library. It also automates the previously manual task of identifying and screening the adapter or linker sequence, which can be tedious and time-consuming.
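The abstract does not describe how VecFinder works internally, so the following is only a minimal illustrative sketch of the general idea of inferring a contaminant sequence from the reads themselves. It assumes, purely for illustration, that the adapter appears at the 5' end of most reads, and it builds a per-position consensus until the reads stop agreeing. The function names (`infer_common_prefix`, `trim_prefix`), the thresholds, and the example reads are hypothetical and are not taken from VecFinder.

```python
from collections import Counter


def infer_common_prefix(reads, min_fraction=0.8, max_len=50):
    """Infer a shared 5' prefix (a candidate adapter) from the reads alone.

    Walks position by position and keeps the most common base as long as
    at least `min_fraction` of all reads carry it at that position.
    """
    if not reads:
        return ""
    consensus = []
    for pos in range(max_len):
        # Count the bases observed at this position across all long-enough reads.
        column = Counter(r[pos] for r in reads if len(r) > pos)
        if not column:
            break
        base, count = column.most_common(1)[0]
        # Stop once the reads diverge: this is where insert sequence begins.
        if count / len(reads) < min_fraction:
            break
        consensus.append(base)
    return "".join(consensus)


def trim_prefix(reads, prefix, max_mismatches=1):
    """Remove `prefix` from reads that start with it, tolerating a few mismatches."""
    trimmed = []
    for r in reads:
        head = r[:len(prefix)]
        mismatches = sum(a != b for a, b in zip(head, prefix))
        if prefix and len(head) == len(prefix) and mismatches <= max_mismatches:
            trimmed.append(r[len(prefix):])
        else:
            # Leave reads untouched when the prefix is absent or too divergent.
            trimmed.append(r)
    return trimmed


# Hypothetical reads: a made-up adapter "ACGTGG" followed by differing inserts.
# The last read carries one sequencing error inside the adapter.
example_reads = [
    "ACGTGGATTACA",
    "ACGTGGCCCGTA",
    "ACGTGGTTTAAC",
    "ACGTGGGGGCCC",
    "ACGTCGAATTTG",
]
candidate_adapter = infer_common_prefix(example_reads)
cleaned_reads = trim_prefix(example_reads, candidate_adapter)
```

A real tool would also need to handle vector sequence at the 3' end, variable start offsets, and quality trimming; the consensus approach here was chosen only because it is the simplest way to show contaminant discovery without a reference sequence.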

Authors:
Zhang, Michael Y.; Tu, Hank; Shapiro, Harris; Platt, Darren
Publication Date:
2007-05-04
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1170608
Report Number(s):
LBNL-6833E
DOE Contract Number:
DE-AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: The Biology of Genomes - Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, May 8-12, 2007
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS; VecFinder, de novo, vector and adapter sequence, genomic datasets, Sanger sequencing

Citation Formats

Zhang, Michael Y., Tu, Hank, Shapiro, Harris, and Platt, Darren. VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets. United States: N. p., 2007. Web.
Zhang, Michael Y., Tu, Hank, Shapiro, Harris, & Platt, Darren. (2007). VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets. United States.
Zhang, Michael Y., Tu, Hank, Shapiro, Harris, and Platt, Darren. 2007. "VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets". United States. https://www.osti.gov/servlets/purl/1170608.
@article{osti_1170608,
title = {VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets},
author = {Zhang, Michael Y. and Tu, Hank and Shapiro, Harris and Platt, Darren},
abstractNote = {High-throughput Sanger sequencing requires DNA to be inserted into bacterial vectors for biological amplification. Adapter or linker oligonucleotides may also be attached to target DNA fragments to facilitate insertion into the vector. These vector and adapter sequences are sequenced concomitantly with the target, or insert, sequence and represent contamination which must be removed from the dataset prior to analysis. Removal of such contamination can be accomplished by screening the dataset against the known sequence of the vector and adapter used to generate the data. However, often in the case of public or collaborator datasets, information regarding these contaminant sequences may be incorrect or absent, resulting in an incomplete screening. We've created a piece of software, VecFinder, which is able to identify the sequence of the vector and adapter from the read sequences alone and subsequently remove it. This alleviates the dependence on the library creators to provide the vector and adapter sequences used for the library. It also automates the previously manual task of identifying and screening the adapter or linker sequence, which can be tedious and time-consuming},
place = {United States},
year = {2007},
month = {5}
}

  • Disulfide bonds are a form of posttranslational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, and specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags) we unambiguously correlated the spectra to specific database proteins. Examination of the UStags’ prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to ~10 kDa.
  • Repeat finding has been given little attention in most published complete prokaryotic genome annotations, and yet repeated sequences are ubiquitous in prokaryotic genomes. Failing to identify the repetitive DNA in genomes not only makes systematic characterization of prokaryotic repeats impossible but also leaves the genome annotations incomplete, creating a barrier to genome analysis. We have developed a software package that can identify repeats in the whole prokaryotic genome from the DNA sequence.
  • Automatic de novo peptide identification from collision-induced dissociation tandem mass spectrometry data is made difficult by large plateaus in the fitness landscapes of scoring functions and the fuzzy nature of the constraints that is due to noise in the data. A framework is presented for combining different peptide identification methods within a parallel genetic algorithm. The distinctive feature of our approach, based on Pareto ranking, is that it can accommodate constraints and possibly conflicting scoring functions. We have also shown how population structure can significantly improve the wall clock time of a parallel peptide identification genetic algorithm while at the same time maintaining some exchange of information across local populations.
  • The recent funding of more than a dozen major genome centers to begin community-wide high-throughput sequencing of the human genome has created a significant new challenge for the computational analysis of DNA sequence and the prediction of gene structure and function. It has been estimated that on average from 1996 to 2003, approximately 2 million bases of newly finished DNA sequence will be produced every day and be made available on the Internet and in central databases. The finished (fully assembled) sequence generated each day will represent approximately 75 new genes (and their respective proteins), and many times this number will be represented in partially completed sequences. The information contained in these is of immeasurable value to medical research, biotechnology, the pharmaceutical industry and researchers in a host of fields ranging from microorganism metabolism, to structural biology, to bioremediation. Sequencing of microorganisms and other model organisms is also ramping up at a very rapid rate. The genomes for yeast and several microorganisms such as H. influenzae have recently been fully sequenced, although the significance of many genes remains to be determined.
  • Background: Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. While high throughput mRNA sequencing (RNA-Seq) has emerged as a powerful tool for addressing these problems, its success is dependent upon the availability and quality of reference genome sequences, thus limiting the organisms to which it can be applied. Results: Here, we describe Rnnotator, an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data without the need for a reference genome. We have applied the Rnnotator assembly pipeline to two yeast transcriptomes and compared the results to the reference gene catalogs of these organisms. The contigs produced by Rnnotator are highly accurate (95%) and reconstruct full-length genes for the majority of the existing gene models (54.3%). Furthermore, our analyses revealed many novel transcribed regions that are absent from well annotated genomes, suggesting Rnnotator serves as a complementary approach to analysis based on a reference genome for comprehensive transcriptomics. Conclusions: These results demonstrate that the Rnnotator pipeline is able to reconstruct full-length transcripts in the absence of a complete reference genome.