ADEPT, a dynamic next generation sequencing data error-detection program with trimming

Feng, Shihai; Lo, Chien-Chi; Li, Po-E; Chain, Patrick S. G.

doi:10.1186/s12859-016-0967-z

Abstract

Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method that evaluates the quality score of each nucleotide and of its neighboring nucleotides, together with their positions within the read, and compares them to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
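The idea in the abstract, judging a base by its own quality score, its neighbors' scores, and the quality distribution at the same read position across the run, can be illustrated with a small sketch. This is not ADEPT's actual algorithm: the function names, the median-minus-margin cutoff, the `margin` and `window` parameters, and the flagging rule are all hypothetical choices made only to show the general shape of a position-aware check.

```python
from statistics import median


def position_cutoffs(reads_quals, margin=10):
    """Per-position cutoff: median quality at that read position minus a margin.

    The margin is a hypothetical parameter, not a value taken from ADEPT.
    """
    n_pos = max(len(q) for q in reads_quals)
    cutoffs = []
    for pos in range(n_pos):
        # Quality scores observed at this position across all reads long enough.
        column = [q[pos] for q in reads_quals if pos < len(q)]
        cutoffs.append(median(column) - margin)
    return cutoffs


def flag_suspect_bases(quals, cutoffs, window=1):
    """Return positions whose quality is low for the position AND low locally.

    Hypothetical rule: a base is flagged when its score is below the
    position-specific cutoff and below the mean score of its neighbors.
    """
    flagged = []
    for pos, q in enumerate(quals):
        lo = max(0, pos - window)
        hi = min(len(quals), pos + window + 1)
        neighbors = [quals[i] for i in range(lo, hi) if i != pos]
        neighbor_mean = sum(neighbors) / len(neighbors) if neighbors else q
        if q < cutoffs[pos] and q < neighbor_mean:
            flagged.append(pos)
    return flagged


if __name__ == "__main__":
    # Toy run of four reads, already decoded to integer Phred scores.
    run = [
        [38, 38, 37, 36],
        [38, 12, 37, 35],
        [37, 38, 36, 20],
        [38, 37, 38, 36],
    ]
    cutoffs = position_cutoffs(run)
    for read in run:
        print(read, flag_suspect_bases(read, cutoffs))
```

Because the cutoff is derived from the run itself, the same score can be unremarkable at a read end, where quality typically declines, yet suspicious in the middle of a read. That position dependence is the property the abstract emphasizes.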

Authors:

Feng, Shihai [1]; Lo, Chien-Chi [1]; Li, Po-E [1]; Chain, Patrick S. G. [1]

[1] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

Publication Date: 2016-02-29

Research Org.: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Org.: USDOE

OSTI Identifier: 1248578

Report Number(s): LA-UR-14-25592
Journal ID: ISSN 1471-2105; PII: 967

Grant/Contract Number: AC02-05CH11231; AC52-06NA25396; CB10152; Y1-DE-6006-02; HSHQDC08X00790; B104153I; B084531I

Resource Type: Accepted Manuscript

Journal Name: BMC Bioinformatics

Additional Journal Information: Journal Volume: 17; Journal Issue: 1; Journal ID: ISSN 1471-2105

Publisher: BioMed Central

Country of Publication: United States

Language: English

Subject: 59 BASIC BIOLOGICAL SCIENCES; Next generation sequencing; Illumina error prediction; Local quality scores; Position-specific quality

Citation Formats


Feng, Shihai, Lo, Chien-Chi, Li, Po-E, and Chain, Patrick S. G. ADEPT, a dynamic next generation sequencing data error-detection program with trimming. United States: N. p., 2016. Web. doi:10.1186/s12859-016-0967-z.

Feng, Shihai, Lo, Chien-Chi, Li, Po-E, & Chain, Patrick S. G. ADEPT, a dynamic next generation sequencing data error-detection program with trimming. United States. https://doi.org/10.1186/s12859-016-0967-z

Feng, Shihai, Lo, Chien-Chi, Li, Po-E, and Chain, Patrick S. G. 2016. "ADEPT, a dynamic next generation sequencing data error-detection program with trimming". United States. https://doi.org/10.1186/s12859-016-0967-z. https://www.osti.gov/servlets/purl/1248578.


                    
@article{osti_1248578,
  title        = {ADEPT, a dynamic next generation sequencing data error-detection program with trimming},
  author       = {Feng, Shihai and Lo, Chien-Chi and Li, Po-E and Chain, Patrick S. G.},
  abstractNote = {Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method that evaluates the quality score of each nucleotide and of its neighboring nucleotides, together with their positions within the read, and compares them to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.},
  doi          = {10.1186/s12859-016-0967-z},
  journal      = {BMC Bioinformatics},
  number       = 1,
  volume       = 17,
  place        = {United States},
  year         = {2016},
  month        = feb
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1186/s12859-016-0967-z

Citation Metrics:

Cited by: 2 works

Citation information provided by
Web of Science

Figures / Tables:

Fig. 1: Comparison of predicted error rates with observed error rates. The solid line represents the theoretical, predicted error rate given a Q score, P = 10^(−Q/10), where Q is the Phred quality score and P is the predicted error rate. The actual error rates for all called Q scores ...
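The relation in the Fig. 1 caption, P = 10^(−Q/10), can be worked through in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the FASTQ decoding assumes the common Phred+33 ASCII offset, which this record does not specify.

```python
import math


def phred_to_error_prob(q):
    """Predicted error probability for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse relation: Q = -10 * log10(P), defined for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_quality(qual, offset=33):
    """Decode a FASTQ quality string to integer Phred scores.

    The offset of 33 is an assumption (Phred+33 encoding); older Illumina
    pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: predicted error rate {phred_to_error_prob(q):.4g}")
```

Under this relation, each 10-point increase in Q corresponds to a tenfold drop in the predicted error rate, so Q20 predicts 1 error in 100 bases and Q30 predicts 1 in 1000.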

All figures and tables (4 total)

Other research related to this record:

Additional file 1: of ADEPT, a dynamic next generation sequencing data error-detection program with trimming
dataset, February 2016

Feng, Shihai; Lo, Chien-Chi; Li, Po-E
Figshare, 600.13 kB
DOI: 10.6084/m9.figshare.c.3617303_d1.v1
This dataset is a supplement to the current record