Title: acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data

Abstract

A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data. We present acdc, a tool specifically developed to aid the quality control process of genomic sequence data. By combining supervised and unsupervised methods, it reliably detects both known and de novo contaminants. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample. Furthermore, given the data complexity and the ill-posedness of clustering, acdc employs bootstrapping techniques to provide statistically profound confidence values. Tested on a large number of samples from diverse sequencing projects, our software is able to quickly and accurately identify contamination. Results are displayed in an interactive user interface. Acdc can be run from the web as well as a dedicated command line application, which allows easy integration into large sequencing project analysis workflows. Acdc can reliably detect contamination in single-cell genome data. In addition to database-driven detection, it complements existing tools by its unsupervised techniques, which allow for the detection of de novo contaminants. Our contribution has the potential to drastically reduce the amount of resources put into these processes, particularly in the context of limited availability of reference species. As single-cell genome data continues to grow rapidly, acdc adds to the toolkit of crucial quality assurance tools.
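
The abstract mentions "oligonucleotide signatures" as the input to the reference-free analysis. As an illustration only, the following minimal Python sketch computes one common form of such a signature, the relative frequency of every k-mer in a sequence. The function name `kmer_signature`, the default k = 4, and the choice to skip windows containing ambiguous bases are assumptions made for this sketch; the record does not state which signature length or normalization acdc itself uses, and reverse-complement merging, which some tools apply, is not done here.

```python
from collections import Counter
from itertools import product


def kmer_signature(seq, k=4):
    """Return relative k-mer frequencies of seq as a list of length 4**k.

    Hypothetical helper for illustration; not taken from the acdc code base.
    Entries follow the lexicographic order of k-mers over the alphabet ACGT.
    Windows containing any other character (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    signature = kmer_signature("ACGTACGTTTGACCA")
    print(len(signature), round(sum(signature), 6))
```

Each contig or read then becomes a fixed-length numeric vector, which is the kind of input that dimensionality reduction and clustering methods, such as those named in the abstract, operate on.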

Authors:
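Lux, Markus [1]; Kruger, Jan [1]; Rinke, Christian [2]; Maus, Irena [1]; Schluter, Andreas [1]; Woyke, Tanja [3]; Sczyrba, Alexander [1]; Hammer, Barbara [1]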
  1. Bielefeld Univ., Bielefeld (Germany)
  2. Univ. of Queensland, Brisbane (Australia)
  3. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
Publication Date:
2016-12
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
OSTI Identifier:
1379618
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 17; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 59 BASIC BIOLOGICAL SCIENCES; Single-cell sequencing; Contamination detection; Machine learning; Clustering; Binning; Quality control

Citation Formats

Lux, Markus, Kruger, Jan, Rinke, Christian, Maus, Irena, Schluter, Andreas, Woyke, Tanja, Sczyrba, Alexander, and Hammer, Barbara. acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data. United States: N. p., 2016. Web. doi:10.1186/s12859-016-1397-7.
Lux, Markus, Kruger, Jan, Rinke, Christian, Maus, Irena, Schluter, Andreas, Woyke, Tanja, Sczyrba, Alexander, & Hammer, Barbara. acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data. United States. doi:10.1186/s12859-016-1397-7.
Lux, Markus, Kruger, Jan, Rinke, Christian, Maus, Irena, Schluter, Andreas, Woyke, Tanja, Sczyrba, Alexander, and Hammer, Barbara. "acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data". United States. doi:10.1186/s12859-016-1397-7. https://www.osti.gov/servlets/purl/1379618.
@article{osti_1379618,
title = {acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data},
author = {Lux, Markus and Kruger, Jan and Rinke, Christian and Maus, Irena and Schluter, Andreas and Woyke, Tanja and Sczyrba, Alexander and Hammer, Barbara},
abstractNote = {A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data. We present acdc, a tool specifically developed to aid the quality control process of genomic sequence data. By combining supervised and unsupervised methods, it reliably detects both known and de novo contaminants. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample. Furthermore, given the data complexity and the ill-posedness of clustering, acdc employs bootstrapping techniques to provide statistically profound confidence values. Tested on a large number of samples from diverse sequencing projects, our software is able to quickly and accurately identify contamination. Results are displayed in an interactive user interface. Acdc can be run from the web as well as a dedicated command line application, which allows easy integration into large sequencing project analysis workflows. Acdc can reliably detect contamination in single-cell genome data. 
In addition to database-driven detection, it complements existing tools by its unsupervised techniques, which allow for the detection of de novo contaminants. Our contribution has the potential to drastically reduce the amount of resources put into these processes, particularly in the context of limited availability of reference species. As single-cell genome data continues to grow rapidly, acdc adds to the toolkit of crucial quality assurance tools.},
doi = {10.1186/s12859-016-1397-7},
journal = {BMC Bioinformatics},
issn = {1471-2105},
number = 1,
volume = 17,
place = {United States},
year = {2016},
month = {12}
}

Works referenced in this record:

RNAmmer: consistent and rapid annotation of ribosomal RNA genes
journal, April 2007

  • Lagesen, Karin; Hallin, Peter; Rødland, Einar Andreas
  • Nucleic Acids Research, Vol. 35, Issue 9
  • DOI: 10.1093/nar/gkm160

Data clustering: 50 years beyond K-means
journal, June 2010


Why so many clustering algorithms: a position paper
journal, June 2002

  • Estivill-Castro, Vladimir
  • ACM SIGKDD Explorations Newsletter, Vol. 4, Issue 1
  • DOI: 10.1145/568574.568575

Single-cell analysis: toward the clinic
journal, August 2013

  • Speicher, Michael R.
  • Genome Medicine, Vol. 5, Issue 8
  • DOI: 10.1186/gm478

metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA sequences
journal, January 2013

  • Ander, Christina; Schulz-Trieglaff, Ole B.; Stoye, Jens
  • BMC Bioinformatics, Vol. 14, Issue Suppl 5
  • DOI: 10.1186/1471-2105-14-S5-S2

READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation
journal, November 2012


Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
journal, November 2014


Relative clustering validity criteria: A comparative overview
journal, January 2010

  • Vendramin, Lucas; Campello, Ricardo J. G. B.; Hruschka, Eduardo R.
  • Statistical Analysis and Data Mining
  • DOI: 10.1002/sam.10080

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes
journal, May 2015

  • Parks, Donovan H.; Imelfort, Michael; Skennerton, Connor T.
  • Genome Research, Vol. 25, Issue 7
  • DOI: 10.1101/gr.186072.114

ART: a next-generation sequencing read simulator
journal, December 2011


Effects of sample treatments on genome recovery via single-cell genomics
journal, June 2014

  • Clingenpeel, Scott; Schwientek, Patrick; Hugenholtz, Philip
  • The ISME Journal, Vol. 8, Issue 12
  • DOI: 10.1038/ismej.2014.92

Herbinix hemicellulosilytica gen. nov., sp. nov., a thermophilic cellulose-degrading bacterium isolated from a thermophilic biogas reactor
journal, August 2015

  • Koeck, Daniela E.; Zverlov, Vladimir V.; Schwarz, Wolfgang H.
  • International Journal of Systematic and Evolutionary Microbiology, Vol. 65, Issue 8
  • DOI: 10.1099/ijs.0.000264

Kraken: ultrafast metagenomic sequence classification using exact alignments
journal, January 2014


ProDeGe: a computational protocol for fully automated decontamination of genomes
journal, June 2015

  • Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott
  • The ISME Journal, Vol. 10, Issue 1
  • DOI: 10.1038/ismej.2015.100

Application of tetranucleotide frequencies for the assignment of genomic fragments
journal, September 2004


The promise of single-cell sequencing
journal, December 2013

  • Eberwine, James; Sul, Jai-Yoon; Bartfai, Tamas
  • Nature Methods, Vol. 11, Issue 1
  • DOI: 10.1038/nmeth.2769

Complete genome sequence of the methanogenic neotype strain Methanobacterium formicicum MFT
journal, December 2014


The first five years of single-cell cancer genomics and beyond
journal, October 2015


The future is now: single-cell genomics of bacteria and archaea
journal, May 2013


The Dip Test of Unimodality
journal, March 1985


Decontamination of MDA Reagents for Single Cell Whole Genome Amplification
journal, October 2011


SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
journal, May 2012

  • Bankevich, Anton; Nurk, Sergey; Antipov, Dmitry
  • Journal of Computational Biology, Vol. 19, Issue 5
  • DOI: 10.1089/cmb.2012.0021

Alignment-free Visualization of Metagenomic Data by Nonlinear Dimension Reduction
journal, March 2014

  • Laczny, Cedric C.; Pinel, Nicolás; Vlassis, Nikos
  • Scientific Reports, Vol. 4, Issue 1
  • DOI: 10.1038/srep04516

Estimating the number of clusters in a data set via the gap statistic
journal, May 2001

  • Tibshirani, Robert; Walther, Guenther; Hastie, Trevor
  • Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 63, Issue 2, p. 411-423
  • DOI: 10.1111/1467-9868.00293

Isolation of acetic, propionic and butyric acid-forming bacteria from biogas plants
journal, February 2016


BLAST+: architecture and applications
journal, January 2009

  • Camacho, Christiam; Coulouris, George; Avagyan, Vahram
  • BMC Bioinformatics, Vol. 10, Issue 1
  • DOI: 10.1186/1471-2105-10-421

Classification of metagenomic sequences: methods and challenges
journal, September 2012

  • Mande, S. S.; Mohammed, M. H.; Ghosh, T. S.
  • Briefings in Bioinformatics, Vol. 13, Issue 6
  • DOI: 10.1093/bib/bbs054

Single-cell genome sequencing: current state of the science
journal, January 2016

  • Gawad, Charles; Koh, Winston; Quake, Stephen R.
  • Nature Reviews Genetics, Vol. 17, Issue 3
  • DOI: 10.1038/nrg.2015.16

A tutorial on spectral clustering
journal, August 2007


Insights into the phylogeny and coding potential of microbial dark matter
journal, July 2013

  • Rinke, Christian; Schwientek, Patrick; Sczyrba, Alexander
  • Nature, Vol. 499, Issue 7459
  • DOI: 10.1038/nature12352

Potential for Chemolithoautotrophy Among Ubiquitous Bacteria Lineages in the Dark Ocean
journal, September 2011


    Works referencing / citing this record:

    Characterization of Bathyarchaeota genomes assembled from metagenomes of biofilms residing in mesophilic and thermophilic biogas reactors
    journal, June 2018

