U.S. Department of Energy
Office of Scientific and Technical Information

SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies

Journal Article · Microbiology Spectrum
High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging because of genome sequence plasticity and the difficulty of assembling complete sequences. In this study, we developed a random forest classifier to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. To adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000-nucleotide fragments, which were further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds, with a minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). The classifier, implemented as SourceFinder, has been made available as an online web service to help the community predict the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).
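The fragment-and-count step described in the abstract can be illustrated with a minimal Python sketch. This is not the SourceFinder implementation. The fragment length of 5,000 nucleotides comes from the abstract, but the value of k (4 here), the use of consecutive non-overlapping windows in place of subsampling, the dropping of a short trailing piece, and the function names `fragment` and `kmer_distribution` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

FRAGMENT_LENGTH = 5000  # fragment size stated in the abstract
K = 4                   # assumption: the abstract does not state the k-mer size


def fragment(sequence, size=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping fragments of `size` nt.

    Assumption: a trailing piece shorter than `size` is dropped.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def kmer_distribution(fragment_seq, k=K):
    """Return relative frequencies of all 4**k A/C/G/T k-mers in a fixed order.

    K-mers containing other symbols (for example N) are ignored.
    """
    seq = fragment_seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


# Toy input only: a 12,000 nt repeat, not real genomic data.
toy_sequence = "ACGT" * 3000
fragments = fragment(toy_sequence)
features = [kmer_distribution(f) for f in fragments]
```

Each entry of `features` is a fixed-length vector (256 values for k = 4) that could serve as one input row for a three-class classifier such as a random forest, with the class label taken from the source of the parent sequence.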
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Institutes of Health (NIH); Novo Nordisk Foundation; USDOE
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1984973
Journal Information:
Journal Name: Microbiology Spectrum; Journal Volume: 10; Journal Issue: 6; ISSN 2165-0497
Publisher:
American Society for Microbiology
Country of Publication:
United States
Language:
English
