SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies
Journal Article · Microbiology Spectrum
- Technical Univ. of Denmark, Lyngby (Denmark)
- Univ. of Chicago, IL (United States); Argonne National Laboratory (ANL), Argonne, IL (United States). Data Science and Learning Division; Northwestern Univ., Evanston, IL (United States). Northwestern-Argonne Institute for Science and Engineering (NAISE)
High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging because of genome sequence plasticity and the difficulty of assembling complete sequences. In this study, we developed a random forest classifier to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. To adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000-nucleotide fragments, which were further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds, with a minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). The classifier, implemented as SourceFinder, is available as an online web service to help the community predict the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).
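The fragment-and-k-mer preprocessing described in the abstract might look roughly like the following sketch. It is not the SourceFinder implementation: the k-mer length (`K = 4`), the use of consecutive non-overlapping fragments, the dropping of a short trailing piece, and the function names `fragment` and `kmer_frequencies` are all assumptions made for illustration. Only the 5,000-nucleotide fragment size comes from the record.

```python
from collections import Counter
from itertools import product

FRAGMENT_LENGTH = 5000  # fragment size stated in the abstract
K = 4                   # assumed k-mer length; the abstract does not give one


def fragment(sequence, length=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping fragments.

    Assumption: a trailing piece shorter than `length` is dropped.
    """
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def kmer_frequencies(fragment_seq, k=K):
    """Return relative frequencies of all 4**k DNA k-mers in lexicographic order."""
    # Count every window of width k in the fragment.
    counts = Counter(fragment_seq[i:i + k]
                     for i in range(len(fragment_seq) - k + 1))
    # Fixed feature order: AA..A, AA..C, ..., TT..T.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing characters other than A, C, G, T are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]
```

Each fragment's frequency vector could then serve as one row of a feature matrix for a three-class random forest (for example, scikit-learn's `RandomForestClassifier`), with labels for chromosome, plasmid, and bacteriophage.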
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Organization:
- National Institutes of Health (NIH); Novo Nordisk Foundation; USDOE
- Grant/Contract Number:
- AC02-06CH11357
- OSTI ID:
- 1984973
- Journal Information:
- Journal Name: Microbiology Spectrum; Journal Volume: 10; Journal Issue: 6; ISSN 2165-0497
- Publisher:
- American Society for Microbiology
- Country of Publication:
- United States
- Language:
- English
Similar Records
Classification of bacterial plasmid and chromosome derived sequences using machine learning
Journal Article · Thu Dec 15 19:00:00 EST 2022 · PLoS ONE · OSTI ID: 2320222
Deeplasmid: deep learning accurately separates plasmids from bacterial chromosomes
Journal Article · Sun Dec 05 19:00:00 EST 2021 · Nucleic Acids Research · OSTI ID: 1894067
cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data
Journal Article · Sun Aug 01 20:00:00 EDT 2010 · Bioinformatics · OSTI ID: 1625269