U.S. Department of Energy
Office of Scientific and Technical Information

Classification of bacterial plasmid and chromosome derived sequences using machine learning

Journal Article · PLoS ONE
 [1];  [2];  [2];  [1];  [2]
  1. China-Japan Friendship Hospital, Beijing (China)
  2. Argonne National Laboratory (ANL), Argonne, IL (United States); Univ. of Chicago, IL (United States)
Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine whether a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models, including random forest, logistic regression, XGBoost, and a neural network, for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Of the methods tested, a neural network model that used nucleotide 6-mers as features and was trained on randomly selected chromosomal and plasmid subsequences 5 kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over 10-fold cross-validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer (including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements) were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-encoding sequences in short-read assemblies without the need for sequence alignment-based tools.
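The feature extraction and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the number of sampled windows, and the use of a simple majority vote are assumptions made for illustration, and `classify` stands in for any trained model that maps a 6-mer frequency vector to a label.

```python
import random
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return a vector of 4**k k-mer frequencies (k-mers with non-ACGT bases are skipped)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sample_windows(seq, n, size=5000, rng=None):
    """Draw n random subsequences of `size` bases; short contigs are returned whole."""
    rng = rng or random.Random(0)
    if len(seq) <= size:
        return [seq]
    starts = [rng.randrange(len(seq) - size + 1) for _ in range(n)]
    return [seq[s:s + size] for s in starts]


def vote(seq, classify, n=5, size=5000, rng=None):
    """Classify several windows of one contig and return the most common label.

    Assumption: a plain majority vote; the abstract does not specify the scheme.
    """
    windows = sample_windows(seq, n, size, rng)
    labels = [classify(kmer_frequencies(w)) for w in windows]
    return Counter(labels).most_common(1)[0][0]
```

In use, `classify` would wrap a trained model's prediction call; an odd `n` avoids ties between the two labels.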
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2320222
Journal Information:
PLoS ONE, Vol. 17, Issue 12; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
