DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: A novel procedure on next generation sequencing data analysis using text mining algorithm

Journal Article · BMC Bioinformatics
 [1];  [2];  [2];  [2];  [2];  [2];  [2];  [2]
  1. U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research; Xiangtan Univ., Xiangtan (China). College of Information Engineering
  2. U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research

Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies of the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining.
Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. An NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. Perplexity as a function of the number of topics and the convergence efficiency of Gibbs sampling were calculated and discussed with a view to achieving the best result from the proposed procedure.
Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data.
Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
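
Illustrative sketch (not taken from the article): the abstract does not say how sequences are turned into "words", which LDA implementation was used, or which hyperparameters were chosen. The short Python example below therefore makes several assumptions: each sequence is treated as a document of overlapping k-mers, LDA is fitted with a plain collapsed Gibbs sampler, and the sequences, strain labels, k, alpha, beta, and topic counts are all invented toy values. It is only meant to show how topic modeling, perplexity, and a per-strain topic matrix fit together.

```python
import math
import random

def kmer_words(seq, k=3):
    """Split a nucleotide string into overlapping k-mers (assumed 'words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lda_gibbs(docs, vocab_size, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.
    Returns (theta, phi): doc-topic and topic-word probability matrices."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]                 # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                                  # words per topic
    z = []                                               # topic of each token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)                  # random initial topic
            z[d].append(t)
            ndk[d][t] += 1
            nkw[t][w] += 1
            nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                              # remove current assignment
                ndk[d][t] -= 1
                nkw[t][w] -= 1
                nk[t] -= 1
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t                              # record resampled topic
                ndk[d][t] += 1
                nkw[t][w] += 1
                nk[t] += 1
    theta = [[(ndk[d][j] + alpha) / (len(doc) + n_topics * alpha)
              for j in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(nkw[j][w] + beta) / (nk[j] + vocab_size * beta)
            for w in range(vocab_size)] for j in range(n_topics)]
    return theta, phi

def perplexity(docs, theta, phi):
    """exp(-log-likelihood per token) under the fitted theta and phi."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += math.log(sum(theta[d][j] * phi[j][w] for j in range(len(phi))))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Invented toy sequences; NOT real Salmonella data.
toy_seqs = {
    "strain_A": "ACGTACGTACGTACGTACGT",
    "strain_B": "ACGTACGTACGTACGAACGT",
    "strain_C": "TTGCATTGCATTGCATTGCA",
    "strain_D": "TTGCATTGCATTGCTTTGCA",
}
tokenized = [kmer_words(s) for s in toy_seqs.values()]
vocab = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}
docs = [[vocab[w] for w in doc] for doc in tokenized]

# Compare candidate topic numbers by training-set perplexity.
scores = {}
for n_topics in (2, 3, 4):
    theta, phi = lda_gibbs(docs, len(vocab), n_topics)
    scores[n_topics] = perplexity(docs, theta, phi)
```

Each row of theta is a topic-proportion vector for one sequence; a matrix of such rows is the kind of LDA-derived feature matrix that could then be passed to hierarchical clustering. In practice perplexity is usually evaluated on held-out data rather than on the training set as done in this toy example.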

Research Organization:
Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
SC0014664
OSTI ID:
1626761
Journal Information:
BMC Bioinformatics, Vol. 17, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English

Figures / Tables (11)