DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A novel procedure on next generation sequencing data analysis using text mining algorithm

Abstract

Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining.

Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity for different numbers of topics and the convergence efficiency of Gibbs sampling were calculated and discussed to obtain the best result from the proposed procedure.

Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data.

Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
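The topic modeling step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the overlapping k-mer tokenization, the toy sequences (invented strings, not real fliC data), the hyperparameters `alpha` and `beta`, and the function names `kmers` and `lda_gibbs` are all assumptions made for illustration. The sketch treats each strain's sequence as a "document", samples topic assignments with collapsed Gibbs sampling, and returns a strain-by-topic matrix of the kind that could then be passed to hierarchical clustering.

```python
import random


def kmers(seq, k=3):
    """Split a DNA string into overlapping k-mers, used here as 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


# Hypothetical toy sequences; real input would come from the NGS preprocessing step.
strains = {
    "strain_A": "ATGGCACAAGTCATTAATACAAACAGCCTG",
    "strain_B": "ATGGCACAAGTCATTAATACCAACAGCCTG",
    "strain_C": "GGTTCCGGAACCTTGGCCAAGGTTCCAAGG",
    "strain_D": "GGTTCCGGAACCTTGGCCTAGGTTCCAAGG",
}
docs = [kmers(seq) for seq in strains.values()]
theta = lda_gibbs(docs, n_topics=2)

for name, row in zip(strains, theta):
    print(name, [round(p, 3) for p in row])
```

Each row of `theta` is a topic-proportion vector for one strain. In a workflow like the one the abstract describes, such vectors would serve as the features for clustering, and the number of topics would be chosen by comparing perplexity across candidate values rather than fixed at 2 as it is here.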

Authors:
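Zhao, Weizhong [1]; Chen, James J. [2]; Perkins, Roger [2]; Wang, Yuping [2]; Liu, Zhichao [2]; Hong, Huixiao [2]; Tong, Weida [2]; Zou, Wen [2]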
  1. U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research; Xiangtan Univ. Xiangtan (China). College of Information Engineering
  2. U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research
Publication Date:
2016-05-13
Research Org.:
Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1626761
Grant/Contract Number:  
SC0014664
Resource Type:
Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 17; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; Biochemistry & Molecular Biology; Biotechnology & Applied Microbiology; Mathematical & Computational Biology; Data mining; Topic modeling; Next-generation sequencing (NGS); Genetic diversity; Biomarker

Citation Formats

Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, and Zou, Wen. A novel procedure on next generation sequencing data analysis using text mining algorithm. United States: N. p., 2016. Web. doi:10.1186/s12859-016-1075-9.
Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, & Zou, Wen. A novel procedure on next generation sequencing data analysis using text mining algorithm. United States. https://doi.org/10.1186/s12859-016-1075-9
Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, and Zou, Wen. 2016. "A novel procedure on next generation sequencing data analysis using text mining algorithm". United States. https://doi.org/10.1186/s12859-016-1075-9. https://www.osti.gov/servlets/purl/1626761.
@article{osti_1626761,
title = {A novel procedure on next generation sequencing data analysis using text mining algorithm},
author = {Zhao, Weizhong and Chen, James J. and Perkins, Roger and Wang, Yuping and Liu, Zhichao and Hong, Huixiao and Tong, Weida and Zou, Wen},
abstractNote = {Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutional studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of the Salmonella enterica strains were used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. Results: The output topics by LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. The implementation of topic modeling in NGS data analysis procedure provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data. Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.},
doi = {10.1186/s12859-016-1075-9},
journal = {BMC Bioinformatics},
number = 1,
volume = 17,
place = {United States},
year = {2016},
month = may
}

Figures / Tables:

Fig. 1: Flowchart of the proposed procedure
