A novel procedure on next generation sequencing data analysis using text mining algorithm
Abstract
Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining.
Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity for different numbers of topics and the convergence efficiency of Gibbs sampling were calculated and discussed to achieve the best result from the proposed procedure.
Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains that accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data.
Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
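The workflow summarized in the abstract (sequences treated as documents, LDA fitted by Gibbs sampling, topic proportions used as strain features) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how sequences are turned into "words", so overlapping k-mers are an assumption here, and the toy sequences, strain names, function names, and hyperparameters (`alpha`, `beta`, `n_iter`) are all hypothetical.

```python
import random

def kmer_words(seq, k=3):
    """Split a nucleotide sequence into overlapping k-mers ("words").
    Assumption: k-mer tokenization; the record does not specify the scheme."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.
    Returns one row of topic proportions per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # counts per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                t = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][v] -= 1
                topic_total[t] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][v] += 1
                topic_total[t] += 1

    # Smoothed topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

# Hypothetical toy sequences standing in for gene fragments from four strains.
toy_strains = {
    "strain_A1": "ACGACGACGACGACGTACGACG",
    "strain_A2": "ACGACGTACGACGACGACGACG",
    "strain_B1": "TTGATTGATTGACTTGATTGA",
    "strain_B2": "TTGACTTGATTGATTGATTGA",
}
docs = [kmer_words(seq) for seq in toy_strains.values()]
theta = lda_gibbs(docs, n_topics=2)

for name, row in zip(toy_strains, theta):
    print(name, [round(p, 3) for p in row])
```

Each row of `theta` is a strain's topic-proportion vector. A matrix of such rows is the kind of LDA-derived feature matrix that downstream steps such as hierarchical clustering would take as input.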
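- Authors:
- Zhao, Weizhong; Chen, James J.; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen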
- U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research; Xiangtan Univ. Xiangtan (China). College of Information Engineering
- U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research
- Publication Date:
- 2016-05-13
- Research Org.:
- Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1626761
- Grant/Contract Number:
- SC0014664
- Resource Type:
- Accepted Manuscript
- Journal Name:
- BMC Bioinformatics
- Additional Journal Information:
- Journal Volume: 17; Journal Issue: 1; Journal ID: ISSN 1471-2105
- Publisher:
- BioMed Central
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; Biochemistry & Molecular Biology; Biotechnology & Applied Microbiology; Mathematical & Computational Biology; Data mining; Topic modeling; Next-generation sequencing (NGS); Genetic diversity; Biomarker
Citation Formats
Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, and Zou, Wen. A novel procedure on next generation sequencing data analysis using text mining algorithm. United States: N. p., 2016. Web. doi:10.1186/s12859-016-1075-9.
Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, & Zou, Wen. A novel procedure on next generation sequencing data analysis using text mining algorithm. United States. https://doi.org/10.1186/s12859-016-1075-9
Zhao, Weizhong, Chen, James J., Perkins, Roger, Wang, Yuping, Liu, Zhichao, Hong, Huixiao, Tong, Weida, and Zou, Wen. 2016.
"A novel procedure on next generation sequencing data analysis using text mining algorithm". United States. https://doi.org/10.1186/s12859-016-1075-9. https://www.osti.gov/servlets/purl/1626761.
@article{osti_1626761,
title = {A novel procedure on next generation sequencing data analysis using text mining algorithm},
author = {Zhao, Weizhong and Chen, James J. and Perkins, Roger and Wang, Yuping and Liu, Zhichao and Hong, Huixiao and Tong, Weida and Zou, Wen},
abstractNote = {Background: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity for different numbers of topics and the convergence efficiency of Gibbs sampling were calculated and discussed to achieve the best result from the proposed procedure. Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains that accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.},
doi = {10.1186/s12859-016-1075-9},
journal = {BMC Bioinformatics},
number = 1,
volume = 17,
place = {United States},
year = {2016},
month = {May}
}