A novel procedure on next generation sequencing data analysis using text mining algorithm
- U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research; Xiangtan Univ. Xiangtan (China). College of Information Engineering
- U.S. Food and Drug Administration (FDA), Jefferson, AR (United States). Division of Bioinformatics and Biostatistics, National Center for Toxicological Research
Background: Next-generation sequencing (NGS) technologies have opened up vast possibilities across biological and biomedical research. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies on the large amounts of data that NGS projects generate. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining.
Methods: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major steps: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. An NGS data set of Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure.
Results: The topics output by the LDA algorithm could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. Two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data.
Conclusion: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
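As a rough illustration of the workflow the abstract outlines, the sketch below treats each sequence as a "document" of overlapping k-mer "words" and fits LDA with a small collapsed Gibbs sampler. It is a minimal sketch under stated assumptions, not the authors' implementation: the k-mer tokenization, the function names `kmer_words` and `lda_gibbs`, the hyperparameters, and the toy sequences are all hypothetical and do not come from the record.

```python
import random


def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers, used here as 'words' (assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the document-topic matrix."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []                                       # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: one feature vector per sequence.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]


# Hypothetical toy sequences for illustration only; these are not real fliC data.
toy_sequences = [
    "ACGTACGTACGTACGT",
    "ACGTACGAACGTACGT",
    "TTGGCCAATTGGCCAA",
    "TTGGCCTATTGGCCAA",
]
docs = [kmer_words(seq, k=3) for seq in toy_sequences]
theta = lda_gibbs(docs, n_topics=2)
for seq, row in zip(toy_sequences, theta):
    print(seq, [round(p, 3) for p in row])
```

Each row of `theta` is a topic-proportion vector for one sequence. Rows like these could then serve as the feature matrix for hierarchical clustering, and the number of topics could be chosen by comparing perplexity across candidate values, as the abstract describes.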
- Research Organization:
- Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- SC0014664
- OSTI ID:
- 1626761
- Journal Information:
- BMC Bioinformatics, Vol. 17, Issue 1; ISSN 1471-2105
- Publisher:
- BioMed Central
- Country of Publication:
- United States
- Language:
- English