U.S. Department of Energy
Office of Scientific and Technical Information

Tracking topic birth and death in LDA.

Technical Report · DOI: https://doi.org/10.2172/1029827 · OSTI ID: 1029827
Most topic modeling algorithms that address the evolution of documents over time use the same number of topics at all times. This obscures a common pattern in such data: new subjects arise while old ones diminish or disappear entirely. We propose an algorithm to model the birth and death of topics within an LDA-like framework. The user selects an initial number of topics, after which new topics are created and retired without further supervision. Our approach also accommodates many of the acceleration and parallelization schemes developed in recent years for standard LDA.

In recent years, topic modeling algorithms such as latent semantic analysis (LSA) [17], latent Dirichlet allocation (LDA) [10], and their descendants have offered a powerful way to explore and interrogate corpora far too large for any human to grasp without assistance. Using such algorithms, we can search for similar documents, model and track the volume of topics over time, search for correlated topics, or arrange topics in a hierarchy. Most of these algorithms are intended for static corpora, where the number of documents and the size of the vocabulary are known in advance. Moreover, almost all current topic modeling algorithms take the number of topics as an input parameter and keep it fixed across the entire corpus. While this is appropriate for static corpora, it becomes a serious handicap when analyzing time-varying data sets, where topics come and go as a matter of course. This is doubly true for online algorithms, which may not have the option of revising earlier results in light of new data. To be sure, these algorithms will account for changing data one way or another, but without the ability to adapt to structural changes such as entirely new topics, they may do so in counterintuitive ways.
Research Organization: Sandia National Laboratories
Sponsoring Organization: USDOE
DOE Contract Number: AC04-94AL85000
OSTI ID: 1029827
Report Number(s): SAND2011-6927
Country of Publication: United States
Language: English

Similar Records

Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
Journal Article · Sun May 07 20:00:00 EDT 2006 · BMC Bioinformatics · OSTI ID:1626320

Continuous Time Group Discovery in Dynamic Graphs
Conference · Thu Nov 04 00:00:00 EDT 2010 · OSTI ID:1016298

A novel procedure on next generation sequencing data analysis using text mining algorithm
Journal Article · Thu May 12 20:00:00 EDT 2016 · BMC Bioinformatics · OSTI ID:1626761