Machine Learning for Identifying Relevance to Biosurveillance in Multilingual Text
- Tulane Univ., New Orleans, LA (United States)
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Objective: To develop an ensemble of machine learning algorithms that identifies multilingual online articles relevant to biosurveillance. Morphology varies widely across languages and must be accounted for when designing algorithms. Here, we compare the performance of a word embedding-based approach and a topic modeling approach, combined with machine learning algorithms, to determine the best method for Chinese, Arabic, and French.

Introduction: Global biosurveillance is an important yet challenging task. One form of global biosurveillance relies on harvesting open source online data (e.g., news, blogs, reports, RSS feeds). The information derived from this data can be used for timely detection and identification of biological threats all over the world. However, the more inclusive the harvesting procedure is, so that all potentially relevant articles are collected, the more irrelevant data is harvested as well. The problem becomes more complex when the online data is in a foreign language. Foreign language articles not only raise language-specific issues for Natural Language Processing (NLP) but also add significant translation costs. Previous work shows success in the use of combinatory monolingual classifiers in specific applications, e.g., the legal domain [1]. A critical component of a comprehensive online biosurveillance harvesting system is the capability to distinguish relevant foreign language articles from irrelevant ones based on the initial article information collected, without the additional cost of full text retrieval and translation.
- Research Organization:
- Pacific Northwest National Lab, Richland, WA (United States)
- Sponsoring Organization:
- USDOE; Department of Homeland Security Science and Technology Directorate
- Grant/Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1629052
- Journal Information:
- Online Journal of Public Health Informatics, Vol. 10, Issue 1; ISSN 1947-2579
- Publisher:
- University of Illinois at Chicago
- Country of Publication:
- United States
- Language:
- English
Similar Records
Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information
NBIC Biofeeds: Deploying a New, Digital Tool for Open Source Biosurveillance across Federal Agencies