U.S. Department of Energy
Office of Scientific and Technical Information

Machine Learning for Identifying Relevance to Biosurveillance in Multilingual Text

Journal Article · · Online Journal of Public Health Informatics
 [1];  [2]
  1. Tulane Univ., New Orleans, LA (United States)
  2. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Objective: The objective is to develop an ensemble of machine learning algorithms to identify multilingual online articles that are relevant to biosurveillance. Morphology varies widely across languages and must be accounted for when designing algorithms. Here, we compare the performance of a word embedding-based approach and a topic modeling approach with machine learning algorithms to determine the best method for Chinese, Arabic, and French.

Introduction: Global biosurveillance is an extremely important yet challenging task. One form of global biosurveillance is the harvesting of open source online data (e.g., news, blogs, reports, RSS feeds). The information derived from this data can be used for timely detection and identification of biological threats all over the world. However, the more inclusive the harvesting procedure is made to ensure that all potentially relevant articles are collected, the more irrelevant data is harvested as well. This issue becomes even more complex when the online data is in a non-native language. Foreign language articles not only create language-specific issues for Natural Language Processing (NLP) but also add significant translation costs. Previous work shows success in the use of combinatory monolingual classifiers in specific applications, e.g., the legal domain [1]. A critical component of a comprehensive online harvesting biosurveillance system is the capability to distinguish relevant foreign language articles from irrelevant ones based on the initial article information collected, without the additional cost of full text retrieval and translation.

Research Organization:
Pacific Northwest National Lab, Richland, WA (United States)
Sponsoring Organization:
USDOE; Department of Homeland Security Science and Technology Directorate
Grant/Contract Number:
AC05-76RL01830
OSTI ID:
1629052
Journal Information:
Online Journal of Public Health Informatics, Vol. 10, Issue 1; ISSN 1947-2579
Publisher:
University of Illinois at Chicago
Country of Publication:
United States
Language:
English

Similar Records

Text-based Analytics for Biosurveillance
Book · May 16, 2018 · OSTI ID:1440619

Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information
Conference · August 1, 2008 · OSTI ID:947254

NBIC Biofeeds: Deploying a New, Digital Tool for Open Source Biosurveillance across Federal Agencies
Journal Article · May 22, 2018 · Online Journal of Public Health Informatics · OSTI ID:1629191