OSTI.GOV U.S. Department of Energy
Office of Scientific and Technical Information

Title: Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study

Journal Article · JMIR Public Health and Surveillance
DOI: https://doi.org/10.2196/14986 · OSTI ID: 1716788

Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored, because assessing them requires ground truth data. This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. It leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals to assess the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Of 396 original study participants, only 81 met the inclusion criteria for this study. Among these participants’ tweets, we identified only two that were related to health and were posted within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were generally not predictable either. The study sample and a random sample from Twitter were predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral study sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). To our knowledge, this is the first attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance work using internet-sourced data.
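
The AUC and P values reported above can be read as follows: an AUC of 0.50 means the classifier's scores rank a symptomatic case above a non-symptomatic one no more often than chance, and the P value indicates how often a result at least that good would arise if the labels carried no information. The sketch below illustrates that reading with the Python standard library only. It is not the authors' analysis: the abstract does not say how the P values were computed, so the label-permutation test here is a swapped-in assumption, and the function names `auc` and `permutation_p_value`, along with all the data, are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Returns the fraction of positive/negative pairs in which the positive
    example has the higher score, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """One-sided P value: share of label shufflings with AUC >= observed.

    Assumption for illustration only; the source does not state which
    significance test was used.
    """
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_permutations + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical data: 1 = survey reported symptoms, 0 = no symptoms.
    labels = [1] * 50 + [0] * 50
    # Scores unrelated to the labels, mimicking an uninformative feature set.
    noise_scores = [rng.random() for _ in labels]
    print("AUC:", round(auc(noise_scores, labels), 3))
    print("P:", round(permutation_p_value(noise_scores, labels), 3))
```

With uninformative scores the AUC should land near 0.5 and the P value should usually be large, which is the pattern the abstract reports for the topic model features. A result such as AUC=0.53 with P≤.001 shows the other case: a difference from chance that is statistically detectable but too small to be useful for prediction.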

Research Organization:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE; National Science Foundation (NSF)
Grant/Contract Number:
89233218CNA000001; IIS-1643576; IIS-1551036
OSTI ID:
1716788
Report Number(s):
LA-UR-19-21141
Journal Information:
JMIR Public Health and Surveillance, Vol. 6, Issue 2; ISSN 2369-2960
Publisher:
JMIR Publications
Country of Publication:
United States
Language:
English
