DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study

Abstract

Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable either. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.

Authors:
Daughton, Ashlynn R. [1]; Chunara, Rumi [2]; Paul, Michael J. [3]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  2. New York Univ. (NYU), NY (United States)
  3. Univ. of Colorado, Boulder, CO (United States)
Publication Date:
2020-04-24
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE; National Science Foundation (NSF)
OSTI Identifier:
1716788
Report Number(s):
LA-UR-19-21141
Journal ID: ISSN 2369-2960
Grant/Contract Number:  
89233218CNA000001; IIS-1643576; IIS-1551036
Resource Type:
Accepted Manuscript
Journal Name:
JMIR Public Health and Surveillance
Additional Journal Information:
Journal Volume: 6; Journal Issue: 2; Journal ID: ISSN 2369-2960
Publisher:
JMIR Publications
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; social media; infodemiology; influenza; selection bias; bias; logistic models

Citation Formats

Daughton, Ashlynn R., Chunara, Rumi, and Paul, Michael J. Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study. United States: N. p., 2020. Web. doi:10.2196/14986.
Daughton, Ashlynn R., Chunara, Rumi, & Paul, Michael J. Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study. United States. https://doi.org/10.2196/14986
Daughton, Ashlynn R., Chunara, Rumi, and Paul, Michael J. 2020. "Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study". United States. https://doi.org/10.2196/14986. https://www.osti.gov/servlets/purl/1716788.
@article{osti_1716788,
title = {Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study},
author = {Daughton, Ashlynn R. and Chunara, Rumi and Paul, Michael J.},
abstractNote = {Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable either. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. 
The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.},
doi = {10.2196/14986},
journal = {JMIR Public Health and Surveillance},
number = 2,
volume = 6,
place = {United States},
year = {2020},
month = {4}
}
