Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study
Abstract
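Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable either. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.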
- Authors:
- Daughton, Ashlynn R.; Chunara, Rumi; Paul, Michael J.
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- New York Univ. (NYU), NY (United States)
- Univ. of Colorado, Boulder, CO (United States)
- Publication Date:
- 2020-04-24
- Research Org.:
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- Sponsoring Org.:
- USDOE; National Science Foundation (NSF)
- OSTI Identifier:
- 1716788
- Report Number(s):
- LA-UR-19-21141; Journal ID: ISSN 2369-2960
- Grant/Contract Number:
- 89233218CNA000001; IIS-1643576; IIS-1551036
- Resource Type:
- Accepted Manuscript
- Journal Name:
- JMIR Public Health and Surveillance
- Additional Journal Information:
- Journal Volume: 6; Journal Issue: 2; Journal ID: ISSN 2369-2960
- Publisher:
- JMIR Publications
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; social media; infodemiology; influenza; selection bias; bias; logistic models
Citation Formats
Daughton, Ashlynn R., Chunara, Rumi, and Paul, Michael J. Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study. United States: N. p., 2020. Web. doi:10.2196/14986.
Daughton, Ashlynn R., Chunara, Rumi, & Paul, Michael J. Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study. United States. https://doi.org/10.2196/14986
Daughton, Ashlynn R., Chunara, Rumi, and Paul, Michael J. April 24, 2020. "Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study". United States. https://doi.org/10.2196/14986. https://www.osti.gov/servlets/purl/1716788.
@article{osti_1716788,
title = {Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study},
author = {Daughton, Ashlynn R. and Chunara, Rumi and Paul, Michael J.},
abstractNote = {Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable either. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19). To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. 
The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.},
doi = {10.2196/14986},
journal = {JMIR Public Health and Surveillance},
number = 2,
volume = 6,
place = {United States},
year = {2020},
month = {4}
}