Title: Global disease monitoring and forecasting with Wikipedia

Journal Article · PLoS Computational Biology (Online)

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
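
The approach summarized in the abstract (a linear model relating Wikipedia article traffic to official incidence, evaluated at forecast offsets of up to 28 days) can be illustrated with a minimal sketch. This is not the authors' code: the function name `fit_lagged_linear`, the synthetic series, and the 7-day offset are assumptions chosen for illustration, and the study's article selection procedure and real data are not reproduced.

```python
import numpy as np


def fit_lagged_linear(views, incidence, lag):
    """Least-squares fit of incidence[t + lag] on views[t].

    views     : daily article request counts, shape (T,) or (T, k)
    incidence : official case counts, shape (T,)
    lag       : forecast offset in days (0 = nowcast)
    Returns (coefficients including intercept, r squared).
    """
    views = np.asarray(views, dtype=float)
    incidence = np.asarray(incidence, dtype=float)
    if views.ndim == 1:
        views = views[:, None]
    n = len(incidence) - lag
    # Design matrix: intercept column plus traffic observed `lag` days earlier.
    X = np.column_stack([np.ones(n), views[:n]])
    y = incidence[lag:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


# Synthetic data (an assumption, not real Wikipedia logs): incidence follows
# article traffic with a 7-day delay.
rng = np.random.default_rng(0)
views = rng.random(100)
incidence = np.full(100, 2.0)
incidence[7:] = 2.0 + 3.0 * views[:93]

coef_7, r2_7 = fit_lagged_linear(views, incidence, lag=7)
coef_0, r2_0 = fit_lagged_linear(views, incidence, lag=0)
```

In this toy setup the fit at the true 7-day offset recovers the generating coefficients, while the fit at offset 0 explains far less variance; comparing r² across offsets is one simple way to gauge forecasting value.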

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC52-06NA25396
OSTI ID:
1214710
Journal Information:
PLoS Computational Biology (Online), Vol. 10, Issue 11; ISSN 1553-7358
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 95 works
Citation information provided by Web of Science
