Epi Archive: automated data collection of notifiable disease data

Generous, Nicholas; Fairchild, Geoffrey; Khalsa, Hari; Tasseff, Byron; Arnold, James

doi:10.5210/ojphi.v9i1.7615

Abstract

LANL has built a software program that automatically collects global notifiable disease data, particularly data stored in files, and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will improve the prediction and early warning of disease events and other applications. Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health, as exemplified by the Biosurveillance Ecosystem (BSVE). While most nations likely do store their data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues. For example, an attempt by LANL to obtain a weekly version of openly available monthly data, reported by the Australian government, resulted in an onerous bureaucratic reply. The obstacles to obtaining data included paperwork to request data from each of the Australian states and territories, a long delay to obtain data (up to 3 months), and extensive limitations on the data’s use that prohibit collaboration and sharing.

Authors:

Generous, Nicholas [1]; Fairchild, Geoffrey [1]; Khalsa, Hari [1]; Tasseff, Byron [1]; Arnold, James [1]

[1] Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Publication Date: 2017-05-02

Research Org.: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Org.: USDOE Office of Science (SC); Defense Threat Reduction Agency (DTRA)

OSTI Identifier: 1629251

Grant/Contract Number: AC52-06NA25396

Resource Type: Accepted Manuscript

Journal Name: Online Journal of Public Health Informatics

Additional Journal Information: Journal Volume: 9; Journal Issue: 1; Journal ID: ISSN 1947-2579

Publisher: University of Illinois at Chicago

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; notifiable disease; data source; standards; scraping; data sharing

Citation Formats


Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, and Arnold, James. Epi Archive: automated data collection of notifiable disease data. United States: N. p., 2017. Web. doi:10.5210/ojphi.v9i1.7615.

Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, & Arnold, James. Epi Archive: automated data collection of notifiable disease data. United States. https://doi.org/10.5210/ojphi.v9i1.7615

Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, and Arnold, James. 2017. "Epi Archive: automated data collection of notifiable disease data". United States. https://doi.org/10.5210/ojphi.v9i1.7615. https://www.osti.gov/servlets/purl/1629251.

@article{osti_1629251,

  title        = {Epi Archive: automated data collection of notifiable disease data},

  author       = {Generous, Nicholas and Fairchild, Geoffrey and Khalsa, Hari and Tasseff, Byron and Arnold, James},

  abstractNote = {LANL has built a software program that automatically collects global notifiable disease data, particularly data stored in files, and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will improve the prediction and early warning of disease events and other applications. Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health, as exemplified by the Biosurveillance Ecosystem (BSVE). While most nations likely do store their data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues. For example, an attempt by LANL to obtain a weekly version of openly available monthly data, reported by the Australian government, resulted in an onerous bureaucratic reply. The obstacles to obtaining data included paperwork to request data from each of the Australian states and territories, a long delay to obtain data (up to 3 months), and extensive limitations on the data’s use that prohibit collaboration and sharing. This type of experience when attempting to contact public health departments or ministries of health for data is not uncommon. A survey conducted by LANL of notifiable disease data reporting in 52 countries identified only 10 as reporting in a machine-readable format and 42 as reporting in PDF files on a regular basis. Of the 42 nations that report in PDF files, 32 report in a structured, tabular format and 10 in a non-structured way.
As a result, LANL has developed a tool, Epi Archive (formerly known as EPIC), to automatically and continuously collect global notifiable disease data and make it readily accessible. We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported in two important dimensions: date standards and case definitions. The development of software that regularly ingests notifiable disease data and makes this data available involved four main steps: scraping, extracting, parsing and persisting. For scraping: we examined website designs and determined the reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then designed and wrote code to automate the downloading of report PDF files for each country. We stored report PDFs along with appropriate metadata for extracting and parsing. For extracting: we developed software that can extract notifiable disease data presented in tabular form from a PDF file. We combined the methodology of figure placement detection with in-house table extraction and annotation heuristics. For parsing: we determined what to extract from each PDF dataset based on the survey conducted. We then parsed the extracted data into uniform data structures that correctly accommodate the dimensions surveyed and the various human languages. This task involved ingesting notifiable disease data in many disparate formats extracted from PDF files and coalescing the data into a standardized format. For persisting: we store the data in the Epi Archive PostgreSQL database and make it available through the BSVE. The Epi Archive tool currently contains subnational notifiable disease data from 10 nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, region, disease, and date duration. These fields allow the user to specify the location (down to the state level), the disease of interest, and the duration of interest.
Upon form submission, a time series is generated from the user’s specifications. The generated time series can then be downloaded as a CSV file if the user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through an API. LANL is developing Epi Archive as part of a currently funded DTRA effort so that it will automatically and continuously collect global notifiable disease data, particularly data stored in PDF files, and make it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will provide data to analytics and users that will improve the prediction and early warning of disease events and other applications.},

  doi          = {10.5210/ojphi.v9i1.7615},

  journal      = {Online Journal of Public Health Informatics},

  number       = 1,

  volume       = 9,

  place        = {United States},

  year         = {2017},

  month        = may

}
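
The abstract carried in the BibTeX record above names four steps (scraping, extracting, parsing, persisting) and a query form with four fields (country, region, disease, date duration) that yields a time series downloadable as CSV. The Python below is a minimal sketch of the parsing, persisting and query steps only, and is not the authors' implementation. Several things are assumptions made for illustration: the extracted-row layout (region, local disease label, ISO start date, count as text), the `CaseCount` record, the `case_counts` table, and all sample values, which are made up. SQLite from the standard library stands in for the PostgreSQL database the abstract mentions.

```python
import csv
import io
import sqlite3
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseCount:
    """One uniform record: a disease count for a region and reporting period."""
    country: str
    region: str
    disease: str
    period_start: date
    cases: int


def parse_row(country, row, disease_names):
    """Parse one extracted table row (hypothetical layout) into a CaseCount.

    row is (region, local disease label, ISO start date, count as text).
    disease_names maps lower-cased local-language labels to a common name.
    """
    region, label, start, count = row
    label = label.strip()
    return CaseCount(
        country=country,
        region=region.strip(),
        disease=disease_names.get(label.lower(), label),
        period_start=date.fromisoformat(start),
        cases=int(count.replace(",", "")),  # drop thousands separators
    )


def persist(conn, records):
    """Store records in one table (SQLite here, standing in for PostgreSQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS case_counts ("
        "country TEXT, region TEXT, disease TEXT, period_start TEXT, cases INTEGER)"
    )
    conn.executemany(
        "INSERT INTO case_counts VALUES (?, ?, ?, ?, ?)",
        [(r.country, r.region, r.disease, r.period_start.isoformat(), r.cases)
         for r in records],
    )
    conn.commit()


def time_series(conn, country, region, disease, start, end):
    """Return (period_start, cases) pairs selected by the four form fields."""
    cur = conn.execute(
        "SELECT period_start, cases FROM case_counts "
        "WHERE country = ? AND region = ? AND disease = ? "
        "AND period_start BETWEEN ? AND ? ORDER BY period_start",
        (country, region, disease, start.isoformat(), end.isoformat()),
    )
    return cur.fetchall()


def to_csv(series):
    """Serialize a time series to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["period_start", "cases"])
    writer.writerows(series)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up rows for a fictional country; not real surveillance data.
    names = {"sarampión": "measles"}
    rows = [("Region A", "Sarampión", "2017-01-02", "1,204"),
            ("Region A", "Sarampión", "2017-01-09", "12")]
    conn = sqlite3.connect(":memory:")
    persist(conn, [parse_row("Exampleland", r, names) for r in rows])
    series = time_series(conn, "Exampleland", "Region A", "measles",
                         date(2017, 1, 1), date(2017, 1, 31))
    print(to_csv(series))
```

Storing dates as ISO 8601 strings keeps the range filter a plain string comparison in this sketch; a PostgreSQL schema would more likely use a native date column.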

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.5210/ojphi.v9i1.7615

Works referenced in this record:

A systematic review of barriers to data sharing in public health
journal, November 2014

van Panhuis, Willem G.; Paul, Proma; Emerson, Claudia
BMC Public Health, Vol. 14, Issue 1
DOI: 10.1186/1471-2458-14-1144

Similar Records in DOE PAGES and OSTI.GOV collections:

Epi Archive: Automated Synthesis of Global Notifiable Disease Data

Journal Article: Khalsa, Hari S.; Cordova, Sergio; Generous, Nicholas; ... - Online Journal of Public Health Informatics

Objective: LANL has built software that automatically collects global notifiable disease data, synthesizes the data, and makes it available to humans and computers within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications including improving the prediction and early warning of disease events. Introduction: Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data intensive approaches to epidemiology, biosurveillance and public health. While most ...
https://doi.org/10.5210/ojphi.v10i1.8323

Soda Pop: A Time-Series Clustering, Alarming and Disease Forecasting Application

Journal Article: Rounds, Jeremiah; Charles-Smith, Lauren; Corley, Courtney D. - Online Journal of Public Health Informatics

To introduce Soda Pop, an R/Shiny application designed to be a disease-agnostic time-series clustering, alarming, and forecasting tool to assist in disease surveillance “triage, analysis and reporting” workflows within the Biosurveillance Ecosystem (BSVE). In this poster, we highlight the new capabilities that are brought to the BSVE by Soda Pop with an emphasis on the impact of methodological decisions. The Biosurveillance Ecosystem (BSVE) is a biological and chemical threat surveillance system sponsored by the Defense Threat Reduction Agency (DTRA). BSVE is intended to be a user-friendly, multi-agency, cooperative, modular and threat-agnostic platform for biosurveillance. In BSVE, a web-based workbench ...
https://doi.org/10.5210/ojphi.v9i1.7582

NBIC Biofeeds: Deploying a New, Digital Tool for Open Source Biosurveillance across Federal Agencies

Journal Article: Baker, Heather; Grady, Asher; Schwantes, Collin; ... - Online Journal of Public Health Informatics

The National Biosurveillance Integration Center (NBIC) is deploying a scalable, flexible open source data collection, analysis, and dissemination tool to support biosurveillance operations by the U.S. Department of Homeland Security (DHS) and its federal interagency partners. Introduction: NBIC integrates, analyzes, and distributes key information about health and disease events to help ensure the nation’s responses are well-informed, save lives, and minimize economic impact. To meet its mission objectives, NBIC utilizes a variety of data sets, including open source information, to provide comprehensive coverage of biological events occurring across the globe. NBIC Biofeeds is a digital tool designed to improve the efficiency of ...
https://doi.org/10.5210/ojphi.v10i1.8947

Procedure Parsing: A Method for Parsing Handwritten Documents into Computer-Based Procedures

Conference: Whitmore, Stacey Ray

The nuclear industry is heavily procedure driven, where almost everything has a step-by-step instruction that is expected to be followed in detail. Historically, these procedures were printed on paper copies. Recently, the industry transitioned towards electronic copies (i.e., PDFs on tablets). One major drive for this transition is the introduction of human error and loss of situation awareness when using paper copies. However, electronic copies of documents inherently have the same error traps as their paper cousins. Therefore, there is an increased interest in a way to utilize the information in the step-by-step guidance, but to present it in a ...
https://doi.org/10.54941/ahfe1002518

NBIC Biofeeds: A Digital Tool for Open Source Biosurveillance across Federal Agencies

Journal Article: Baker, Heather; Lesniak, Chandra; Iarocci, Emily; ... - Online Journal of Public Health Informatics

The National Biosurveillance Integration Center (NBIC) is developing a scalable, flexible open source data collection, analysis, and dissemination tool to support biosurveillance operations by the U.S. Department of Homeland Security (DHS) and its federal interagency partners. The NBIC integrates, analyzes, and distributes key information about health and disease events to help ensure the nation’s responses are well-informed, save lives, and minimize economic impact. NBIC serves as a bridge between Federal, State, Local, Territorial, and Tribal entities to conduct biosurveillance across human, animal, plant, and environmental domains. The integration of information enables early warning and shared situational awareness of biological events ...
https://doi.org/10.5210/ojphi.v9i1.7642
