DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Epi Archive: automated data collection of notifiable disease data

Abstract

LANL has built a software program that automatically collects global notifiable disease data (particularly data stored in files) and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will improve the prediction and early warning of disease events and other applications. Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health, as exemplified by the BSVE. While most nations likely do store their data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues. For example, an attempt by LANL to obtain a weekly version of openly available monthly data reported by the Australian government resulted in an onerous bureaucratic reply. The obstacles to obtaining data included paperwork to request data from each of the Australian states and territories, a long delay to obtain data (up to 3 months), and extensive limitations on the data's use that prohibit collaboration and sharing. This type of experience when attempting to contact public health departments or ministries of health for data is not uncommon. A survey conducted by LANL of notifiable disease data reporting in 52 countries identified only 10 as being machine-readable and 42 as being reported in pdf files on a regular basis. Of the 42 nations that report in pdf files, 32 report in a structured, tabular format and 10 in a non-structured way. As a result, LANL has developed a tool, Epi Archive (formerly known as EPIC), to automatically and continuously collect global notifiable disease data and make it readily accessible.
We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported in two important dimensions: date standards and case definitions. The development of software that regularly ingests notifiable disease data and makes this data available involved four main steps: scraping, extracting, parsing and persisting. For scraping, we examined website designs and determined the reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then designed and wrote code to automate the downloading of report pdf files for each country. We stored report pdfs along with appropriate metadata for extracting and parsing. For extracting, we developed software that can extract notifiable disease data presented in tabular form from a pdf file. We combined the methodology of figure placement detection with in-house table extraction and annotation heuristics. For parsing, we determined what to extract from each pdf dataset based on the survey conducted. We then parsed the extracted data into uniform data structures that correctly accommodate the dimensions surveyed and the various human languages. This task involved ingesting notifiable disease data in many disparate formats extracted from pdf files and coalescing the data into a standardized format. For persisting, we store the data in the Epi Archive PostgreSQL database and make it available through the BSVE. The Epi Archive tool currently contains subnational notifiable disease data from 10 nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, region, disease, and date duration. These fields allow the user to specify the location (down to the state level), the disease of interest, and the duration of interest. Upon form submission, a time series is generated from the user's specifications.
The generated time series can then be downloaded as a csv file if a user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through an API. LANL developed Epi Archive as part of a currently funded DTRA effort so that it will automatically and continuously collect global notifiable disease data (particularly data stored in pdf files) and make it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will provide data to analytics and users that will improve the prediction and early warning of disease events and other applications.
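
The parsing and query steps described in the abstract can be illustrated with a minimal Python sketch. It is not the Epi Archive code: the record fields, the function names (`parse_row`, `time_series`, `to_csv`), the input row layout, and the sample values are all hypothetical, chosen only to show how extracted table rows might be coalesced into one standardized structure and then filtered by the four form fields (country, region, disease, date range) into a time series that can be written out as csv.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseRecord:
    """One standardized row. Field names are hypothetical, not the real schema."""
    country: str
    region: str
    disease: str
    period_start: date
    cases: int


def parse_row(country, raw):
    """Coalesce one extracted table row (a dict of strings) into a CaseRecord."""
    return CaseRecord(
        country=country,
        region=raw["region"].strip(),
        disease=raw["disease"].strip().lower(),
        period_start=date.fromisoformat(raw["period_start"]),
        cases=int(raw["cases"].replace(",", "")),  # drop thousands separators
    )


def time_series(records, country, region, disease, start, end):
    """Filter by the four form fields; return (date, cases) pairs in date order."""
    selected = [
        r for r in records
        if r.country == country
        and r.region == region
        and r.disease == disease
        and start <= r.period_start <= end
    ]
    selected.sort(key=lambda r: r.period_start)
    return [(r.period_start, r.cases) for r in selected]


def to_csv(series):
    """Serialize a time series as csv text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["period_start", "cases"])
    for day, cases in series:
        writer.writerow([day.isoformat(), cases])
    return buf.getvalue()


# Made-up placeholder rows, standing in for text extracted from a report table.
raw_rows = [
    {"region": " North ", "disease": "Measles", "period_start": "2017-01-09", "cases": "1,204"},
    {"region": "North", "disease": "measles", "period_start": "2017-01-02", "cases": "7"},
    {"region": "South", "disease": "measles", "period_start": "2017-01-02", "cases": "3"},
]
records = [parse_row("Exampleland", row) for row in raw_rows]
series = time_series(
    records, "Exampleland", "North", "measles", date(2017, 1, 1), date(2017, 1, 31)
)
print(to_csv(series), end="")
```

In this sketch the normalization (trimming whitespace, lower-casing disease names, removing thousands separators) happens once at parse time, so the query step can rely on exact matches. A real pipeline would also need to handle the differing date standards, case definitions, and languages that the survey identified.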

Authors:
 Generous, Nicholas [1]; Fairchild, Geoffrey [1]; Khalsa, Hari [1]; Tasseff, Byron [1]; Arnold, James [1]
  1. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Publication Date:
2017-05-02
Research Org.:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC); Defense Threat Reduction Agency (DTRA)
OSTI Identifier:
1629251
Grant/Contract Number:  
AC52-06NA25396
Resource Type:
Accepted Manuscript
Journal Name:
Online Journal of Public Health Informatics
Additional Journal Information:
Journal Volume: 9; Journal Issue: 1; Journal ID: ISSN 1947-2579
Publisher:
University of Illinois at Chicago
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; notifiable disease; data source; standards; scraping; data sharing

Citation Formats

Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, and Arnold, James. Epi Archive: automated data collection of notifiable disease data. United States: N. p., 2017. Web. doi:10.5210/ojphi.v9i1.7615.
Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, & Arnold, James. Epi Archive: automated data collection of notifiable disease data. United States. https://doi.org/10.5210/ojphi.v9i1.7615
Generous, Nicholas, Fairchild, Geoffrey, Khalsa, Hari, Tasseff, Byron, and Arnold, James. 2017. "Epi Archive: automated data collection of notifiable disease data". United States. https://doi.org/10.5210/ojphi.v9i1.7615. https://www.osti.gov/servlets/purl/1629251.
@article{osti_1629251,
title = {Epi Archive: automated data collection of notifiable disease data},
author = {Generous, Nicholas and Fairchild, Geoffrey and Khalsa, Hari and Tasseff, Byron and Arnold, James},
doi = {10.5210/ojphi.v9i1.7615},
journal = {Online Journal of Public Health Informatics},
number = 1,
volume = 9,
place = {United States},
year = {2017},
month = {5}
}
