Title: Investigating Sociodemographic Disparities in Cancer Risk Using Web-Based Informatics

Cancer health disparities due to demographic and socioeconomic factors are an area of great interest in the epidemiological community. Adjusting for such factors is important when developing cancer risk models. However, for digital epidemiology studies that rely on online sources, such information is not readily available. This paper presents a novel method for extracting demographic and socioeconomic information from openly available online obituaries. The method relies on tailored language processing rules and a probabilistic scheme to map subjects’ occupation history to the occupation classification codes and related earnings provided by the U.S. Census Bureau. Using this information, a case-control study is executed fully in silico to investigate how age, gender, parity, and income level impact breast and lung cancer risk. Based on 48,368 automatically collected online obituaries (4,643 for breast cancer, 6,274 for lung cancer, and 37,451 cancer-free) and a generalized cancer risk model, our study shows that age, parity, and socioeconomic status are strongly associated with cancer risk. The observed trends for breast cancer are consistent with traditional epidemiological studies, whereas some inconsistency is observed for lung cancer with respect to socioeconomic status.
 [1] ;  [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Biomedical Sciences, Engineering, and Computing Group. Health Data Sciences Inst.
Publication Date:
Grant/Contract Number:
AC05-00OR22725; 1R01-CA170508-04
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Human Performance in Extreme Environments
Additional Journal Information:
Journal Volume: 14; Journal Issue: 1; Journal ID: ISSN 2327-2937
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
USDOE Office of Science (SC); National Inst. of Health (NIH) (United States)
Country of Publication:
United States
Subject:
60 APPLIED LIFE SCIENCES; digital epidemiology; natural language processing; case-control study; generalized linear model; obituary; cancer mortality; breast cancer; lung cancer
OSTI Identifier: