The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- New Jersey Inst. of Technology, Newark, NJ (United States)
- American Cancer Society, Atlanta, GA (United States)
Background: The World Wide Web has emerged as a powerful data source for epidemiological studies related to infectious disease surveillance. However, its potential for cancer-related epidemiological discoveries is largely unexplored. Methods: Using advanced web crawling and tailored information extraction procedures, we automatically collected and analyzed the text content of 79,394 online obituary articles published between 1998 and 2014. The collected data included 51,911 cancer cases (27,330 breast; 9,470 lung; 6,496 pancreatic; 6,342 ovarian; 2,273 colon) and 27,483 non-cancer cases. With the derived information, we replicated a case-control study design to investigate the association between parity and cancer risk. Age-adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated for each cancer type and compared to those reported in large-scale epidemiological studies. Results: Parity was found to be associated with a significantly reduced risk of breast cancer (OR = 0.78, 95% CI = 0.75 to 0.82), pancreatic cancer (OR = 0.78, 95% CI = 0.72 to 0.83), colon cancer (OR = 0.67, 95% CI = 0.60 to 0.74), and ovarian cancer (OR = 0.58, 95% CI = 0.54 to 0.62). A marginal association was found for lung cancer prevalence (OR = 0.87, 95% CI = 0.81 to 0.92). The linear trend between multi-parity and reduced cancer risk was markedly more pronounced for breast and ovarian cancer than for the other cancers included in the analysis. Conclusion: This large web-mining study on parity and cancer risk produced findings very similar to those reported in traditional observational studies. Web mining may be a promising strategy for generating hypotheses to guide and prioritize future epidemiological studies.
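For readers unfamiliar with the statistic reported above, the following is a minimal sketch of how an odds ratio and its 95% confidence interval can be computed from a 2x2 case-control table. It shows the unadjusted (crude) estimate with a Wald interval on the log scale, not the age-adjusted procedure used in the study, whose details are not given in this record. The function name `odds_ratio_ci` and the counts in the usage example are hypothetical and are not taken from the study data.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    Hypothetical helper for illustration only. Here "exposed" would mean
    parous and "unexposed" nulliparous; z=1.96 gives an approximate 95% CI.
    All four counts must be positive.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")

    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the function.
    or_, lo, hi = odds_ratio_ci(300, 200, 400, 150)
    print(f"OR = {or_:.2f}, 95% CI = {lo:.2f} to {hi:.2f}")
```

An interval that lies entirely below 1, as for the four cancers listed in the Results, indicates a statistically significant inverse association at the 5% level. An age-adjusted estimate would additionally require stratifying by age or fitting a regression model with age as a covariate.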
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1236580
- Journal Information:
- Journal of the American Medical Informatics Association, Vol. 23, Issue 3; ISSN 1067-5027
- Publisher:
- Oxford University Press
- Country of Publication:
- United States
- Language:
- English
Digital Epidemiology: Use of Digital Data Collected for Non-epidemiological Purposes in Epidemiological Studies (journal, January 2018)
Similar Records
A novel web informatics approach for automated surveillance of cancer mortality trends
Risk of leukemia associated with the first course of cancer treatment: an analysis of the Surveillance, Epidemiology, and End Results Program experience