OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]

Abstract

Background: The World Wide Web has emerged as a powerful data source for epidemiological studies related to infectious disease surveillance. However, its potential for cancer-related epidemiological discoveries is largely unexplored. Methods: Using advanced web crawling and tailored information extraction procedures, we automatically collected and analyzed the text content of 79,394 online obituary articles published between 1998 and 2014. The collected data included 51,911 cancer (27,330 breast; 9,470 lung; 6,496 pancreatic; 6,342 ovarian; 2,273 colon) and 27,483 non-cancer cases. With the derived information, we replicated a case-control study design to investigate the association between parity and cancer risk. Age-adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated for each cancer type and compared to those reported in large-scale epidemiological studies. Results: Parity was found to be associated with a significantly reduced risk of breast cancer (OR=0.78, 95% CI = 0.75 to 0.82), pancreatic cancer (OR=0.78, 95% CI = 0.72 to 0.83), colon cancer (OR=0.67, 95% CI = 0.60 to 0.74), and ovarian cancer (OR=0.58, 95% CI = 0.54 to 0.62). Marginal association was found for lung cancer prevalence (OR=0.87, 95% CI = 0.81 to 0.92). The linear trend between multi-parity and reduced cancer risk was dramatically more pronounced for breast and ovarian cancer than the other cancers included in the analysis. Conclusion: This large web-mining study on parity and cancer risk produced findings very similar to those reported with traditional observational studies. It may be used as a promising strategy to generate study hypotheses for guiding and prioritizing future epidemiological studies.
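
For readers unfamiliar with the statistic reported in the abstract, the following is a minimal sketch of how an odds ratio and a Wald-type 95% confidence interval can be computed from a case-control 2x2 table, together with a Mantel-Haenszel pooled odds ratio as one common way to adjust for a stratifying variable such as age. It is not the authors' code: the abstract does not state which adjustment method was used or give the underlying cell counts, so the function names and all counts below are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over strata (e.g. age bands).

    strata: iterable of (a, b, c, d) tuples, one 2x2 table per stratum.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts, NOT taken from the study:
# parous cases, nulliparous cases, parous controls, nulliparous controls.
crude, lo, hi = odds_ratio_wald(20, 80, 40, 60)

# Hypothetical age-stratified tables for the pooled estimate.
age_strata = [(10, 30, 25, 35), (10, 50, 15, 25)]
pooled = mantel_haenszel_or(age_strata)
print(f"crude OR={crude:.3f} (95% CI {lo:.3f} to {hi:.3f}), MH OR={pooled:.3f}")
```

An odds ratio below 1 with a confidence interval that excludes 1, as reported for breast, pancreatic, colon, and ovarian cancer, indicates lower odds of the outcome among parous women than among nulliparous women in the sample.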

Authors:
Tourassi, Georgia [1]; Yoon, Hong-Jun [1]; Xu, Songhua [2]; Han, Xuesong [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. New Jersey Inst. of Technology, Newark, NJ (United States)
  3. American Cancer Society, Atlanta, GA (United States)
Publication Date:
2015-11-27
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1236580
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of the American Medical Informatics Association
Additional Journal Information:
Journal Volume: 23; Journal Issue: 3; Journal ID: ISSN 1067-5027
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; web mining; cancer; epidemiology

Citation Formats

Tourassi, Georgia, Yoon, Hong-Jun, Xu, Songhua, and Han, Xuesong. The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]. United States: N. p., 2015. Web. doi:10.1093/jamia/ocv141.
Tourassi, Georgia, Yoon, Hong-Jun, Xu, Songhua, & Han, Xuesong. The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]. United States. https://doi.org/10.1093/jamia/ocv141
Tourassi, Georgia, Yoon, Hong-Jun, Xu, Songhua, and Han, Xuesong. 2015. "The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]". United States. https://doi.org/10.1093/jamia/ocv141. https://www.osti.gov/servlets/purl/1236580.
@article{osti_1236580,
title = {The utility of web mining for epidemiological research: studying the association between parity and cancer risk [Web Mining for Epidemiological Research. Assessing its Utility in Exploring the Association Between Parity and Cancer Risk]},
author = {Tourassi, Georgia and Yoon, Hong-Jun and Xu, Songhua and Han, Xuesong},
abstractNote = {Background: The World Wide Web has emerged as a powerful data source for epidemiological studies related to infectious disease surveillance. However, its potential for cancer-related epidemiological discoveries is largely unexplored. Methods: Using advanced web crawling and tailored information extraction procedures we automatically collected and analyzed the text content of 79,394 online obituary articles published between 1998 and 2014. The collected data included 51,911 cancer (27,330 breast; 9,470 lung; 6,496 pancreatic; 6,342 ovarian; 2,273 colon) and 27,483 non-cancer cases. With the derived information, we replicated a case-control study design to investigate the association between parity and cancer risk. Age-adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated for each cancer type and compared to those reported in large-scale epidemiological studies. Results: Parity was found to be associated with a significantly reduced risk of breast cancer (OR=0.78, 95% CI = 0.75 to 0.82), pancreatic cancer (OR=0.78, 95% CI = 0.72 to 0.83), colon cancer (OR=0.67, 95% CI = 0.60 to 0.74), and ovarian cancer (OR=0.58, 95% CI = 0.54 to 0.62). Marginal association was found for lung cancer prevalence (OR=0.87, 95% CI = 0.81 to 0.92). The linear trend between multi-parity and reduced cancer risk was dramatically more pronounced for breast and ovarian cancer than the other cancers included in the analysis. Conclusion: This large web-mining study on parity and cancer risk produced findings very similar to those reported with traditional observational studies. It may be used as a promising strategy to generate study hypotheses for guiding and prioritizing future epidemiological studies.},
doi = {10.1093/jamia/ocv141},
url = {https://www.osti.gov/biblio/1236580},
journal = {Journal of the American Medical Informatics Association},
issn = {1067-5027},
number = 3,
volume = 23,
place = {United States},
year = {2015},
month = {11}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 3 works
Citation information provided by
Web of Science

Works referenced in this record:

Digital Social Networks and Health
journal, April 2013


Infodemiology and Infoveillance
journal, May 2011


Scoping Review on Search Queries and Social Media for Disease Surveillance: A Chronology of Innovation
journal, January 2013


The Internet and the Global Monitoring of Emerging Diseases: Lessons from the First 10 Years of ProMED-mail
journal, November 2005


Information Technology and Global Surveillance of Cases of 2009 H1N1 Influenza
journal, May 2010


Medicine 2.0: Social Networking, Collaboration, Participation, Apomediation, and Openness
journal, January 2008


Accessing Suicide-Related Information on the Internet: A Retrospective Observational Study of Search Behavior
journal, January 2012


Web search behavior for multiple sclerosis: An infodemiological study
journal, July 2014


Health-Related Hot Topic Detection in Online Communities Using Text Clustering
journal, February 2013


Online Interventions for Social Marketing Health Behavior Change Campaigns: A Meta-Analysis of Psychological Architectures and Adherence Factors
journal, January 2011


A Novel Evaluation of World No Tobacco Day in Latin America
journal, January 2012


Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm
journal, April 2011


Patient-reported Outcomes as a Source of Evidence in Off-Label Prescribing: Analysis of Data From PatientsLikeMe
journal, January 2011


Understanding Topics and Sentiment in an Online Cancer Survivor Community
journal, December 2013


The process and effect of supportive message expression and reception in online breast cancer support groups
journal, March 2011


Parity and breast cancer risk: Possible effect on age at diagnosis
journal, January 1986


The independent associations of parity, age at first full term pregnancy, and duration of breastfeeding with the risk of breast cancer
journal, January 1989


Reproductive Factors and Breast Cancer
journal, January 1993


Parity, age at first and last birth, and risk of breast cancer: A population-based study in Sweden
journal, October 1996


Mammographic density, parity and age at first birth, and risk of breast cancer: an analysis of four case–control studies
journal, January 2012


Reproductive and Hormonal Factors in Association With Ovarian Cancer in the Netherlands Cohort Study
journal, September 2010


Hormonal Risk Factors for Ovarian Cancer in Premenopausal and Postmenopausal Women
journal, February 2008


Characteristics Relating to Ovarian Cancer Risk: Collaborative Analysis of 12 US Case-Control Studies
journal, November 1992


Reproductive factors in relation to ovarian cancer: a case–control study in Northern Vietnam
journal, November 2012


Reproductive factors for ovarian cancer in southern Chinese women
journal, January 2013


Menstrual and reproductive factors in relation to ovarian cancer risk
journal, January 2001


Ovarian Cancer Risk Factors in African-American and White Women
journal, July 2009


Reproductive Factors and Risk of Pancreatic Cancer in Women: A Review of the Literature
journal, February 2009


Parity and risk of lung cancer in women: Systematic review and meta-analysis of epidemiological studies
journal, May 2012


Reproductive factors and colon cancers
journal, May 1990


Reproductive History and Risk of Colorectal Cancer in Postmenopausal Women
journal, March 2011


The Relationship between Gravidity and Parity and Colorectal Cancer Risk
journal, July 2009


Reproductive Factors, Oral Contraceptive Use, and Risk of Colorectal Cancer
journal, January 1997


Parity and Risk of Colorectal Cancer: A Dose-Response Meta-Analysis of Prospective Studies
journal, September 2013


A user-oriented web crawler for selectively acquiring online content in e-health research
journal, September 2013


The Stanford CoreNLP Natural Language Processing Toolkit
conference, January 2014

  • Manning, Christopher; Surdeanu, Mihai; Bauer, John
  • Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations
  • https://doi.org/10.3115/v1/P14-5010

Cancer statistics, 2015
journal, January 2015


Big Data and Large Sample Size: A Cautionary Note on the Potential for Bias
journal, July 2014


News from the NIH: leveraging big data in the behavioral sciences
journal, June 2014


Sugar, meat, and fat intake, and non-dietary risk factors for colon cancer incidence in Iowa women (United States)
journal, January 1994


Lifestyle, Occupational, and Reproductive Factors and Risk of Colorectal Cancer
journal, January 2010


Childbearing, oral contraceptive use, and breast cancer
journal, April 1993


Social Media and Clinical Care: Ethical, Professional, and Social Implications
journal, April 2013


Works referencing / citing this record:

Digital Epidemiology: Use of Digital Data Collected for Non-epidemiological Purposes in Epidemiological Studies
journal, January 2018