DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports

Abstract

Pathology reports are a primary source of information for cancer registries, which process high volumes of free-text reports annually. Information extraction and coding is a manual, labor-intensive process. In this study, we investigated deep learning, specifically a convolutional neural network (CNN), for extracting ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports. We performed two experiments, using a CNN and a more conventional term frequency vector approach, to assess the effects of class prevalence and inter-class transfer learning. The experiments were based on a set of 942 pathology reports with human expert annotations as the gold standard. CNN performance was compared against a more conventional term frequency vector space approach. We observed that the deep learning models consistently outperformed the conventional approaches in the class prevalence experiment, resulting in micro- and macro-F score increases of up to 0.132 and 0.226, respectively, when class labels were well populated. Specifically, the best performing CNN achieved a micro-F score of 0.722 over 12 ICD-O-3 topography codes. Transfer learning provided a consistent but modest performance boost for the deep learning methods, but trends were contingent on CNN method and cancer site. These encouraging results demonstrate the potential of deep learning for automated abstraction of pathology reports.
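The abstract reports both micro- and macro-F scores, which diverge when class prevalence is uneven. The following is a minimal sketch of how the two averages can be computed for single-label predictions; it is not the authors' evaluation code. The function name `f_scores`, the toy labels, and the choice to average macro-F over every code seen in either the gold or predicted labels are assumptions made for illustration.

```python
from collections import Counter


def f_scores(gold, pred):
    """Return (micro_f, macro_f) for single-label multi-class predictions.

    Hypothetical helper: macro-F is averaged over every label that appears
    in either the gold or the predicted list (an assumption of this sketch).
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # predicted class p wrongly
            fn[g] += 1      # missed the true class g

    def f1(t, p_, n):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is empty
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    # Micro-F pools counts across classes; macro-F averages per-class scores
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Invented toy data using ICD-O-3 topography-style codes (not from the study)
gold = ["C50.9", "C50.9", "C50.9", "C50.9", "C34.1", "C34.3"]
pred = ["C50.9", "C50.9", "C50.9", "C50.9", "C34.1", "C34.1"]
micro, macro = f_scores(gold, pred)
print(f"micro-F = {micro:.3f}, macro-F = {macro:.3f}")
```

In this toy case the single error falls on a rare code, so macro-F drops well below micro-F; this is why both averages are informative when some class labels are sparsely populated.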

Authors:
Qiu, John [1]; Yoon, Hong-Jun [2]; Fearn, Paul A. [3]; Tourassi, Georgia D. [4]
  1. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Health Data Sciences Inst.
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computational Sciences and Engineering Division and the Health Data Sciences Inst., Biomedical Sciences, Engineering, and Computing Group
  3. National Cancer Inst., Bethesda, MD (United States). Surveillance Research Program
  4. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computational Sciences and Engineering Division and the Health Data Sciences Inst., Biomedical Sciences, Engineering, and Computing Group
Publication Date:
2017-05-03
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC); National Institutes of Health (NIH)
OSTI Identifier:
1408007
Grant/Contract Number:  
AC05-00OR22725; AC02-06CH11357; AC52-06NA25396; AC52-07NA27344
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Journal of Biomedical and Health Informatics
Additional Journal Information:
Journal Volume: 22; Journal Issue: 1; Journal ID: ISSN 2168-2194
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 59 BASIC BIOLOGICAL SCIENCES; Deep learning; convolutional neural network; natural language processing; information extraction; pathology reports; primary cancer site

Citation Formats

Qiu, John, Yoon, Hong-Jun, Fearn, Paul A., and Tourassi, Georgia D. Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports. United States: N. p., 2017. Web. doi:10.1109/JBHI.2017.2700722.
Qiu, John, Yoon, Hong-Jun, Fearn, Paul A., & Tourassi, Georgia D. Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports. United States. https://doi.org/10.1109/JBHI.2017.2700722
Qiu, John, Yoon, Hong-Jun, Fearn, Paul A., and Tourassi, Georgia D. 2017. "Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports". United States. https://doi.org/10.1109/JBHI.2017.2700722. https://www.osti.gov/servlets/purl/1408007.
@article{osti_1408007,
title = {Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports},
author = {Qiu, John and Yoon, Hong-Jun and Fearn, Paul A. and Tourassi, Georgia D.},
abstractNote = {Pathology reports are a primary source of information for cancer registries, which process high volumes of free-text reports annually. Information extraction and coding is a manual, labor-intensive process. In this study, we investigated deep learning, specifically a convolutional neural network (CNN), for extracting ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports. We performed two experiments, using a CNN and a more conventional term frequency vector approach, to assess the effects of class prevalence and inter-class transfer learning. The experiments were based on a set of 942 pathology reports with human expert annotations as the gold standard. CNN performance was compared against a more conventional term frequency vector space approach. We observed that the deep learning models consistently outperformed the conventional approaches in the class prevalence experiment, resulting in micro- and macro-F score increases of up to 0.132 and 0.226, respectively, when class labels were well populated. Specifically, the best performing CNN achieved a micro-F score of 0.722 over 12 ICD-O-3 topography codes. Transfer learning provided a consistent but modest performance boost for the deep learning methods, but trends were contingent on CNN method and cancer site. These encouraging results demonstrate the potential of deep learning for automated abstraction of pathology reports.},
doi = {10.1109/JBHI.2017.2700722},
journal = {IEEE Journal of Biomedical and Health Informatics},
number = 1,
volume = 22,
place = {United States},
year = {2017},
month = {5}
}

Citation Metrics:
Cited by: 59 works
Citation information provided by
Web of Science

Works referencing / citing this record:

Machine learning mortality classification in clinical documentation with increased accuracy in visual‐based analyses
journal, December 2019

  • Slattery, Susan M.; Knight, Daniel C.; Weese‐Mayer, Debra E.
  • Acta Paediatrica, Vol. 109, Issue 7
  • DOI: 10.1111/apa.15109

The Current Research Landscape on the Artificial Intelligence Application in the Management of Depressive Disorders: A Bibliometric Analysis
journal, June 2019

  • Tran, Bach Xuan; McIntyre, Roger S.; Latkin, Carl A.
  • International Journal of Environmental Research and Public Health, Vol. 16, Issue 12
  • DOI: 10.3390/ijerph16122150

Asymmetric Residual Neural Network for Accurate Human Activity Recognition
journal, June 2019