OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Semi-Supervised Information Extraction for Cancer Pathology Reports

Abstract

Pathology reports are a main source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive but necessary for obtaining the labeled data needed to train automated information extraction systems. In this study, we investigated semi-supervised deep learning as a means of improving the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We find that semi-supervised training consistently improved individual performance, increasing micro-averaged F-scores by between 0.012 and 0.064 and macro-averaged F-scores by up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.
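
The abstract reports gains in both micro- and macro-averaged F-scores. As a minimal sketch of how the two averages differ (this is not the authors' evaluation code; the function name `f1_scores` and the toy labels are made up for illustration), the micro average pools counts over all classes, while the macro average gives each class equal weight, so rare classes influence it more:

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts across classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy labels: one common class and one rare class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case a single error on the rare class lowers the macro average more than the micro average, which is one reason improvements on infrequent classes can show up more strongly in macro-averaged scores.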

Authors:
Qiu, John X. [1]; Gao, Shang [1]; Alawad, Mohammed M. [1]; Schaefferkoetter, Noah T. [1]; Alamudun, Folami T. [1]; Yoon, Hong-Jun [1]; Wu, Xiao-Cheng [2]; Tourassi, Georgia [1]
  1. ORNL
  2. LSUHSC-Louisiana Tumor Registry
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1564225
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE EMBS International Conference on Biomedical & Health Informatics (IEEE-EMBS BHI 2019), Chicago, Illinois, United States of America, 5/19/2019 to 5/22/2019
Country of Publication:
United States
Language:
English

Citation Formats

Qiu, John X., Gao, Shang, Alawad, Mohammed M., Schaefferkoetter, Noah T., Alamudun, Folami T., Yoon, Hong-Jun, Wu, Xiao-Cheng, and Tourassi, Georgia. Semi-Supervised Information Extraction for Cancer Pathology Reports. United States: N. p., 2019. Web.
Qiu, John X., Gao, Shang, Alawad, Mohammed M., Schaefferkoetter, Noah T., Alamudun, Folami T., Yoon, Hong-Jun, Wu, Xiao-Cheng, & Tourassi, Georgia. Semi-Supervised Information Extraction for Cancer Pathology Reports. United States.
Qiu, John X., Gao, Shang, Alawad, Mohammed M., Schaefferkoetter, Noah T., Alamudun, Folami T., Yoon, Hong-Jun, Wu, Xiao-Cheng, and Tourassi, Georgia. 2019. "Semi-Supervised Information Extraction for Cancer Pathology Reports". United States. https://www.osti.gov/servlets/purl/1564225.
@article{osti_1564225,
title = {Semi-Supervised Information Extraction for Cancer Pathology Reports},
author = {Qiu, John X. and Gao, Shang and Alawad, Mohammed M. and Schaefferkoetter, Noah T. and Alamudun, Folami T. and Yoon, Hong-Jun and Wu, Xiao-Cheng and Tourassi, Georgia},
abstractNote = {Pathology reports are a main source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive but necessary for obtaining labeled data to train automated information extraction systems. In this study, we investigated semi-supervised deep learning, improving the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We find that semi-supervised training consistently improved individual performance with increased micro-averaged F-scores between 0.012 and 0.064 and increased macro-averaged F-scores of up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.},
doi = {},
url = {https://www.osti.gov/biblio/1564225},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2019},
month = {5}
}

