Semi-Supervised Information Extraction for Cancer Pathology Reports
Abstract
Pathology reports are a primary source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive but necessary for obtaining labeled data to train automated information extraction systems. In this study, we investigated semi-supervised deep learning as a way to improve the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We found that semi-supervised training consistently improved individual task performance, increasing micro-averaged F-scores by between 0.012 and 0.064 and macro-averaged F-scores by up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.
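The abstract reports gains in both micro- and macro-averaged F-scores. As a minimal sketch of how the two averages differ, the Python below computes both for single-label predictions using the standard definitions. This record does not describe the authors' evaluation code, so the function name `f_scores`, the class labels `common` and `rare`, and the toy predictions are all hypothetical.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: 8 reports of a common class, 2 of a rare class.
y_true = ["common"] * 8 + ["rare"] * 2
baseline = ["common"] * 10                    # never predicts the rare class
improved = ["common"] * 8 + ["rare", "common"]  # recovers one rare report

micro_base, macro_base = f_scores(y_true, baseline)
micro_impr, macro_impr = f_scores(y_true, improved)
print(f"micro gain: {micro_impr - micro_base:.3f}")
print(f"macro gain: {macro_impr - macro_base:.3f}")
```

Because the macro average weights every class equally, recovering a single rare-class report moves it much more than the micro average. That is one plausible reading of why the reported macro-averaged gain (up to 0.158) exceeds the micro-averaged gains.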
- Authors:
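- Qiu, John X.; Gao, Shang; Alawad, Mohammed; Schaefferkoetter, Noah; Alamudun, Folami; Yoon, Hong-Jun; Wu, Xiao-Cheng; Tourassi, Georgia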
- ORNL
- LSUHSC-Louisiana Tumor Registry
- Publication Date:
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1564225
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: IEEE EMBS International Conference on Biomedical & Health Informatics (IEEE-EMBS BHI 2019), Chicago, Illinois, United States of America, May 19-22, 2019
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Qiu, John X., Gao, Shang, Alawad, Mohammed, Schaefferkoetter, Noah, Alamudun, Folami, Yoon, Hong-Jun, Wu, Xiao-Cheng, and Tourassi, Georgia. Semi-Supervised Information Extraction for Cancer Pathology Reports. United States: N. p., 2019. Web. doi:10.1109/BHI.2019.8834470.
Qiu, John X., Gao, Shang, Alawad, Mohammed, Schaefferkoetter, Noah, Alamudun, Folami, Yoon, Hong-Jun, Wu, Xiao-Cheng, & Tourassi, Georgia. Semi-Supervised Information Extraction for Cancer Pathology Reports. United States. https://doi.org/10.1109/BHI.2019.8834470
Qiu, John X., Gao, Shang, Alawad, Mohammed, Schaefferkoetter, Noah, Alamudun, Folami, Yoon, Hong-Jun, Wu, Xiao-Cheng, and Tourassi, Georgia. 2019. "Semi-Supervised Information Extraction for Cancer Pathology Reports". United States. https://doi.org/10.1109/BHI.2019.8834470. https://www.osti.gov/servlets/purl/1564225.
@article{osti_1564225,
title = {Semi-Supervised Information Extraction for Cancer Pathology Reports},
author = {Qiu, John X. and Gao, Shang and Alawad, Mohammed and Schaefferkoetter, Noah and Alamudun, Folami and Yoon, Hong-Jun and Wu, Xiao-Cheng and Tourassi, Georgia},
doi = {10.1109/BHI.2019.8834470},
url = {https://www.osti.gov/biblio/1564225},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2019},
month = {5}
}