
Title: Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks

Abstract

Objective: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency.

Materials and Methods: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC).

Results: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN.

Conclusions: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task–specific model.
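
The efficiency claim in the abstract can be made concrete with a rough parameter count. The sketch below is illustrative only: it assumes a word-level CNN made of one embedding table, parallel 1-D convolutions, and one dense output layer per task. The vocabulary size, embedding width, filter widths, and filter count are hypothetical placeholders, not values reported in this record; only the five class counts come from the abstract.

```python
# Rough trainable-parameter count for hard parameter sharing versus
# training one CNN per task. Architecture details are assumptions.

# Class counts per task, as listed in the abstract.
TASK_CLASSES = {
    "primary_site": 65,
    "laterality": 4,
    "behavior": 3,
    "histological_type": 63,
    "histological_grade": 5,
}

# Hypothetical hyperparameters (not taken from the record).
VOCAB_SIZE = 50_000
EMBED_DIM = 300
FILTER_WIDTHS = (3, 4, 5)
FILTERS_PER_WIDTH = 100


def trunk_params():
    """Embedding table plus parallel 1-D convolutions (weights and biases)."""
    embedding = VOCAB_SIZE * EMBED_DIM
    conv = sum(
        width * EMBED_DIM * FILTERS_PER_WIDTH + FILTERS_PER_WIDTH
        for width in FILTER_WIDTHS
    )
    return embedding + conv


def head_params(n_classes):
    """One dense output layer over the concatenated max-pooled features."""
    features = len(FILTER_WIDTHS) * FILTERS_PER_WIDTH
    return features * n_classes + n_classes


def single_task_params(n_classes):
    """A stand-alone CNN: its own trunk and one output layer."""
    return trunk_params() + head_params(n_classes)


def hard_sharing_params():
    """One shared trunk and one output layer per task."""
    return trunk_params() + sum(head_params(c) for c in TASK_CLASSES.values())


def separate_models_params():
    """Five independent single-task CNNs, each with its own trunk."""
    return sum(single_task_params(c) for c in TASK_CLASSES.values())
```

Under these assumed sizes, calling `hard_sharing_params()` and `single_task_params(65)` gives nearly the same total, because the shared embedding and convolution layers dominate and each extra task adds only a small output layer. `separate_models_params()` is roughly five times larger.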

Authors:
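Alawad, Mohammed [1]; Gao, Shang [1]; Qiu, John X. [1]; Yoon, Hong Jun [1]; Blair Christian, J. [1]; Penberthy, Lynne [2]; Mumphrey, Brent [3]; Wu, Xiao-Cheng [3]; Coyle, Linda [4]; Tourassi, Georgia [1]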
  1. Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
  2. Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, USA
  3. Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, Louisiana, USA
  4. Information Management Services Inc, Calverton, Maryland, USA
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1574032
Resource Type:
Published Article
Journal Name:
Journal of the American Medical Informatics Association
Additional Journal Information:
Journal Name: Journal of the American Medical Informatics Association; Journal ID: ISSN 1067-5027
Publisher:
Oxford University Press
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Alawad, Mohammed, Gao, Shang, Qiu, John X., Yoon, Hong Jun, Blair Christian, J., Penberthy, Lynne, Mumphrey, Brent, Wu, Xiao-Cheng, Coyle, Linda, and Tourassi, Georgia. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. United Kingdom: N. p., 2019. Web. doi:10.1093/jamia/ocz153.
Alawad, Mohammed, Gao, Shang, Qiu, John X., Yoon, Hong Jun, Blair Christian, J., Penberthy, Lynne, Mumphrey, Brent, Wu, Xiao-Cheng, Coyle, Linda, & Tourassi, Georgia. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. United Kingdom. doi:10.1093/jamia/ocz153.
Alawad, Mohammed, Gao, Shang, Qiu, John X., Yoon, Hong Jun, Blair Christian, J., Penberthy, Lynne, Mumphrey, Brent, Wu, Xiao-Cheng, Coyle, Linda, and Tourassi, Georgia. Sat . "Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks". United Kingdom. doi:10.1093/jamia/ocz153.
@article{osti_1574032,
title = {Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks},
author = {Alawad, Mohammed and Gao, Shang and Qiu, John X. and Yoon, Hong Jun and Blair Christian, J. and Penberthy, Lynne and Mumphrey, Brent and Wu, Xiao-Cheng and Coyle, Linda and Tourassi, Georgia},
abstractNote = {Abstract Objective We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. Materials and Methods Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). Results MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). 
Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. Conclusions The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task–specific model.},
doi = {10.1093/jamia/ocz153},
journal = {Journal of the American Medical Informatics Association},
place = {United Kingdom},
year = {2019},
month = {11}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1093/jamia/ocz153

