Title: Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters

Abstract

Purpose: Pathology reports are the primary source of information concerning the millions of cancer cases across the United States. Cancer registries manually process the pathology reports to extract the pertinent information including primary tumor site, behavior, histology, laterality, and grade. Processing a large volume of the pathology reports in a timely manner is a continuing challenge for cancer registries. The purpose of this study is to develop an information extraction pipeline to reliably and efficiently extract reportable information.

Method: We have developed a novel inverse-regression (IR) based information extraction pipeline. The inverse-regression based supervised filter has been successfully applied to many application domains. However, its application to the information extraction from unstructured text is hindered primarily by the extreme high-dimensionality of n-gram representations of text. In this study, we attempt to overcome the obstacles by a novel bootstrapping strategy. First, we use an information-theoretic mutual information based filter to discard the excessive and redundant n-gram features. This step reduces the size and improves the condition number of the sample covariance matrix, thus reducing the computational cost and improving the numerical stability of the subsequent inverse-regression step. Then we use localized sliced inverse-regression (LSIR) to learn a low-dimensional discriminatory subspace for information inference. In particular, we use the k-nearest neighbors of an unlabeled pathology report in the learned representation to infer the desired information from the labeled data in a supervised manner.

Results: The experiments were conducted on a set of de-identified pathology reports with human expert labels as the ground truth. Our pipeline consistently performed better than or comparable to the best performing state-of-the-art methods while reducing the training and inference times substantially.

Conclusion: Our results demonstrate the potential of an inverse-regression based information extraction pipeline for reliable and efficient information extraction from unstructured text. The information extracted from the pathology reports can be used along with clinical information, medical imaging, and genomic information to instigate discoveries in cancer research.
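
The filter-then-infer idea described under Method can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the mutual information filter and the k-nearest-neighbor inference, and it omits the LSIR subspace step entirely, so neighbors are found directly in the filtered n-gram space. The function names, the four-term vocabulary, and the toy labels are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_column, labels):
    """Mutual information (in nats) between one discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_column, labels))
    count_x = Counter(feature_column)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


def select_features(rows, labels, keep):
    """Return the indices of the `keep` features with the highest mutual information."""
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def project(row, kept):
    """Keep only the selected feature columns of one report."""
    return [row[j] for j in kept]


def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k nearest labeled reports (squared Euclidean distance)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), i)
        for i, row in enumerate(train_rows)
    )
    votes = Counter(train_labels[i] for _, i in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: presence (1) or absence (0) of four n-grams per report.
vocabulary = ["breast", "lung", "the", "specimen"]
reports = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
]
sites = ["breast", "breast", "lung", "lung"]

kept = select_features(reports, sites, keep=2)   # discard uninformative n-grams
filtered = [project(row, kept) for row in reports]
unlabeled = project([1, 0, 1, 1], kept)          # an unlabeled report
predicted_site = knn_predict(filtered, sites, unlabeled, k=3)
print([vocabulary[j] for j in kept], predicted_site)
```

In the pipeline the abstract describes, the filtered features would next be mapped by LSIR into a low-dimensional subspace, and the neighbor search would run in that learned representation rather than on the raw filtered counts used here.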

Authors:
Dubey, Abhishek [1]; Hinkle, Jacob [1]; Christian, Blair [1]; Tourassi, Georgia [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1561636
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2019) - Niagara Falls, New York, United States of America - 9/7/2019 to 9/10/2019
Country of Publication:
United States
Language:
English

Citation Formats

Dubey, Abhishek, Hinkle, Jacob, Christian, Blair, and Tourassi, Georgia. Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters. United States: N. p., 2019. Web. doi:10.1145/3307339.3342173.
Dubey, Abhishek, Hinkle, Jacob, Christian, Blair, & Tourassi, Georgia. Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters. United States. doi:10.1145/3307339.3342173.
Dubey, Abhishek, Hinkle, Jacob, Christian, Blair, and Tourassi, Georgia. Sun . "Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters". United States. doi:10.1145/3307339.3342173. https://www.osti.gov/servlets/purl/1561636.
@article{osti_1561636,
title = {Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters},
author = {Dubey, Abhishek and Hinkle, Jacob and Christian, Blair and Tourassi, Georgia},
abstractNote = {Purpose: Pathology reports are the primary source of information concerning the millions of cancer cases across the United States. Cancer registries manually process the pathology reports to extract the pertinent information including primary tumor site, behavior, histology, laterality, and grade. Processing a large volume of the pathology reports in a timely manner is a continuing challenge for cancer registries. The purpose of this study is to develop an information extraction pipeline to reliably and efficiently extract reportable information. Method: We have developed a novel inverse-regression (IR) based information extraction pipeline. The inverse-regression based supervised filter has been successfully applied to many application domains. However, its application to the information extraction from unstructured text is hindered primarily by the extreme high-dimensionality of n-gram representations of text. In this study, we attempt to overcome the obstacles by a novel bootstrapping strategy. First, we use an information-theoretic mutual information based filter to discard the excessive and redundant n-gram features. This step reduces the size and improves the condition number of the sample covariance matrix, thus reducing the computational cost and improving the numerical stability of the subsequent inverse-regression step. Then we use localized sliced inverse-regression (LSIR) to learn a low-dimensional discriminatory subspace for information inference. In particular, we use the k-nearest neighbors of an unlabeled pathology report in the learned representation to infer the desired information from the labeled data in a supervised manner. Results: The experiments were conducted on a set of de-identified pathology reports with human expert labels as the ground truth. Our pipeline consistently performed better than or comparable to the best performing state-of-the-art methods while reducing the training and inference times substantially. Conclusion: Our results demonstrate the potential of an inverse-regression based information extraction pipeline for reliable and efficient information extraction from unstructured text. The information extracted from the pathology reports can be used along with clinical information, medical imaging, and genomic information to instigate discoveries in cancer research.},
doi = {10.1145/3307339.3342173},
place = {United States},
year = {2019},
month = {9}
}
