Extraction of Tumor Site from Cancer Pathology Reports using Deep Filters

Dubey, Abhishek; Hinkle, Jacob; Christian, Blair; Tourassi, Georgia

doi:10.1145/3307339.3342173

Conference · September 1, 2019

DOI: https://doi.org/10.1145/3307339.3342173 · OSTI ID: 1820880

Dubey, Abhishek [1]

[1] ORNL

Purpose: Pathology reports are the primary source of information on the millions of cancer cases across the United States. Cancer registries manually process these reports to extract pertinent information, including primary tumor site, behavior, histology, laterality, and grade. Processing a large volume of pathology reports in a timely manner is a continuing challenge for cancer registries. The purpose of this study is to develop an information extraction pipeline that extracts reportable information reliably and efficiently.

Method: We have developed a novel inverse-regression (IR) based information extraction pipeline. The inverse-regression based supervised filter has been applied successfully in many application domains. However, its application to information extraction from unstructured text is hindered primarily by the extremely high dimensionality of n-gram representations of text. In this study, we attempt to overcome this obstacle with a novel bootstrapping strategy. First, we use an information-theoretic filter based on mutual information to discard excessive and redundant n-gram features. This step reduces the size and improves the condition number of the sample covariance matrix, which lowers the computational cost and improves the numerical stability of the subsequent inverse-regression step. Then we use localized sliced inverse regression (LSIR) to learn a low-dimensional discriminatory subspace for information inference. In particular, we use the k-nearest neighbors of an unlabeled pathology report in the learned representation to infer the desired information from the labeled data in a supervised manner.

Results: The experiments were conducted on a set of de-identified pathology reports with human expert labels as the ground truth. Our pipeline consistently performed better than or comparably to the best-performing state-of-the-art methods while substantially reducing training and inference times.

Conclusion: Our results demonstrate the potential of an inverse-regression based information extraction pipeline for reliable and efficient information extraction from unstructured text. The information extracted from pathology reports can be used along with clinical information, medical imaging, and genomic information to drive discoveries in cancer research.
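
The three stages named in the Method paragraph (mutual-information filter, LSIR, k-nearest-neighbor inference) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the six "reports" and their labels are made up, all function names are hypothetical, n-grams are encoded as binary presence indicators, a small ridge term is added to the covariance matrix, and the LSIR step is a simplified reading in which each report is replaced by the mean of its nearest same-class neighbors before the usual generalized eigenproblem is solved.

```python
import numpy as np

def ngrams(text, n_max=2):
    """Set of word n-grams (n = 1..n_max) in a lower-cased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)}

def vectorize(docs, vocab):
    """Binary presence matrix of shape (documents, n-grams in vocab)."""
    index = {g: j for j, g in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for r, doc in enumerate(docs):
        for g in ngrams(doc):
            if g in index:
                X[r, index[g]] = 1.0
    return X

def mutual_information(X, y):
    """Mutual information (nats) between each binary feature and the label."""
    mi = np.zeros(X.shape[1])
    for c in np.unique(y):
        p_c = np.mean(y == c)
        for v in (0.0, 1.0):
            p_v = np.mean(X == v, axis=0)                          # P(feature = v)
            p_vc = np.mean((X == v) & (y == c)[:, None], axis=0)   # joint probability
            nz = p_vc > 0
            mi[nz] += p_vc[nz] * np.log(p_vc[nz] / (p_v[nz] * p_c))
    return mi

def lsir_directions(X, y, n_dirs, k=2, ridge=1e-3):
    """Simplified LSIR: solve Gamma_loc b = lambda Sigma b for the top directions."""
    mu = X.mean(axis=0)
    Xc = X - mu
    sigma = Xc.T @ Xc / len(X) + ridge * np.eye(X.shape[1])  # ridge is an assumption
    local_means = np.zeros_like(Xc)
    for i in range(len(X)):
        same = np.flatnonzero(y == y[i])                     # reports in the same class
        dist = np.linalg.norm(Xc[same] - Xc[i], axis=1)
        local_means[i] = Xc[same[np.argsort(dist)[:k]]].mean(axis=0)
    gamma = local_means.T @ local_means / len(X)
    # Whiten with the Cholesky factor so a symmetric eigen-solver can be used.
    L = np.linalg.cholesky(sigma)
    M = np.linalg.solve(L, np.linalg.solve(L, gamma).T)
    _, V = np.linalg.eigh((M + M.T) / 2)                     # eigenvalues ascending
    B = np.linalg.solve(L.T, V[:, ::-1][:, :n_dirs])         # back to original coordinates
    return B, mu

def knn_predict(Z_train, y_train, Z_test, k=3):
    """Majority vote among the k nearest labeled reports in the learned subspace."""
    preds = []
    for z in Z_test:
        nearest = np.argsort(np.linalg.norm(Z_train - z, axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Made-up toy reports, for illustration only.
docs = [
    "invasive ductal carcinoma of the left breast",
    "breast biopsy shows ductal carcinoma in situ",
    "right breast mass lobular carcinoma",
    "adenocarcinoma of the lung upper lobe",
    "lung biopsy small cell carcinoma",
    "left lung lower lobe squamous cell carcinoma",
]
y = np.array(["breast", "breast", "breast", "lung", "lung", "lung"])

vocab = sorted(set().union(*(ngrams(d) for d in docs)))
X = vectorize(docs, vocab)
mi = mutual_information(X, y)
keep = np.argsort(-mi)[:8]                  # step 1: keep the most informative n-grams
B, mu = lsir_directions(X[:, keep], y, n_dirs=1)   # step 2: learn the subspace
Z_train = (X[:, keep] - mu) @ B
new_doc = ["core biopsy of the breast shows ductal carcinoma"]
Z_new = (vectorize(new_doc, vocab)[:, keep] - mu) @ B
pred = knn_predict(Z_train, y, Z_new)       # step 3: infer the label from neighbors
print(pred)
```

Filtering before the eigen-decomposition keeps the covariance matrix small (8 by 8 here instead of one row and column per n-gram), which is the cost and conditioning benefit the Method paragraph describes.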

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1820880

Resource Relation: Conference: ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB 2019), Niagara Falls, New York, United States of America, September 7-10, 2019

Country of Publication: United States

Language: English
