Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction
Abstract
Deep learning has surged in popularity and proven effective for various artificial intelligence applications, including information extraction from cancer pathology reports. Since word representation is the core unit that enables deep learning algorithms to process words and perform NLP, this representation must carry as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work, in addition to the distributional information of words in large corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between words. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed CNN models without pre-trained word embeddings, and even CNN models with word embeddings pre-trained on a domain-specific corpus, across all four tasks. The results show marginal improvement on the laterality task but a significant improvement on the other tasks, especially for the macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with more clinical data resources for information abstraction tasks on clinical pathology reports.
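The enrichment step described in the abstract can be illustrated with a minimal sketch of the iterative retrofitting update of Faruqui et al. (2015), in which each word vector is pulled toward its lexicon neighbours while staying close to its original distributional vector. This is an assumption-laden illustration, not the authors' implementation: the function name `retrofit`, the toy vectors, the synonym pairs standing in for UMLS relations, and the weights (alpha = 1, beta = 1/degree) are all hypothetical choices made here.

```python
def retrofit(embeddings, lexicon, iterations=10, alpha=1.0):
    """Return retrofitted copies of `embeddings` (dict: word -> list of floats).

    `lexicon` maps a word to a list of semantically related words
    (hypothetical stand-in for relations drawn from a resource such as UMLS).
    """
    # Start from a copy of the original (distributional) vectors.
    new = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(iterations):
        for word, related in lexicon.items():
            if word not in new:
                continue
            # Only neighbours that have a vector can contribute.
            neighbours = [n for n in related if n in new]
            if not neighbours:
                continue
            beta = 1.0 / len(neighbours)  # assumed weighting: 1 / degree
            total_weight = beta * len(neighbours) + alpha
            # Weighted average of the current neighbour vectors and the
            # word's own original vector, computed per dimension.
            new[word] = [
                (sum(beta * new[n][d] for n in neighbours)
                 + alpha * embeddings[word][d]) / total_weight
                for d in range(len(new[word]))
            ]
    return new


def distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Toy 2-D vectors; real embeddings would have hundreds of dimensions.
original = {
    "carcinoma": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "left": [-1.0, 0.0],
}
# Hypothetical synonym relation, for illustration only.
relations = {"carcinoma": ["cancer"], "cancer": ["carcinoma"]}

refined = retrofit(original, relations)
```

In this toy setting the two related terms end up closer together than they started, while "left", which has no entry in the lexicon, keeps its original vector. The refined vectors could then initialise the embedding layer of a classifier in place of the originals.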
- Authors:
- Alawad, Mohammed; Hasan, S M Shamimul; Christian, Blair; Tourassi, Georgia
- ORNL
- Publication Date:
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1491322
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: IEEE International Conference on Big Data - Seattle, Washington, United States of America - 12/10/2018 10:00:00 AM-12/13/2018 10:00:00 AM
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Alawad, Mohammed, Hasan, S M Shamimul, Christian, Blair, and Tourassi, Georgia. Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction. United States: N. p., 2018. Web. doi:10.1109/BigData.2018.8621999.
Alawad, Mohammed, Hasan, S M Shamimul, Christian, Blair, & Tourassi, Georgia. Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction. United States. https://doi.org/10.1109/BigData.2018.8621999
Alawad, Mohammed, Hasan, S M Shamimul, Christian, Blair, and Tourassi, Georgia. 2018. "Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction". United States. https://doi.org/10.1109/BigData.2018.8621999. https://www.osti.gov/servlets/purl/1491322.
@article{osti_1491322,
title = {Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction},
author = {Alawad, Mohammed and Hasan, S M Shamimul and Christian, Blair and Tourassi, Georgia},
abstractNote = {Deep learning has surged in popularity and proven effective for various artificial intelligence applications, including information extraction from cancer pathology reports. Since word representation is the core unit that enables deep learning algorithms to process words and perform NLP, this representation must carry as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work, in addition to the distributional information of words in large corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between words. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed CNN models without pre-trained word embeddings, and even CNN models with word embeddings pre-trained on a domain-specific corpus, across all four tasks. The results show marginal improvement on the laterality task but a significant improvement on the other tasks, especially for the macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with more clinical data resources for information abstraction tasks on clinical pathology reports.},
doi = {10.1109/BigData.2018.8621999},
url = {https://www.osti.gov/biblio/1491322},
place = {United States},
year = {2018},
month = {12}
}