Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction

Alawad, Mohammed; Hasan, S M Shamimul; Christian, Blair; Tourassi, Georgia

doi:10.1109/BigData.2018.8621999

Conference · Sat Dec 01 00:00:00 EST 2018

DOI: https://doi.org/10.1109/BigData.2018.8621999 · OSTI ID: 1491322

Alawad, Mohammed ^[1]; Hasan, S M Shamimul ^[1]; Christian, Blair ^[1]; Tourassi, Georgia ^[1]

^[1] ORNL

Deep learning has surged in popularity and proven effective for various artificial intelligence applications, including information extraction from cancer pathology reports. Because word representations are the core unit through which deep learning algorithms process words and perform NLP, they should carry as much information as possible to help these algorithms achieve high classification performance. In this work, in addition to the distributional information of words in large corpora, we therefore use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between them. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed both CNN models without pre-trained word embeddings and CNN models with word embeddings pre-trained on a domain-specific corpus, across all four tasks. The results show marginal improvement on the laterality task but significant improvement on the other tasks, especially in macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with further clinical data resources for information abstraction tasks on clinical pathology reports.
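The abstract does not spell out the retrofitting procedure, so the following is only a minimal sketch of one common formulation, the iterative update of Faruqui et al. (2015), and it is an assumption that the paper's variant resembles it. Each word vector is repeatedly replaced by a weighted average of its own original (distributional) vector and the current vectors of its lexicon neighbours. The function name `retrofit`, the parameters `iterations` and `alpha`, and the toy words and relations are all hypothetical; they are not taken from the paper or from the UMLS.

```python
from typing import Dict, List

Vector = List[float]


def retrofit(embeddings: Dict[str, Vector],
             lexicon: Dict[str, List[str]],
             iterations: int = 10,
             alpha: float = 1.0) -> Dict[str, Vector]:
    """Pull each vector toward its lexicon neighbours while keeping it
    close to its original value (assumed Faruqui-style update)."""
    # Work on a copy so the original embeddings are left untouched.
    retrofitted = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if word not in embeddings:
                continue
            # Only neighbours that actually have a vector can contribute.
            known = [n for n in neighbours if n in retrofitted]
            if not known:
                continue
            beta = 1.0 / len(known)  # equal neighbour weights that sum to 1
            new_vec = []
            for d in range(len(embeddings[word])):
                neighbour_sum = sum(beta * retrofitted[n][d] for n in known)
                # Weighted average of neighbours and the original vector.
                new_vec.append(
                    (neighbour_sum + alpha * embeddings[word][d]) / (1.0 + alpha)
                )
            retrofitted[word] = new_vec
    return retrofitted


# Hypothetical toy data: two related terms and one unrelated term.
toy_embeddings = {
    "carcinoma": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "left": [5.0, 5.0],
}
toy_lexicon = {"carcinoma": ["cancer"], "cancer": ["carcinoma"]}
toy_result = retrofit(toy_embeddings, toy_lexicon)
```

In this toy run the two related terms move closer together, while "left", which has no lexicon entry, keeps its original vector. A larger `alpha` keeps vectors nearer their distributional starting point; a smaller one lets the lexicon dominate.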

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1491322

Resource Relation: Conference: IEEE International Conference on Big Data, Seattle, Washington, United States of America, December 10-13, 2018

Country of Publication: United States

Language: English
