OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction

Abstract

Deep learning has surged in popularity and proven effective for various artificial intelligence applications, including information extraction from cancer pathology reports. Since word representation is a core unit that enables deep learning algorithms to understand words and perform NLP, this representation must include as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work, in addition to the distributional information of words in large corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between words. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed CNN models without pre-trained word embeddings, and even CNN models with word embeddings pre-trained on a domain-specific corpus, across all four tasks. The results show marginal improvement on the laterality task, but a significant improvement on the other tasks, especially for the macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with more clinical data resources for information abstraction tasks on clinical pathology reports.
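
The abstract does not spell out the update rule used to enrich the embeddings. As a rough illustration only, the sketch below follows the commonly used retrofitting formulation of Faruqui et al. (2015), in which each word vector is repeatedly averaged with the vectors of its lexicon neighbours while staying anchored to its original value. The function name `retrofit`, the toy vectors, and the `related` dictionary (a stand-in for relations that might be drawn from UMLS) are all assumptions made for this example, not the authors' implementation.

```python
def retrofit(embeddings, lexicon, iterations=10):
    """Pull each word vector toward its lexicon neighbours.

    embeddings: dict mapping word -> list of floats (pre-trained vectors)
    lexicon:    dict mapping word -> list of related words
                (hypothetical stand-in for UMLS-derived relations)
    """
    # Work on a copy so the original vectors stay available as anchors.
    new = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if word not in embeddings:
                continue
            nbrs = [n for n in neighbours if n in new and n != word]
            if not nbrs:
                continue
            k = len(nbrs)
            original = embeddings[word]
            # Weight 1 on the original vector and 1/k on each neighbour,
            # which simplifies to (sum of neighbours + k * original) / (2k).
            new[word] = [
                (sum(new[n][d] for n in nbrs) + k * original[d]) / (2 * k)
                for d in range(len(original))
            ]
    return new


# Toy data: two related terms and one unrelated term (all values invented).
vectors = {
    "carcinoma": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "left": [5.0, 5.0],
}
related = {"carcinoma": ["cancer"], "cancer": ["carcinoma"]}
retrofitted = retrofit(vectors, related)
```

In this toy run the two related terms move closer together, while "left", which has no lexicon entry, keeps its original vector. The retrofitted vectors would then be used to initialize the embedding layer of a downstream classifier.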

Authors:
Alawad, Mohammed M. [1]; Hasan, S M Shamimul [1]; Christian, Blair [1]; Tourassi, Georgia [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1491322
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Big Data - Seattle, Washington, United States of America - 12/10/2018 10:00:00 AM-12/13/2018 10:00:00 AM
Country of Publication:
United States
Language:
English

Citation Formats

Alawad, Mohammed M., Hasan, S M Shamimul, Christian, Blair, and Tourassi, Georgia. Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction. United States: N. p., 2018. Web.
Alawad, Mohammed M., Hasan, S M Shamimul, Christian, Blair, & Tourassi, Georgia. Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction. United States.
Alawad, Mohammed M., Hasan, S M Shamimul, Christian, Blair, and Tourassi, Georgia. 2018. "Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction". United States. https://www.osti.gov/servlets/purl/1491322.
@article{osti_1491322,
title = {Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction},
author = {Alawad, Mohammed M. and Hasan, S M Shamimul and Christian, Blair and Tourassi, Georgia},
abstractNote = {Deep learning has surged in popularity and proven effective for various artificial intelligence applications, including information extraction from cancer pathology reports. Since word representation is a core unit that enables deep learning algorithms to understand words and perform NLP, this representation must include as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work, in addition to the distributional information of words in large corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between words. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed CNN models without pre-trained word embeddings, and even CNN models with word embeddings pre-trained on a domain-specific corpus, across all four tasks. The results show marginal improvement on the laterality task, but a significant improvement on the other tasks, especially for the macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with more clinical data resources for information abstraction tasks on clinical pathology reports.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {12}
}
