OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports

Abstract

A significant volume of medical data remains unstructured. Natural language processing (NLP) and machine learning (ML) techniques have been shown to successfully extract insights from radiology reports. However, the codependent effects of NLP and ML in this context have not been well studied. Between April 1, 2015 and November 1, 2016, 9418 cross-sectional abdomen/pelvis CT and MR examinations containing our internal structured reporting element for cancer were separated into four categories: Progression, Stable Disease, Improvement, or No Cancer. We combined each of three NLP techniques with five ML algorithms to predict the assigned label using the unstructured report text and compared the performance of each combination. The three NLP algorithms included term frequency-inverse document frequency (TF-IDF), term frequency weighting (TF), and 16-bit feature hashing. The ML algorithms included logistic regression (LR), random decision forest (RDF), one-vs-all support vector machine (SVM), one-vs-all Bayes point machine (BPM), and fully connected neural network (NN). The best-performing NLP model consisted of tokenized unigrams and bigrams with TF-IDF. Increasing N-gram length yielded little to no added benefit for most ML algorithms. With all parameters optimized, SVM had the best performance on the test dataset, with 90.6% average accuracy and an F score of 0.813. The interplay between ML and NLP algorithms and their effect on interpretation accuracy is complex. The best accuracy is achieved when both algorithms are optimized concurrently.
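
The three text representations named in the abstract (TF, TF-IDF over unigrams and bigrams, and 16-bit feature hashing) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the abstract does not give the tokenizer, the exact weighting formulas, or the hash function, so the regex tokenizer, the smoothed IDF variant, the use of CRC32, and the toy report sentences below are all assumptions chosen for illustration. The classifier stage (LR, RDF, SVM, BPM, NN) is omitted.

```python
import math
import re
import zlib
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, tokenize on alphanumerics, and return all 1..n_max-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tf(doc_terms):
    """Term frequency: count of each term divided by the document's term total."""
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def tfidf(corpus_terms):
    """TF-IDF using a smoothed IDF, log((1 + N) / (1 + df)) + 1.

    This is one common variant; the paper's exact formula is not stated
    in the abstract, so treat it as an assumption.
    """
    n_docs = len(corpus_terms)
    df = Counter(t for terms in corpus_terms for t in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: w * idf[t] for t, w in tf(terms).items()}
            for terms in corpus_terms]


def hashed_counts(doc_terms, bits=16):
    """Feature hashing: map each term to one of 2**bits buckets and count hits.

    CRC32 is an assumed stand-in for whatever hash the authors used.
    """
    mask = (1 << bits) - 1
    return Counter(zlib.crc32(t.encode("utf-8")) & mask for t in doc_terms)


# Hypothetical report snippets, invented for illustration only.
reports = [
    "Interval increase in size of hepatic lesions.",
    "Hepatic lesions are stable in size.",
    "No evidence of malignancy.",
]
corpus = [ngrams(r) for r in reports]
vectors = tfidf(corpus)
hashed = [hashed_counts(terms) for terms in corpus]
```

Each entry of `vectors` is a sparse term-to-weight mapping, and each entry of `hashed` is a sparse bucket-to-count mapping with at most 65,536 distinct buckets; either could be fed to a multiclass classifier over the four response categories.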

Authors:
Chen, Po-Hao; Zafar, Hanna; Galperin-Aizenberg, Maya; Cook, Tessa [1]
  1. Hospital of the University of Pennsylvania, Department of Radiology, Perelman School of Medicine (United States)
Publication Date:
2018-04-15
OSTI Identifier:
22795662
Resource Type:
Journal Article
Journal Name:
Journal of Digital Imaging (Online)
Additional Journal Information:
Journal Volume: 31; Journal Issue: 2; Other Information: Copyright (c) 2018 Society for Imaging Informatics in Medicine; Country of input: International Atomic Energy Agency (IAEA); Journal ID: ISSN 1618-727X
Country of Publication:
United States
Language:
English
Subject:
62 RADIOLOGY AND NUCLEAR MEDICINE; ABDOMEN; ACCURACY; ALGORITHMS; COMPUTERIZED TOMOGRAPHY; DATASETS; NEOPLASMS; NEURAL NETWORKS; PELVIS; PERFORMANCE; PROGRAMMING LANGUAGES; RADIOLOGY

Citation Formats

Chen, Po-Hao, Zafar, Hanna, Galperin-Aizenberg, Maya, and Cook, Tessa. Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports. United States: N. p., 2018. Web. doi:10.1007/S10278-017-0027-X.
Chen, Po-Hao, Zafar, Hanna, Galperin-Aizenberg, Maya, & Cook, Tessa (2018). Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports. United States. https://doi.org/10.1007/S10278-017-0027-X
Chen, Po-Hao, Zafar, Hanna, Galperin-Aizenberg, Maya, and Cook, Tessa. 2018. "Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports". United States. https://doi.org/10.1007/S10278-017-0027-X.
@article{osti_22795662,
title = {Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports},
author = {Chen, Po-Hao and Zafar, Hanna and Galperin-Aizenberg, Maya and Cook, Tessa},
abstractNote = {A significant volume of medical data remains unstructured. Natural language processing (NLP) and machine learning (ML) techniques have been shown to successfully extract insights from radiology reports. However, the codependent effects of NLP and ML in this context have not been well studied. Between April 1, 2015 and November 1, 2016, 9418 cross-sectional abdomen/pelvis CT and MR examinations containing our internal structured reporting element for cancer were separated into four categories: Progression, Stable Disease, Improvement, or No Cancer. We combined each of three NLP techniques with five ML algorithms to predict the assigned label using the unstructured report text and compared the performance of each combination. The three NLP algorithms included term frequency-inverse document frequency (TF-IDF), term frequency weighting (TF), and 16-bit feature hashing. The ML algorithms included logistic regression (LR), random decision forest (RDF), one-vs-all support vector machine (SVM), one-vs-all Bayes point machine (BPM), and fully connected neural network (NN). The best-performing NLP model consisted of tokenized unigrams and bigrams with TF-IDF. Increasing N-gram length yielded little to no added benefit for most ML algorithms. With all parameters optimized, SVM had the best performance on the test dataset, with 90.6\% average accuracy and an F score of 0.813. The interplay between ML and NLP algorithms and their effect on interpretation accuracy is complex. The best accuracy is achieved when both algorithms are optimized concurrently.},
doi = {10.1007/S10278-017-0027-X},
url = {https://www.osti.gov/biblio/22795662},
journal = {Journal of Digital Imaging (Online)},
issn = {1618-727X},
number = 2,
volume = 31,
place = {United States},
year = {2018},
month = apr
}