DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types

Abstract

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
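The abstract's contrast between macro and micro F1 can be illustrated with a small sketch. This is not the authors' code or data: the labels, counts, and the helper name `f1_scores` are invented for illustration, and the "classifier" simply always predicts the common class. Under those assumptions, micro F1 stays high because it is dominated by the frequent class, while macro F1 drops because every class, including the rare one, counts equally.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions.

    Hypothetical helper written for this illustration only.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages the per-class F1 scores with equal weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Invented toy data: 95 reports of a common class, 5 of a rare class.
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the common class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the rare class is never recovered, so its per-class F1 is zero and macro F1 is roughly half of micro F1. That is why a method that helps rare classes can show up mainly in macro F1, while a method that helps frequent classes shows up mainly in micro F1.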

Authors:
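 De Angeli, Kevin [1]; Gao, Shang [2]; Danciu, Ioana [3]; Durbin, Eric B. [4]; Wu, Xiao-Cheng [5]; Stroup, Antoinette M. [6]; Doherty, Jennifer Anne [7]; Schwartz, Stephen Marc [8]; Wiggins, Charles L. [9]; Damesyn, Mark A. [10]; Coyle, Linda M. [11]; Penberthy, Lynne T. [12]; Tourassi, Georgia D. [2]; Yoon, Hong-Jun [2]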
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Vanderbilt Univ., Nashville, TN (United States)
  4. Univ. of Kentucky, Lexington, KY (United States)
  5. Louisiana State Univ., New Orleans, LA (United States)
  6. Rutgers Univ., New Brunswick, NJ (United States)
  7. Univ. of Utah, Salt Lake City, UT (United States)
  8. Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
  9. Univ. of New Mexico, Albuquerque, NM (United States)
  10. California Dept. of Public Health, Sacramento, CA (United States)
  11. Information Management Services, Inc., Calverton, MD (United States)
  12. National Cancer Institute, Bethesda, MD (United States)
Publication Date:
2021-11-22
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA); Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); University of Utah; Huntsman Cancer Foundation
OSTI Identifier:
1884003
Grant/Contract Number:  
AC05-00OR22725; AC02-06-CH11357; AC52-07NA27344; AC5206NA25396; 5NU58DP006344; HHSN261201800032I; HHSN261201800015I; HHSN261201800009I; U58DP00003907; NU58DP006332-02–00; HHSN261201800014I; HHSN261291800004I; HHSN261201800016I
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Biomedical Informatics
Additional Journal Information:
Journal Volume: 125; Journal Issue: 1; Journal ID: ISSN 1532-0464
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; deep learning; CNN; class imbalance; text classification; ensemble; NLP

Citation Formats

De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States: N. p., 2021. Web. doi:10.1016/j.jbi.2021.103957.
De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., & Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States. https://doi.org/10.1016/j.jbi.2021.103957
De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. 2021. "Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types". United States. https://doi.org/10.1016/j.jbi.2021.103957. https://www.osti.gov/servlets/purl/1884003.
@article{osti_1884003,
title = {Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types},
author = {De Angeli, Kevin and Gao, Shang and Danciu, Ioana and Durbin, Eric B. and Wu, Xiao-Cheng and Stroup, Antoinette M. and Doherty, Jennifer Anne and Schwartz, Stephen Marc and Wiggins, Charles L. and Damesyn, Mark A. and Coyle, Linda M. and Penberthy, Lynne T. and Tourassi, Georgia D. and Yoon, Hong-Jun},
abstractNote = {In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.},
doi = {10.1016/j.jbi.2021.103957},
journal = {Journal of Biomedical Informatics},
number = 1,
volume = 125,
place = {United States},
year = {2021},
month = {11}
}
