Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types
Abstract
In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
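The abstract's contrast between macro and micro F1 can be illustrated with a small sketch. The labels below are made up for illustration and are not data from the paper, and the helper `f1_scores` is a hypothetical name. The sketch assumes single-label classification and shows why a model that ignores a rare class can still score a high micro F1 while its macro F1 collapses.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro F1 pools counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy, made-up labels: 95 "common" reports and 5 "rare" reports.
y_true = ["common"] * 95 + ["rare"] * 5
# A hypothetical classifier that always predicts the majority class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints: micro F1 = 0.950, macro F1 = 0.487
```

In this toy case the rare class contributes an F1 of zero, which halves the macro average while barely affecting the micro score.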
- Authors:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Vanderbilt Univ., Nashville, TN (United States)
- Univ. of Kentucky, Lexington, KY (United States)
- Louisiana State Univ., New Orleans, LA (United States)
- Rutgers Univ., New Brunswick, NJ (United States)
- Univ. of Utah, Salt Lake City, UT (United States)
- Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
- Univ. of New Mexico, Albuquerque, NM (United States)
- California Dept. of Public Health, Sacramento, CA (United States)
- Information Management Services, Inc., Calverton, MD (United States)
- National Cancer Institute, Bethesda, MD (United States)
- Publication Date:
- 2021-11-22
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA); Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); University of Utah; Huntsman Cancer Foundation
- OSTI Identifier:
- 1884003
- Grant/Contract Number:
- AC05-00OR22725; AC02-06-CH11357; AC52-07NA27344; AC5206NA25396; 5NU58DP006344; HHSN261201800032I; HHSN261201800015I; HHSN261201800009I; U58DP00003907; NU58DP006332-02-00; HHSN261201800014I; HHSN261291800004I; HHSN261201800016I
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Journal of Biomedical Informatics
- Additional Journal Information:
- Journal Volume: 125; Journal Issue: 1; Journal ID: ISSN 1532-0464
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 60 APPLIED LIFE SCIENCES; deep learning; CNN; class imbalance; text classification; ensemble; NLP
Citation Formats
De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States: N. p., 2021. Web. doi:10.1016/j.jbi.2021.103957.
De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., & Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States. https://doi.org/10.1016/j.jbi.2021.103957
De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. 2021. "Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types". United States. https://doi.org/10.1016/j.jbi.2021.103957. https://www.osti.gov/servlets/purl/1884003.
@article{osti_1884003,
title = {Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types},
author = {De Angeli, Kevin and Gao, Shang and Danciu, Ioana and Durbin, Eric B. and Wu, Xiao-Cheng and Stroup, Antoinette M. and Doherty, Jennifer Anne and Schwartz, Stephen Marc and Wiggins, Charles L. and Damesyn, Mark A. and Coyle, Linda M. and Penberthy, Lynne T. and Tourassi, Georgia D. and Yoon, Hong-Jun},
abstractNote = {In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.},
doi = {10.1016/j.jbi.2021.103957},
journal = {Journal of Biomedical Informatics},
number = 1,
volume = 125,
place = {United States},
year = {2021},
month = nov
}
Works referenced in this record:
Clinical Text Classification with Rule-based Features and Knowledge-guided Convolutional Neural Networks
conference, June 2018
- Yao, Liang; Mao, Chengsheng; Luo, Yuan
- 2018 IEEE International Conference on Healthcare Informatics Workshop (ICHI-W)
Shortcut learning in deep neural networks
journal, November 2020
- Geirhos, Robert; Jacobsen, Jörn-Henrik; Michaelis, Claudio
- Nature Machine Intelligence, Vol. 2, Issue 11
Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records
journal, November 2019
- Savova, Guergana K.; Danciu, Ioana; Alamudun, Folami
- Cancer Research, Vol. 79, Issue 21
Classifying medical relations in clinical text via convolutional neural networks
journal, January 2019
- He, Bin; Guan, Yi; Dai, Rui
- Artificial Intelligence in Medicine, Vol. 93
SMOTE: Synthetic Minority Over-sampling Technique
journal, January 2002
- Chawla, N. V.; Bowyer, K. W.; Hall, L. O.
- Journal of Artificial Intelligence Research, Vol. 16
SMOTE for high-dimensional class-imbalanced data
journal, March 2013
- Blagus, Rok; Lusa, Lara
- BMC Bioinformatics, Vol. 14, Issue 1
Classifying cancer pathology reports with hierarchical self-attention networks
journal, November 2019
- Gao, Shang; Qiu, John X.; Alawad, Mohammed
- Artificial Intelligence in Medicine, Vol. 101
Dealing with Data Imbalance in Text Classification
journal, January 2019
- Padurariu, Cristian; Breaban, Mihaela Elena
- Procedia Computer Science, Vol. 159
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem
journal, February 2020
- Rendón, Eréndira; Alejo, Roberto; Castorena, Carlos
- Applied Sciences, Vol. 10, Issue 4
Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks
journal, November 2019
- Alawad, Mohammed; Gao, Shang; Qiu, John X.
- Journal of the American Medical Informatics Association, Vol. 27, Issue 1
Hierarchical attention networks for information extraction from cancer pathology reports
journal, November 2017
- Gao, Shang; Young, Michael T.; Qiu, John X.
- Journal of the American Medical Informatics Association, Vol. 25, Issue 3
Convolutional neural networks for biomedical text classification
conference, September 2015
- Rios, Anthony; Kavuluru, Ramakanth
- Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
Deep active learning for classifying cancer pathology reports
journal, March 2021
- De Angeli, Kevin; Gao, Shang; Alawad, Mohammed
- BMC Bioinformatics, Vol. 22, Issue 1
On Robustness and Transferability of Convolutional Neural Networks
conference, June 2021
- Djolonga, Josip; Yung, Jessica; Tschannen, Michael
- 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Measuring Domain Shift for Deep Learning in Histopathology
journal, February 2021
- Stacke, Karin; Eilertsen, Gabriel; Unger, Jonas
- IEEE Journal of Biomedical and Health Informatics, Vol. 25, Issue 2
Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports
journal, January 2018
- Qiu, John X.; Yoon, Hong-Jun; Fearn, Paul A.
- IEEE Journal of Biomedical and Health Informatics, Vol. 22, Issue 1
Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
journal, August 2019
- Miyato, Takeru; Maeda, Shin-Ichi; Koyama, Masanori
- IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, Issue 8
On the Class Imbalance Problem
conference, October 2008
- Guo, Xinjian; Yin, Yilong; Dong, Cailing
- 2008 Fourth International Conference on Natural Computation
Hierarchical Convolutional Attention Networks for Text Classification
conference, January 2018
- Gao, Shang; Ramanathan, Arvind; Tourassi, Georgia
- Proceedings of The Third Workshop on Representation Learning for NLP
Survey on deep learning with class imbalance
journal, March 2019
- Johnson, Justin M.; Khoshgoftaar, Taghi M.
- Journal of Big Data, Vol. 6, Issue 1
Experimental perspectives on learning from imbalanced data
conference, January 2007
- Van Hulse, Jason; Khoshgoftaar, Taghi M.; Napolitano, Amri
- Proceedings of the 24th international conference on Machine learning - ICML '07
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
conference, October 2021
- Hendrycks, Dan; Basart, Steven; Mu, Norman
- 2021 IEEE/CVF International Conference on Computer Vision (ICCV)