Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types

De Angeli, Kevin; Gao, Shang; Danciu, Ioana; Durbin, Eric B.; Wu, Xiao-Cheng; Stroup, Antoinette M.; Doherty, Jennifer Anne; Schwartz, Stephen Marc; Wiggins, Charles L.; Damesyn, Mark A.; Coyle, Linda M.; Penberthy, Lynne T.; Tourassi, Georgia D.; Yoon, Hong-Jun

doi:10.1016/j.jbi.2021.103957

Journal Article · Sun Nov 21 23:00:00 EST 2021 · Journal of Biomedical Informatics

DOI: https://doi.org/10.1016/j.jbi.2021.103957 · OSTI ID: 1884003

De Angeli, Kevin ^[1]; Gao, Shang ^[2]; Danciu, Ioana ^[3]; Durbin, Eric B. ^[4]; Wu, Xiao-Cheng ^[5]; Stroup, Antoinette M. ^[6]; Doherty, Jennifer Anne ^[7]; Schwartz, Stephen Marc ^[8]; Wiggins, Charles L. ^[9]; Damesyn, Mark A. ^[10]; Coyle, Linda M. ^[11]; Penberthy, Lynne T. ^[12]; Tourassi, Georgia D. ^[2]; Yoon, Hong-Jun ^[2]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
[2] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[3] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Vanderbilt Univ., Nashville, TN (United States)
[4] Univ. of Kentucky, Lexington, KY (United States)
[5] Louisiana State Univ., New Orleans, LA (United States)
[6] Rutgers Univ., New Brunswick, NJ (United States)
[7] Univ. of Utah, Salt Lake City, UT (United States)
[8] Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
[9] Univ. of New Mexico, Albuquerque, NM (United States)
[10] California Dept. of Public Health, Sacramento, CA (United States)
[11] Information Management Services, Inc., Calverton, MD (United States)
[12] National Cancer Institute, Bethesda, MD (United States)

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
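The abstract contrasts macro F1 (where rare classes weigh as much as common ones) with micro F1 (dominated by the most frequent classes). The following is a minimal sketch of that distinction, not the authors' evaluation code: the labels `common` and `rare`, the class counts, and the two hypothetical prediction sets are invented purely for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    # Macro F1: unweighted mean of per-class F1, so every class counts equally.
    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class)

    # Micro F1: pool the counts over all classes before computing F1.
    total_tp = sum(tp.values())
    denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / denom if denom else 0.0
    return micro, macro


# Toy, made-up data: 95 reports of a common class, 5 of a rare class.
y_true = ["common"] * 95 + ["rare"] * 5

# Hypothetical model A ignores the rare class entirely.
pred_a = ["common"] * 100

# Hypothetical model B recovers 4 of 5 rare cases but mislabels 8 common ones.
pred_b = ["common"] * 87 + ["rare"] * 8 + ["rare"] * 4 + ["common"] * 1

micro_a, macro_a = f1_scores(y_true, pred_a)
micro_b, macro_b = f1_scores(y_true, pred_b)
print(f"model A: micro={micro_a:.3f} macro={macro_a:.3f}")
print(f"model B: micro={micro_b:.3f} macro={macro_b:.3f}")
```

In this toy setting, model A scores higher on micro F1 while model B scores higher on macro F1, which is the kind of trade-off between top-class and rare-class performance that the abstract describes.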

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA); Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); University of Utah; Huntsman Cancer Foundation

Grant/Contract Number: AC05-00OR22725; AC02-06CH11357; AC52-07NA27344; AC52-06NA25396

OSTI ID: 1884003

Journal Information: Journal of Biomedical Informatics, Vol. 125, Issue 1; ISSN 1532-0464

Publisher: Elsevier

Country of Publication: United States

Language: English

Related Subjects

60 APPLIED LIFE SCIENCES
CNN
NLP
class imbalance
deep learning
ensemble
text classification
