Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types

De Angeli, Kevin; Gao, Shang; Danciu, Ioana; Durbin, Eric B.; Wu, Xiao-Cheng; Stroup, Antoinette M.; Doherty, Jennifer Anne; Schwartz, Stephen Marc; Wiggins, Charles L.; Damesyn, Mark A.; Coyle, Linda M.; Penberthy, Lynne T.; Tourassi, Georgia D.; Yoon, Hong-Jun

doi:10.1016/j.jbi.2021.103957

Abstract

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
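The abstract contrasts macro F1 (rare classes count as much as common ones) with micro F1 (dominated by the most frequent classes). The sketch below is only an illustration of that distinction: the toy labels and the two hypothetical models are assumptions made here, not data or methods from the paper, and the helper `f1_scores` is an illustrative name.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages per-class F1, so every class has equal weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Hypothetical toy data: 95 reports of a common class, 5 of a rare class.
y_true = ["common"] * 95 + ["rare"] * 5
# Hypothetical model A always predicts the common class.
pred_a = ["common"] * 100
# Hypothetical model B recovers 4 of 5 rare cases but mislabels 10 common ones.
pred_b = ["common"] * 85 + ["rare"] * 10 + ["rare"] * 4 + ["common"] * 1

micro_a, macro_a = f1_scores(y_true, pred_a)
micro_b, macro_b = f1_scores(y_true, pred_b)
print(f"A: micro={micro_a:.3f} macro={macro_a:.3f}")
print(f"B: micro={micro_b:.3f} macro={macro_b:.3f}")
```

In this toy setting model A scores higher on micro F1 while model B scores higher on macro F1, which mirrors the kind of trade-off the abstract reports between strong performance on top classes and on rare classes.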

Authors:

De Angeli, Kevin ^[1]; Gao, Shang ^[2]; Danciu, Ioana ^[3]; Durbin, Eric B. ^[4]; Wu, Xiao-Cheng ^[5]; Stroup, Antoinette M. ^[6]; Doherty, Jennifer Anne ^[7]; Schwartz, Stephen Marc ^[8]; Wiggins, Charles L. ^[9]; Damesyn, Mark A. ^[10]; Coyle, Linda M. ^[11]; Penberthy, Lynne T. ^[12]; Tourassi, Georgia D. ^[2]; Yoon, Hong-Jun ^[2]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
[2] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[3] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Vanderbilt Univ., Nashville, TN (United States)
[4] Univ. of Kentucky, Lexington, KY (United States)
[5] Louisiana State Univ., New Orleans, LA (United States)
[6] Rutgers Univ., New Brunswick, NJ (United States)
[7] Univ. of Utah, Salt Lake City, UT (United States)
[8] Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
[9] Univ. of New Mexico, Albuquerque, NM (United States)
[10] California Dept. of Public Health, Sacramento, CA (United States)
[11] Information Management Services, Inc., Calverton, MD (United States)
[12] National Cancer Institute, Bethesda, MD (United States)

Publication Date: 2021-11-22

Research Org.: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

Sponsoring Org.: USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA); Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); University of Utah; Huntsman Cancer Foundation

OSTI Identifier: 1884003

Grant/Contract Number: AC05-00OR22725; AC02-06-CH11357; AC52-07NA27344; AC5206NA25396; 5NU58DP006344; HHSN261201800032I; HHSN261201800015I; HHSN261201800009I; U58DP00003907; NU58DP006332-02–00; HHSN261201800014I; HHSN261291800004I; HHSN261201800016I

Resource Type: Accepted Manuscript

Journal Name: Journal of Biomedical Informatics

Additional Journal Information: Journal Volume: 125; Journal Issue: 1; Journal ID: ISSN 1532-0464

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 60 APPLIED LIFE SCIENCES; deep learning; CNN; class imbalance; text classification; ensemble; NLP

Citation Formats

De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States: N. p., 2021. Web. doi:10.1016/j.jbi.2021.103957.

De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., & Yoon, Hong-Jun. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. United States. https://doi.org/10.1016/j.jbi.2021.103957

De Angeli, Kevin, Gao, Shang, Danciu, Ioana, Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette M., Doherty, Jennifer Anne, Schwartz, Stephen Marc, Wiggins, Charles L., Damesyn, Mark A., Coyle, Linda M., Penberthy, Lynne T., Tourassi, Georgia D., and Yoon, Hong-Jun. 2021. "Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types". United States. https://doi.org/10.1016/j.jbi.2021.103957. https://www.osti.gov/servlets/purl/1884003.

                    
@article{osti_1884003,

  title        = {Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types},

  author       = {De Angeli, Kevin and Gao, Shang and Danciu, Ioana and Durbin, Eric B. and Wu, Xiao-Cheng and Stroup, Antoinette M. and Doherty, Jennifer Anne and Schwartz, Stephen Marc and Wiggins, Charles L. and Damesyn, Mark A. and Coyle, Linda M. and Penberthy, Lynne T. and Tourassi, Georgia D. and Yoon, Hong-Jun},

  abstractNote = {In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.},

  doi          = {10.1016/j.jbi.2021.103957},

  journal      = {Journal of Biomedical Informatics},

  number       = 1,

  volume       = 125,

  place        = {United States},

  year         = {2021},

  month        = {11}

}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1016/j.jbi.2021.103957


Works referenced in this record:

Clinical Text Classification with Rule-based Features and Knowledge-guided Convolutional Neural Networks
conference, June 2018

Yao, Liang; Mao, Chengsheng; Luo, Yuan
2018 IEEE International Conference on Healthcare Informatics Workshop (ICHI-W)
DOI: 10.1109/ICHI-W.2018.00024

Shortcut learning in deep neural networks
journal, November 2020

Geirhos, Robert; Jacobsen, Jörn-Henrik; Michaelis, Claudio
Nature Machine Intelligence, Vol. 2, Issue 11
DOI: 10.1038/s42256-020-00257-z

Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records
journal, November 2019

Savova, Guergana K.; Danciu, Ioana; Alamudun, Folami
Cancer Research, Vol. 79, Issue 21
DOI: 10.1158/0008-5472.CAN-19-0579

Classifying medical relations in clinical text via convolutional neural networks
journal, January 2019

He, Bin; Guan, Yi; Dai, Rui
Artificial Intelligence in Medicine, Vol. 93
DOI: 10.1016/j.artmed.2018.05.001

SMOTE: Synthetic Minority Over-sampling Technique
journal, January 2002

Chawla, N. V.; Bowyer, K. W.; Hall, L. O.
Journal of Artificial Intelligence Research, Vol. 16
DOI: 10.1613/jair.953

SMOTE for high-dimensional class-imbalanced data
journal, March 2013

Blagus, Rok; Lusa, Lara
BMC Bioinformatics, Vol. 14, Issue 1
DOI: 10.1186/1471-2105-14-106

Classifying cancer pathology reports with hierarchical self-attention networks
journal, November 2019

Gao, Shang; Qiu, John X.; Alawad, Mohammed
Artificial Intelligence in Medicine, Vol. 101
DOI: 10.1016/j.artmed.2019.101726

Dealing with Data Imbalance in Text Classification
journal, January 2019

Padurariu, Cristian; Breaban, Mihaela Elena
Procedia Computer Science, Vol. 159
DOI: 10.1016/j.procs.2019.09.229

Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem
journal, February 2020

Rendón, Eréndira; Alejo, Roberto; Castorena, Carlos
Applied Sciences, Vol. 10, Issue 4
DOI: 10.3390/app10041276

Bagging predictors
journal, August 1996

Breiman, Leo
Machine Learning, Vol. 24, Issue 2, p. 123-140
DOI: 10.1007/BF00058655

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks
journal, November 2019

Alawad, Mohammed; Gao, Shang; Qiu, John X.
Journal of the American Medical Informatics Association, Vol. 27, Issue 1
DOI: 10.1093/jamia/ocz153

Hierarchical attention networks for information extraction from cancer pathology reports
journal, November 2017

Gao, Shang; Young, Michael T.; Qiu, John X.
Journal of the American Medical Informatics Association, Vol. 25, Issue 3
DOI: 10.1093/jamia/ocx131

Convolutional neural networks for biomedical text classification
conference, September 2015

Rios, Anthony; Kavuluru, Ramakanth
Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
DOI: 10.1145/2808719.2808746

Deep active learning for classifying cancer pathology reports
journal, March 2021

De Angeli, Kevin; Gao, Shang; Alawad, Mohammed
BMC Bioinformatics, Vol. 22, Issue 1
DOI: 10.1186/s12859-021-04047-1

On Robustness and Transferability of Convolutional Neural Networks
conference, June 2021

Djolonga, Josip; Yung, Jessica; Tschannen, Michael
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/CVPR46437.2021.01619

Measuring Domain Shift for Deep Learning in Histopathology
journal, February 2021

Stacke, Karin; Eilertsen, Gabriel; Unger, Jonas
IEEE Journal of Biomedical and Health Informatics, Vol. 25, Issue 2
DOI: 10.1109/JBHI.2020.3032060

Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports
journal, January 2018

Qiu, John X.; Yoon, Hong-Jun; Fearn, Paul A.
IEEE Journal of Biomedical and Health Informatics, Vol. 22, Issue 1
DOI: 10.1109/JBHI.2017.2700722

Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
journal, August 2019

Miyato, Takeru; Maeda, Shin-Ichi; Koyama, Masanori
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, Issue 8
DOI: 10.1109/TPAMI.2018.2858821

On the Class Imbalance Problem
conference, October 2008

Guo, Xinjian; Yin, Yilong; Dong, Cailing
2008 Fourth International Conference on Natural Computation
DOI: 10.1109/ICNC.2008.871

Hierarchical Convolutional Attention Networks for Text Classification
conference, January 2018

Gao, Shang; Ramanathan, Arvind; Tourassi, Georgia
Proceedings of The Third Workshop on Representation Learning for NLP
DOI: 10.18653/v1/W18-3002

Survey on deep learning with class imbalance
journal, March 2019

Johnson, Justin M.; Khoshgoftaar, Taghi M.
Journal of Big Data, Vol. 6, Issue 1
DOI: 10.1186/s40537-019-0192-5

Experimental perspectives on learning from imbalanced data
conference, January 2007

Van Hulse, Jason; Khoshgoftaar, Taghi M.; Napolitano, Amri
Proceedings of the 24th international conference on Machine learning - ICML '07
DOI: 10.1145/1273496.1273614

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
conference, October 2021

Hendrycks, Dan; Basart, Steven; Mu, Norman
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/ICCV48922.2021.00823

Similar Records in DOE PAGES and OSTI.GOV collections:

Using ensembles and distillation to optimize the deployment of deep learning models for the classification of electronic cancer pathology reports

Journal Article De Angeli, Kevin ; Gao, Shang ; Blanchard, Andrew ; ... - JAMIA Open

One of the goals of the Surveillance, Epidemiology, and End Results (SEER) program is to estimate incidence, prevalence, and mortality of all cancers. To that end, cancer registries across the country maintain a massive database of cancer pathology reports which contain rich information to understand cancer trends. However, these reports are stored in the form of unstructured text, and human annotators are required to read and extract relevant information. In this article, we show that existing deep learning models for automating information extraction from cancer pathology reports can be significantly improved by using ensemble model distillation. We found that by ...
https://doi.org/10.1093/jamiaopen/ooac075

Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology

Journal Article Yoon, Hong-Jun ; Stanley, Christopher B. ; Christian, J. Blair ; ... - Cancer Biomarkers

With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have also become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. The privacy vulnerability of deep learning models for information extraction from medical textural contents needs to be quantified since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of the deep ...
https://doi.org/10.3233/cbm-210306

A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification

Journal Article Blanchard, Andrew E. ; Gao, Shang ; Yoon, Hong-Jun ; ... - IEEE Journal of Biomedical and Health Informatics

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used ...
https://doi.org/10.1109/JBHI.2022.3141976

Multimodal Data Representation with Deep Learning for Extracting Cancer Characteristics from Clinical Text

Conference Alawad, Mohammed ; Gao, Shang ; Alamudun, Folami ; ...

This paper presents a multimodal data representation to improve the performance of deep learning models for extracting cancer key characteristics from unstructured text in pathology reports. Specifically, in addition to using the text as the input to deep learning models, we use concept unique identifiers (CUIs) as another source of information to the models. We analyze the performance of different text and CUI data representations, including word embeddings and bag of embeddings (BOE), with a convolutional neural network (CNN) and a fully connected multilayer perceptron neural network (MLP-NN). The high level document embeddings from text and CUI inputs are combined ...

Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports

Journal Article Yoon, Hong-Jun ; Klasky, Hilda B. ; Gounley, John P. ; ... - Journal of Biomedical Informatics

Objective: In machine learning, it is apparent that the classification of the task performance increases if bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. Materials and Methods: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task ...
https://doi.org/10.1016/j.jbi.2020.103564
