Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types
Journal Article · Journal of Biomedical Informatics
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Vanderbilt Univ., Nashville, TN (United States)
- Univ. of Kentucky, Lexington, KY (United States)
- Louisiana State Univ., New Orleans, LA (United States)
- Rutgers Univ., New Brunswick, NJ (United States)
- Univ. of Utah, Salt Lake City, UT (United States)
- Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
- Univ. of New Mexico, Albuquerque, NM (United States)
- California Dept. of Public Health, Sacramento, CA (United States)
- Information Management Services, Inc., Calverton, MD (United States)
- National Cancer Institute, Bethesda, MD (United States)
In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
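The distinction the abstract draws between macro and micro F1 can be illustrated with a minimal sketch. The toy labels below ("common", "rare") and the helper `f1_scores` are hypothetical and are not taken from the article; they only show why a model that ignores a rare class can still score well on micro F1 while scoring poorly on macro F1.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Micro F1 pools the counts over all classes, so frequent classes dominate.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / denom if denom else 0.0

    # Macro F1 is the unweighted mean of per-class F1, so each class counts equally.
    per_class = []
    for c in labels:
        d = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / d if d else 0.0)
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Hypothetical toy data: 95 reports of a common type, 5 of a rare type,
# and a classifier that always predicts the common type.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints "micro F1 = 0.950, macro F1 = 0.487"
```

In this toy case the rare class contributes an F1 of 0, which halves the macro score but barely moves the micro score. That is the sense in which a method can lead on one metric and trail on the other.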
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- Centers for Disease Control and Prevention (CDC); Huntsman Cancer Foundation; National Cancer Institute (NCI); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC); University of Utah
- Grant/Contract Number:
- AC02-06CH11357; AC05-00OR22725; AC52-06NA25396; AC52-07NA27344
- OSTI ID:
- 1884003
- Journal Information:
- Journal of Biomedical Informatics, Vol. 125, Issue 1; ISSN 1532-0464
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
Similar Records
Characterizing Quantum Classifier Utility in Natural Language Processing Workflows
Conference · Fri Sep 01 00:00:00 EDT 2023 · OSTI ID:2397467
Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports
Journal Article · Tue May 02 20:00:00 EDT 2017 · IEEE Journal of Biomedical and Health Informatics · OSTI ID:1408007
Deformable phrase level attention: A flexible approach for improving AI based medical coding
Journal Article · Fri Nov 14 19:00:00 EST 2025 · Artificial Intelligence in Medicine · OSTI ID:3020940