Title: Using ensembles and distillation to optimize the deployment of deep learning models for the classification of electronic cancer pathology reports

Journal Article · JAMIA Open
 [1]; [2]; [2]; [3]; [4]; [5]; [6]; [7]; [8]; [9]; [10]; [2]; [2]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  3. Univ. of Kentucky, Lexington, KY (United States)
  4. Louisiana State Univ., New Orleans, LA (United States)
  5. Rutgers Univ., New Brunswick, NJ (United States)
  6. Univ. of Utah, Salt Lake City, UT (United States)
  7. Univ. of Washington, Seattle, WA (United States)
  8. Univ. of New Mexico, Albuquerque, NM (United States)
  9. Information Management Services, Inc., Calverton, MD (United States)
  10. National Cancer Institute, Bethesda, MD (United States)

One of the goals of the Surveillance, Epidemiology, and End Results (SEER) program is to estimate the incidence, prevalence, and mortality of all cancers. To that end, cancer registries across the country maintain massive databases of cancer pathology reports, which contain rich information for understanding cancer trends. However, these reports are stored as unstructured text, and human annotators are required to read them and extract the relevant information. In this article, we show that existing deep learning models for automating information extraction from cancer pathology reports can be significantly improved by using ensemble model distillation. We found that by training multiple predictive models and transferring their knowledge to a single, low-resource model, we can reduce the number of highly confident wrong predictions. Our results show that the implemented methods could save thousands of hours of manual annotation.
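
The abstract does not spell out how the ensemble's knowledge is transferred, so the following is only a minimal sketch of one common form of ensemble distillation, not the authors' implementation. It assumes the teachers' class probabilities are averaged into soft labels and that the single student model is trained to minimize cross-entropy against them. The function names, the `temperature` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_labels(teacher_logits, temperature=1.0):
    """Average the class probabilities of several teacher models for one report."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_teachers for k in range(n_classes)]


def distillation_loss(student_logits, soft_labels, temperature=1.0):
    """Cross-entropy between the ensemble's soft labels and the student's output."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_labels, student) if t > 0)


# Toy example: three hypothetical teachers that disagree on one report.
teacher_logits = [
    [2.0, 0.5, -1.0],
    [1.5, 1.0, -0.5],
    [0.2, 1.8, -1.0],
]
soft = ensemble_soft_labels(teacher_logits)

# A student that mirrors the ensemble's uncertainty is penalized less than
# one that is highly confident in a single class.
hedged_loss = distillation_loss([1.0, 1.0, -1.0], soft)
overconfident_loss = distillation_loss([8.0, -4.0, -4.0], soft)
```

In this sketch, the soft labels carry the teachers' disagreement into the student's training signal, which is one plausible reason a distilled model would make fewer highly confident wrong predictions.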

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
Centers for Disease Control and Prevention (CDC); NCI Surveillance, Epidemiology and End Results (SEER); National Institutes of Health (NIH); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
Grant/Contract Number:
AC02-06CH11357; AC05-00OR22725; AC52-06NA25396; AC52-07NA27344
OSTI ID:
1887696
Journal Information:
Journal Name: JAMIA Open; Journal Volume: 5; Journal Issue: 3; ISSN 2574-2531
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
