U.S. Department of Energy
Office of Scientific and Technical Information

Path-BigBird: An AI-Driven Transformer Approach to Classification of Cancer Pathology Reports

Journal Article · JCO Clinical Cancer Informatics
DOI: https://doi.org/10.1200/cci.23.00148 · OSTI ID: 2320385
PURPOSE Surgical pathology reports are critical for cancer diagnosis and management. To accurately extract information about tumor characteristics from pathology reports in near real time, we explore the impact of using domain-specific transformer models that understand cancer pathology reports.
METHODS We built a pathology transformer model, Path-BigBird, by using 2.7 million pathology reports from six SEER cancer registries. We then compared different variations of Path-BigBird with two less computationally intensive methods: the Hierarchical Self-Attention Network (HiSAN) classification model and an off-the-shelf clinical transformer model (Clinical BigBird). We used five pathology information extraction tasks for evaluation: site, subsite, laterality, histology, and behavior. Model performance was evaluated by using macro and micro F1 scores.
RESULTS We found that Path-BigBird and Clinical BigBird outperformed the HiSAN in all tasks. Clinical BigBird performed better on the site and laterality tasks. Versions of the Path-BigBird model performed best on the two most difficult tasks: subsite (micro F1 score of 72.53, macro F1 score of 35.76) and histology (micro F1 score of 80.96, macro F1 score of 37.94). The largest performance gains over the HiSAN model were for histology, for which a Path-BigBird model increased the micro F1 score by 1.44 points and the macro F1 score by 3.55 points. Overall, the results suggest that a Path-BigBird model with a vocabulary derived from well-curated and deidentified data is the best-performing model.
CONCLUSION The Path-BigBird pathology transformer model improves automated information extraction from pathology reports. Although Path-BigBird outperforms Clinical BigBird and HiSAN, these less computationally expensive models still have utility when resources are constrained.
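The abstract reports both micro and macro F1 scores, and the two differ sharply on the subsite and histology tasks. The following is a minimal sketch of how the two averages are commonly computed for single-label classification; it is not the authors' evaluation code. The function name `f1_scores` and the toy labels are hypothetical and chosen only for illustration, and the values are on a 0 to 1 scale, whereas the abstract reports scores on a 0 to 100 scale.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted class p wrongly
            fn[t] += 1      # missed the true class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Made-up toy data: one frequent label and two rare ones that are never predicted.
y_true = ["C50"] * 8 + ["C34", "C18"]
y_pred = ["C50"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro score stays high because the frequent class dominates the pooled counts, while the macro score drops because each rare class contributes an F1 of zero with equal weight. A gap of this kind between micro and macro scores is typical of tasks with many rare classes.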
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC05-00OR22725; AC52-06NA25396; AC52-07NA27344; AC02-06CH11357
OSTI ID:
2320385
Journal Information:
JCO Clinical Cancer Informatics, Vol. 8; ISSN 2473-4276
Publisher:
ASCO Publications
Country of Publication:
United States
Language:
English
