U.S. Department of Energy
Office of Scientific and Technical Information

Classifying Cancer Pathology Reports with Hierarchical Self-Attention Networks

Journal Article · Artificial Intelligence in Medicine
 [1];  [1];  [1];  [1];  [1];  [1];  [1];  [2];  [2];  [3];  [4];  [1];  [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Health Data Sciences Institute, Computational Sciences and Engineering Division
  2. National Cancer Institute, Bethesda, MD (United States). Division of Cancer Control and Population Sciences, Surveillance Informatics Branch
  3. Louisiana State Univ., New Orleans, LA (United States). School of Public Health, Health Sciences Center, Louisiana Tumor Registry
  4. Information Management Services, Inc., Calverton, MD (United States)

We introduce hierarchical self-attention networks (HiSANs), a deep learning architecture designed for classifying pathology reports, and show that this architecture yields a new state of the art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks: site, laterality, behavior, histology, and grade. We compare the HiSAN against other machine learning and deep learning approaches commonly used on medical text data: Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state of the art). HiSANs outperform these text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to hierarchical attention networks, HiSANs are an order of magnitude faster to train and achieve about 1% better relative accuracy and 5% better relative macro F-score.
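The abstract reports two metrics, accuracy and macro F-score, and states improvements in relative terms. The following is a minimal sketch of how those quantities are commonly computed, not the authors' evaluation code. The function names, the class labels, and the example scores are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def relative_gain(new_score, old_score):
    """Relative improvement of new_score over old_score, e.g. 0.01 means 1%."""
    return (new_score - old_score) / old_score


# Hypothetical labels for a single task with one common and two rare classes.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

In this toy example the single error falls on a rare class, so macro F-score drops much further than accuracy does. That sensitivity to rare classes is the usual reason for reporting both metrics on imbalanced label sets.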

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1785219
Journal Information:
Journal Name: Artificial Intelligence in Medicine; Vol. 101; ISSN 0933-3657
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
