DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules

Abstract

Manual extraction of information from electronic pathology (epath) reports to populate the Surveillance, Epidemiology, and End Results (SEER) database is labor intensive. Automating the data extraction with machine learning (ML) and natural language processing (NLP) is desirable to reduce the human labor required to populate the SEER database and to improve the timeliness of the data. This enables scaling up registry efficiency and collection of new data elements. To ensure the integrity, quality, and continuity of the SEER data, the misclassification error of ML and NLP algorithms needs to be negligible. Current algorithms fail to achieve the precision of human experts, who can bring additional information to their assessments. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks. The purpose of our study is to develop triage rules that partially automate registry workflow and improve the precision of the auto-extracted information. This paper presents a mathematical framework to improve the precision of a classifier beyond that of the Bayes classifier by selectively classifying the items that are most likely to be classified correctly. This results in a triage rule that classifies only a subset of the items. We characterize the optimal triage rule and demonstrate its usefulness in the problem of classifying cancer site from electronic pathology reports to achieve a desired precision. From the mathematical formalism, we propose a heuristic estimate for the triage rule based on post-processing the soft-max output from standard machine learning algorithms. We show, in test cases, that the triage rule significantly improves the classification accuracy.
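The idea of post-processing soft-max output into a rule that classifies only some items can be illustrated with a minimal sketch. This is not the authors' estimator: it assumes the simplest possible refusal rule, a single confidence threshold on the largest soft-max score, tuned on held-out data. The names `triage`, `coverage_and_accuracy`, `pick_threshold`, and `tau` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Sequence, Tuple

REFUSE = None  # sentinel: item is not auto-classified and goes to a human reviewer


def triage(probs: Sequence[Sequence[float]], tau: float) -> List[Optional[int]]:
    """Return the arg-max label when the top soft-max score is >= tau, else REFUSE."""
    decisions: List[Optional[int]] = []
    for p in probs:
        best = max(range(len(p)), key=lambda k: p[k])
        decisions.append(best if p[best] >= tau else REFUSE)
    return decisions


def coverage_and_accuracy(
    decisions: Sequence[Optional[int]], labels: Sequence[int]
) -> Tuple[float, float]:
    """Fraction of items classified, and accuracy among the classified items."""
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not REFUSE]
    coverage = len(kept) / len(labels)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


def pick_threshold(
    probs: Sequence[Sequence[float]], labels: Sequence[int], target_accuracy: float
) -> Optional[float]:
    """Smallest candidate tau whose held-out accuracy on classified items meets the target.

    Candidates are the observed top scores, so every candidate keeps at least one item.
    Returns None when no candidate reaches the target.
    """
    for tau in sorted({max(p) for p in probs}):
        _, acc = coverage_and_accuracy(triage(probs, tau), labels)
        if acc >= target_accuracy:
            return tau
    return None
```

Raising `tau` trades coverage for accuracy on the classified subset: fewer reports are auto-coded, and the refused ones remain in the manual workflow. The threshold should be chosen on data not used to train the classifier, since soft-max scores are often overconfident on training data.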

Authors:
Hengartner, Nicolas [1]; Cuellar, Leticia [1]; Wu, Xiao-Cheng [2]; Tourassi, Georgia [3]; Qiu, John X. [3]; Christian, Blair [3]; Bhattacharya, Tanmoy [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  2. Louisiana State Univ., New Orleans, LA (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
2018-12-21
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1545570
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 19; Journal Issue: S18; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 60 APPLIED LIFE SCIENCES; Machine learning; Classification

Citation Formats

Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, and Bhattacharya, Tanmoy. CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2503-9.
Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, & Bhattacharya, Tanmoy. CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules. United States. https://doi.org/10.1186/s12859-018-2503-9
Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, and Bhattacharya, Tanmoy. 2018. "CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules". United States. https://doi.org/10.1186/s12859-018-2503-9. https://www.osti.gov/servlets/purl/1545570.
@article{osti_1545570,
title = {CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules},
author = {Hengartner, Nicolas and Cuellar, Leticia and Wu, Xiao-Cheng and Tourassi, Georgia and Qiu, John X. and Christian, Blair and Bhattacharya, Tanmoy},
abstractNote = {Manual extraction of information from electronic pathology (epath) reports to populate the Surveillance, Epidemiology, and End Results (SEER) database is labor intensive. Automating the data extraction with machine learning (ML) and natural language processing (NLP) is desirable to reduce the human labor required to populate the SEER database and to improve the timeliness of the data. This enables scaling up registry efficiency and collection of new data elements. To ensure the integrity, quality, and continuity of the SEER data, the misclassification error of ML and NLP algorithms needs to be negligible. Current algorithms fail to achieve the precision of human experts, who can bring additional information to their assessments. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks. The purpose of our study is to develop triage rules that partially automate registry workflow and improve the precision of the auto-extracted information. This paper presents a mathematical framework to improve the precision of a classifier beyond that of the Bayes classifier by selectively classifying the items that are most likely to be classified correctly. This results in a triage rule that classifies only a subset of the items. We characterize the optimal triage rule and demonstrate its usefulness in the problem of classifying cancer site from electronic pathology reports to achieve a desired precision. From the mathematical formalism, we propose a heuristic estimate for the triage rule based on post-processing the soft-max output from standard machine learning algorithms. We show, in test cases, that the triage rule significantly improves the classification accuracy.},
doi = {10.1186/s12859-018-2503-9},
journal = {BMC Bioinformatics},
number = {S18},
volume = {19},
place = {United States},
year = {2018},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science

Works referenced in this record:

On the Problem of the Most Efficient Tests of Statistical Hypotheses
journal, January 1933

  • Neyman, J.; Pearson, E. S.
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 231, Issue 694-706
  • DOI: 10.1098/rsta.1933.0009

Works referencing / citing this record:

AI Meets Exascale Computing: Advancing Cancer Research With Large-Scale High Performance Computing
journal, October 2019

  • Bhattacharya, Tanmoy; Brettin, Thomas; Doroshow, James H.
  • Frontiers in Oncology, Vol. 9
  • DOI: 10.3389/fonc.2019.00984