CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules

Hengartner, Nicolas; Cuellar, Leticia; Wu, Xiao-Cheng; Tourassi, Georgia; Qiu, John X.; Christian, Blair; Bhattacharya, Tanmoy

doi:10.1186/s12859-018-2503-9

Abstract

Manual extraction of information from electronic pathology (epath) reports to populate the Surveillance, Epidemiology, and End Results (SEER) database is labor intensive. Automating the data extraction using machine learning (ML) and natural language processing (NLP) is desirable to reduce the human labor required to populate the SEER database and to improve the timeliness of the data. This enables scaling up registry efficiency and collection of new data elements. To ensure the integrity, quality, and continuity of the SEER data, the misclassification error of ML and NLP algorithms needs to be negligible. Current algorithms fail to achieve the precision of human experts, who can bring additional information to their assessments. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks. The purpose of our study is to develop triage rules that partially automate registry workflow to improve the precision of the auto-extracted information. This paper presents a mathematical framework to improve the precision of a classifier beyond that of the Bayes classifier by selectively classifying the items that are most likely to be correct. This results in a triage rule that classifies only a subset of the items. We characterize the optimal triage rule and demonstrate its usefulness in the problem of classifying cancer site from electronic pathology reports to achieve a desired precision. From the mathematical formalism, we propose a heuristic estimate for the triage rule based on post-processing the soft-max output from standard machine learning algorithms. We show, in test cases, that the triage rule significantly improves the classification accuracy.
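The abstract mentions a heuristic triage rule obtained by post-processing soft-max output, but this record does not spell the rule out. As a rough illustration only, one simple rule of that general kind classifies an item when its largest soft-max probability reaches a threshold and refuses (defers to a human) otherwise. The function names `triage` and `evaluate`, the threshold values, and the toy probabilities below are all assumptions made for this sketch and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def triage(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return the arg-max class if its probability reaches `threshold`.

    Returns None to signal a refusal (the item is left for manual review).
    This thresholding rule is an assumed stand-in, not the paper's rule.
    """
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best if probs[best] >= threshold else None


def evaluate(
    batch: Sequence[Sequence[float]], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on the items that were classified)."""
    decisions: List[Optional[int]] = [triage(p, threshold) for p in batch]
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(kept) / len(labels)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Hypothetical soft-max outputs for four reports over three classes.
batch = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.50, 0.45, 0.05],
]
labels = [0, 1, 1, 1]

if __name__ == "__main__":
    for t in (0.0, 0.8):
        cov, acc = evaluate(batch, labels, t)
        print(f"threshold={t}: coverage={cov}, accuracy={acc}")
```

On this toy data, raising the threshold trades coverage for accuracy on the classified subset, which is the trade-off a triage rule is meant to control.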

Authors:

Hengartner, Nicolas ^[1]; Cuellar, Leticia ^[1]; Wu, Xiao-Cheng ^[2]; Tourassi, Georgia ^[3]; Qiu, John X. ^[3]; Christian, Blair ^[3]; Bhattacharya, Tanmoy ^[1]

^[1] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
^[2] Louisiana State Univ., New Orleans, LA (United States)
^[3] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Publication Date: 2018-12-21

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Org.: USDOE

OSTI Identifier: 1545570

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: BMC Bioinformatics

Additional Journal Information: Journal Volume: 19; Journal Issue: S18; Journal ID: ISSN 1471-2105

Publisher: BioMed Central

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; 60 APPLIED LIFE SCIENCES; Machine learning; Classification

Citation Formats


Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, and Bhattacharya, Tanmoy. CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2503-9.

Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, & Bhattacharya, Tanmoy. CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules. United States. https://doi.org/10.1186/s12859-018-2503-9

Hengartner, Nicolas, Cuellar, Leticia, Wu, Xiao-Cheng, Tourassi, Georgia, Qiu, John X., Christian, Blair, and Bhattacharya, Tanmoy. 2018. "CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules". United States. https://doi.org/10.1186/s12859-018-2503-9. https://www.osti.gov/servlets/purl/1545570.


                    
@article{osti_1545570,
  title        = {CAT: computer aided triage improving upon the Bayes risk through ε-refusal triage rules},
  author       = {Hengartner, Nicolas and Cuellar, Leticia and Wu, Xiao-Cheng and Tourassi, Georgia and Qiu, John X. and Christian, Blair and Bhattacharya, Tanmoy},
  abstractNote = {Manual extraction of information from electronic pathology (epath) reports to populate the Surveillance, Epidemiology, and End Results (SEER) database is labor intensive. Automating the data extraction using machine learning (ML) and natural language processing (NLP) is desirable to reduce the human labor required to populate the SEER database and to improve the timeliness of the data. This enables scaling up registry efficiency and collection of new data elements. To ensure the integrity, quality, and continuity of the SEER data, the misclassification error of ML and NLP algorithms needs to be negligible. Current algorithms fail to achieve the precision of human experts, who can bring additional information to their assessments. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks. The purpose of our study is to develop triage rules that partially automate registry workflow to improve the precision of the auto-extracted information. This paper presents a mathematical framework to improve the precision of a classifier beyond that of the Bayes classifier by selectively classifying the items that are most likely to be correct. This results in a triage rule that classifies only a subset of the items. We characterize the optimal triage rule and demonstrate its usefulness in the problem of classifying cancer site from electronic pathology reports to achieve a desired precision. From the mathematical formalism, we propose a heuristic estimate for the triage rule based on post-processing the soft-max output from standard machine learning algorithms. We show, in test cases, that the triage rule significantly improves the classification accuracy.},
  doi          = {10.1186/s12859-018-2503-9},
  journal      = {BMC Bioinformatics},
  number       = {S18},
  volume       = {19},
  place        = {United States},
  year         = {2018},
  month        = {12}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1186/s12859-018-2503-9

Citation Metrics:

Cited by: 2 works

Citation information provided by Web of Science

Works referenced in this record:

On the Problem of the Most Efficient Tests of Statistical Hypotheses
journal, January 1933

Neyman, J.; Pearson, E. S.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 231, Issue 694-706
DOI: 10.1098/rsta.1933.0009

Works referencing / citing this record:

AI Meets Exascale Computing: Advancing Cancer Research With Large-Scale High Performance Computing
journal, October 2019

Bhattacharya, Tanmoy; Brettin, Thomas; Doroshow, James H.
Frontiers in Oncology, Vol. 9
DOI: 10.3389/fonc.2019.00984

Similar Records in DOE PAGES and OSTI.GOV collections:

FrESCO

Software: Spannaus, Adam; Gounley, John; Chandra Shekar, Mayanka; ...

The National Cancer Institute (NCI) monitors population level cancer trends as part of its Surveillance, Epidemiology, and End Results (SEER) program. This program consists of state or regional level cancer registries which collect, analyze, and annotate cancer pathology reports. From these annotated pathology reports, each individual registry aggregates cancer phenotype information and summary statistics about cancer prevalence to facilitate population level monitoring of cancer incidence. Extracting cancer phenotype from these reports is a labor intensive task, requiring specialized knowledge about the reports and cancer. Automating this information extraction process from cancer pathology reports has the potential to improve not only ...
https://doi.org/10.11578/dc.20230227.2

FrESCO: Framework for Exploring Scalable Computational Oncology

Journal Article: Spannaus, Adam; Gounley, John; Shekar, Mayanka Chandra; ... - Journal of Open Source Software

The National Cancer Institute (NCI) monitors population level cancer trends as part of its Surveillance, Epidemiology, and End Results (SEER) program. This program consists of state or regional level cancer registries which collect, analyze, and annotate cancer pathology reports. From these annotated pathology reports, each individual registry aggregates cancer phenotype information from electronic health records. This data is then used to create summary statistics about cancer incidence and mortality to facilitate population health monitoring. Extracting phenotypic information from these reports is a labor intensive task, requiring specialized knowledge about the reports and cancer. Automating the information extraction process from cancer ...
https://doi.org/10.21105/joss.05345

Deep learning uncertainty quantification for clinical text classification

Journal Article: Peluso, Alina; Danciu, Ioana; Yoon, Hong-Jun; ... - Journal of Biomedical Informatics

Machine learning algorithms are expected to work side-by-side with humans in decision-making pipelines. Thus, the ability of classifiers to make reliable decisions is of paramount importance. Deep neural networks (DNNs) represent the state-of-the-art models to address real-world classification. Although the strength of activation in DNNs is often correlated with the network’s confidence, in-depth analyses are needed to establish whether they are well calibrated. In this paper, we demonstrate the use of DNN-based classification tools to benefit cancer registries by automating information extraction of disease at diagnosis and at surgery from electronic text pathology reports from the US National Cancer Institute ...
https://doi.org/10.1016/j.jbi.2023.104576

Using ensembles and distillation to optimize the deployment of deep learning models for the classification of electronic cancer pathology reports

Journal Article: De Angeli, Kevin; Gao, Shang; Blanchard, Andrew; ... - JAMIA Open

One of the goals of the Surveillance, Epidemiology, and End Results (SEER) program is to estimate incidence, prevalence, and mortality of all cancers. To that end, cancer registries across the country maintain a massive database of cancer pathology reports which contain rich information to understand cancer trends. However, these reports are stored in the form of unstructured text, and human annotators are required to read and extract relevant information. In this article, we show that existing deep learning models for automating information extraction from cancer pathology reports can be significantly improved by using ensemble model distillation. We found that by ...
https://doi.org/10.1093/jamiaopen/ooac075

Machine Learning for Automated Metadata Assignment in Buildings: Cooperative Research and Development (Final Report, CRADA Number CRD-18-00767)

Technical Report: Cutler, Dylan; Neely, Sean; Venne, Jean-Simon

RealTerm Energy and NREL have identified a shared vision to evaluate opportunities to facilitate the organization and assignment of metadata to building control system (BCS) data via industry-informed machine learning (ML). Manual metadata assignment is labor intensive and costly, slowing down any Energy Management and Information System (EMIS) deployment in the building space. This project aims to develop methodologies to accurately assign this metadata and significantly decrease the level of effort associated with deploying EMIS. The objective of this project is to identify/design methodologies to assign metadata to HVAC control points automatically. The identified methodologies will be programmed in analytics ...
https://doi.org/10.2172/1861065

