Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning
Abstract
Transfer learning (TL) techniques can enable effective learning in data-scarce domains by allowing one to re-purpose data or scientific knowledge available in relevant source domains for predictive tasks in a target domain of interest. In this Data in Brief article, we present a synthetic dataset for binary classification in the context of Bayesian transfer learning, which can be used for the design and evaluation of TL-based classifiers. For this purpose, we consider numerous combinations of classification settings, based on which we simulate a diverse set of feature-label distributions with varying learning complexity. For each set of model parameters, we provide a pair of target and source datasets that have been jointly sampled from the underlying feature-label distributions in the target and source domains, respectively. For both target and source domains, the data in a given class and domain are normally distributed, where the distributions across domains are related to each other through a joint prior. To ensure the consistency of the classification complexity across the provided datasets, we have controlled the Bayes error such that it is maintained within a range of predefined values that mimic realistic classification scenarios across different relatedness levels. The provided datasets may serve as useful resources for designing and benchmarking transfer learning schemes for binary classification as well as the estimation of classification error.
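As a rough illustration of the generation scheme the abstract describes (not the authors' actual code — the hierarchy, means, variances, and `relatedness` parameter below are assumptions for the sketch), one can tie the class-conditional Gaussian means of the target and source domains through a shared draw from a joint prior, and use the standard closed form for the Bayes error of two equal-variance, equal-prior 1-D Gaussians:

```python
import math
import random

random.seed(0)

def sample_domain_pair(d=2, n=50, relatedness=0.9):
    """Hypothetical sketch: for each class, draw a shared mean from a joint
    prior, then perturb it independently to get correlated target/source
    class means, and sample Gaussian feature vectors around each."""
    noise = math.sqrt(1.0 - relatedness ** 2)
    data = {}
    for label in (0, 1):
        shared = [random.gauss(3.0 * label, 1.0) for _ in range(d)]  # joint-prior draw
        mu_t = [relatedness * m + noise * random.gauss(0.0, 1.0) for m in shared]
        mu_s = [relatedness * m + noise * random.gauss(0.0, 1.0) for m in shared]
        data[("target", label)] = [[random.gauss(m, 1.0) for m in mu_t] for _ in range(n)]
        data[("source", label)] = [[random.gauss(m, 1.0) for m in mu_s] for _ in range(n)]
    return data

def bayes_error_1d(mu0, mu1, sigma):
    """Bayes error of two equal-variance 1-D Gaussians with equal class
    priors: Phi(-|mu1 - mu0| / (2 * sigma)), via the complementary
    error function since Phi(-x) = erfc(x / sqrt(2)) / 2."""
    x = abs(mu1 - mu0) / (2.0 * sigma)
    return 0.5 * math.erfc(x / math.sqrt(2.0))
```

Controlling the Bayes error, as the dataset does, amounts to choosing the class-mean separation and variances so that `bayes_error_1d` (or its multivariate analogue) lands in a predefined range; e.g. identical means give error 0.5, and widely separated means drive it toward 0.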
- Authors:
- Maddouri, Omar; Qian, Xiaoning; Alexander, Francis J.; Dougherty, Edward R.; Yoon, Byung-Jun
- Publication Date:
- June 2022
- Research Org.:
- Brookhaven National Lab. (BNL), Upton, NY (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
- OSTI Identifier:
- 1861465
- Alternate Identifier(s):
- OSTI ID: 1872376
- Report Number(s):
- BNL-223068-2022-JAAM
Journal ID: ISSN 2352-3409; 108113; PII: S2352340922003237
- Grant/Contract Number:
- SC0019303; SC0012704; 1835690
- Resource Type:
- Published Article
- Journal Name:
- Data in Brief
- Additional Journal Information:
- Journal Name: Data in Brief Journal Volume: 42 Journal Issue: C; Journal ID: ISSN 2352-3409
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Bayesian transfer learning; Binary classification; Classifier design; Error estimation
Citation Formats
Maddouri, Omar, Qian, Xiaoning, Alexander, Francis J., Dougherty, Edward R., and Yoon, Byung-Jun. Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning. United States: N. p., 2022.
Web. doi:10.1016/j.dib.2022.108113.
Maddouri, Omar, Qian, Xiaoning, Alexander, Francis J., Dougherty, Edward R., & Yoon, Byung-Jun. Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning. United States. https://doi.org/10.1016/j.dib.2022.108113
Maddouri, Omar, Qian, Xiaoning, Alexander, Francis J., Dougherty, Edward R., and Yoon, Byung-Jun. 2022.
"Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning". United States. https://doi.org/10.1016/j.dib.2022.108113.
@article{osti_1861465,
title = {Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning},
author = {Maddouri, Omar and Qian, Xiaoning and Alexander, Francis J. and Dougherty, Edward R. and Yoon, Byung-Jun},
abstractNote = {Transfer learning (TL) techniques can enable effective learning in data scarce domains by allowing one to re-purpose data or scientific knowledge available in relevant source domains for predictive tasks in a target domain of interest. In this Data in Brief article, we present a synthetic dataset for binary classification in the context of Bayesian transfer learning, which can be used for the design and evaluation of TL-based classifiers. For this purpose, we consider numerous combinations of classification settings, based on which we simulate a diverse set of feature-label distributions with varying learning complexity. For each set of model parameters, we provide a pair of target and source datasets that have been jointly sampled from the underlying feature-label distributions in the target and source domains, respectively. For both target and source domains, the data in a given class and domain are normally distributed, where the distributions across domains are related to each other through a joint prior. To ensure the consistency of the classification complexity across the provided datasets, we have controlled the Bayes error such that it is maintained within a range of predefined values that mimic realistic classification scenarios across different relatedness levels. The provided datasets may serve as useful resources for designing and benchmarking transfer learning schemes for binary classification as well as the estimation of classification error.},
doi = {10.1016/j.dib.2022.108113},
journal = {Data in Brief},
number = {C},
volume = {42},
place = {United States},
year = {2022},
month = {jun}
}
Works referenced in this record:
Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning
journal, March 2022
- Maddouri, Omar; Qian, Xiaoning; Alexander, Francis J.
- Patterns, Vol. 3, Issue 3
Optimal Bayesian Transfer Learning
journal, July 2018
- Karbalayghareh, Alireza; Qian, Xiaoning; Dougherty, Edward R.
- IEEE Transactions on Signal Processing, Vol. 66, Issue 14