Scalable and efficient learning from crowds with Gaussian processes

Morales-Álvarez, Pablo; Ruiz, Pablo; Santos-Rodríguez, Raúl; Molina, Rafael; Katsaggelos, Aggelos K.

doi:10.1016/j.inffus.2018.12.008

Abstract

Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. Finally, they stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.

Authors:

Morales-Álvarez, Pablo ^[1]; Ruiz, Pablo ^[2]; Santos-Rodríguez, Raúl ^[3]; Molina, Rafael ^[1]; Katsaggelos, Aggelos K. ^[2]

^[1] Univ. of Granada (Spain)
^[2] Northwestern Univ., Evanston, IL (United States)
^[3] University of Bristol (United Kingdom)

Publication Date: 2019-01-02

Research Org.: Northwestern Univ., Evanston, IL (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1801110

Alternate Identifier(s): OSTI ID: 1547904

Grant/Contract Number: NA0002520

Resource Type: Accepted Manuscript

Journal Name: Information Fusion

Additional Journal Information: Journal Volume: 52; Journal Issue: C; Journal ID: ISSN 1566-2535

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; computer science; scalable crowdsourcing; classification; Gaussian processes; Fourier features; Bayesian modeling; variational inference

Citation Formats


Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, and Katsaggelos, Aggelos K. Scalable and efficient learning from crowds with Gaussian processes. United States: N. p., 2019. Web. doi:10.1016/j.inffus.2018.12.008.


Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, & Katsaggelos, Aggelos K. Scalable and efficient learning from crowds with Gaussian processes. United States. https://doi.org/10.1016/j.inffus.2018.12.008


Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, and Katsaggelos, Aggelos K. 2019. "Scalable and efficient learning from crowds with Gaussian processes". United States. https://doi.org/10.1016/j.inffus.2018.12.008. https://www.osti.gov/servlets/purl/1801110.


                    
@article{osti_1801110,
  title        = {Scalable and efficient learning from crowds with Gaussian processes},
  author       = {Morales-Álvarez, Pablo and Ruiz, Pablo and Santos-Rodríguez, Raúl and Molina, Rafael and Katsaggelos, Aggelos K.},
  abstractNote = {Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. Finally, they stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.},
  doi          = {10.1016/j.inffus.2018.12.008},
  journal      = {Information Fusion},
  number       = {C},
  volume       = {52},
  place        = {United States},
  year         = {2019},
  month        = jan
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (Publisher)

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1016/j.inffus.2018.12.008

Citation Metrics:

Cited by: 11 works

Citation information provided by
Web of Science

Works referenced in this record:

Joint Data Filtering and Labeling Using Gaussian Processes and Alternating Direction Method of Multipliers
journal, July 2016

Ruiz, Pablo; Molina, Rafael; Katsaggelos, Aggelos K.
IEEE Transactions on Image Processing, Vol. 25, Issue 7
DOI: 10.1109/TIP.2016.2558472

Learning Supervised Topic Models for Classification and Regression from Crowds
journal, December 2017

Rodrigues, Filipe; Lourenco, Mariana; Ribeiro, Bernardete
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, Issue 12
DOI: 10.1109/TPAMI.2017.2648786

Learning from multiple annotators with varying expertise
journal, October 2013

Yan, Yan; Rosales, Rómer; Fung, Glenn
Machine Learning, Vol. 95, Issue 3
DOI: 10.1007/s10994-013-5412-1

Musical genre classification of audio signals
journal, July 2002

Tzanetakis, G.; Cook, P.
IEEE Transactions on Speech and Audio Processing, Vol. 10, Issue 5
DOI: 10.1109/TSA.2002.800560

Gravity Spy: integrating advanced LIGO detector characterization, machine learning, and citizen science
journal, February 2017

Zevin, M.; Coughlin, S.; Bahaadini, S.
Classical and Quantum Gravity, Vol. 34, Issue 6
DOI: 10.1088/1361-6382/aa5cea

Learning from crowds with variational Gaussian processes
journal, April 2019

Ruiz, Pablo; Morales-Álvarez, Pablo; Molina, Rafael
Pattern Recognition, Vol. 88
DOI: 10.1016/j.patcog.2018.11.021

Learning from crowdsourced labeled data: a survey
journal, July 2016

Zhang, Jing; Wu, Xindong; Sheng, Victor S.
Artificial Intelligence Review, Vol. 46, Issue 4
DOI: 10.1007/s10462-016-9491-9

Variational Inference: A Review for Statisticians
journal, July 2016

Blei, David M.; Kucukelbir, Alp; McAuliffe, Jon D.
Journal of the American Statistical Association, Vol. 112, Issue 518
DOI: 10.1080/01621459.2017.1285773

Remote Sensing Image Classification With Large-Scale Gaussian Processes
journal, February 2018

Morales-Alvarez, Pablo; Perez-Suay, Adrian; Molina, Rafael
IEEE Transactions on Geoscience and Remote Sensing, Vol. 56, Issue 2
DOI: 10.1109/TGRS.2017.2758922

Bayesian Active Remote Sensing Image Classification
journal, April 2014

Ruiz, Pablo; Mateos, Javier; Camps-Valls, Gustavo
IEEE Transactions on Geoscience and Remote Sensing, Vol. 52, Issue 4
DOI: 10.1109/TGRS.2013.2258468

Learning from multiple annotators: Distinguishing good from random labelers
journal, September 2013

Rodrigues, Filipe; Pereira, Francisco; Ribeiro, Bernardete
Pattern Recognition Letters, Vol. 34, Issue 12
DOI: 10.1016/j.patrec.2013.05.012

AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images
journal, May 2016

Albarqouni, Shadi; Baur, Christoph; Achilles, Felix
IEEE Transactions on Medical Imaging, Vol. 35, Issue 5
DOI: 10.1109/TMI.2016.2528120

Similar Records in DOE PAGES and OSTI.GOV collections:

Learning from crowds with variational Gaussian processes

Journal Article Ruiz, Pablo ; Morales-Alvarez, Pablo ; Molina, Rafael ; ... - Pattern Recognition

Solving a supervised learning problem requires to label a training set. This task is traditionally performed by an expert, who provides a label for each sample. The proliferation of social web services (e.g., Amazon Mechanical Turk) has introduced an alternative crowdsourcing approach. Anybody with a computer can register in one of these services and label, either partially or completely, a dataset. The effort of labeling is then shared between a great number of annotators. However, this approach introduces scientifically challenging problems such as combining the unknown expertise of the annotators, handling disagreements on the annotated samples, or detecting the existence ...
Cited by 21
https://doi.org/10.1016/j.patcog.2018.11.021

Improve Learning from Crowds via Generative Augmentation

Conference Chu, Zhendong ; Wang, Hongning - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

Crowdsourcing provides an efficient label collection schema for supervised machine learning. However, to control annotation cost, each instance in the crowdsourced data is typically annotated by a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to directly learn a classifier by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, which ...
https://doi.org/10.1145/3447548.3467409

Towards Efficient Uncertainty estimation in deep learning for robust energy prediction in crystal materials

Conference Bi, Sirui ; Fung, Victor ; Zhang, Jiaxin ; ...

In material science, recent studies have started to explore the potential of using deep learning to improve property prediction from high-fidelity simulations, e.g, density functional theory (DFT). However, the design spaces are sometimes too large and intractable to sample completely. This results in a critical question that is how to evaluate the confidence and robustness of the prediction. In this paper, we propose an efficient approach to estimate uncertainty in deep learning using a single forward pass and then apply it for robust prediction of the total energy in crystal lattice structures. Our approach is built upon the deep kernel ...

Scalable Pattern Matching in Metadata Graphs via Constraint Checking

Journal Article Reza, Tahsin ; Halawa, Hassan ; Ripeanu, Matei ; ... - ACM Transactions on Parallel Computing

Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: They do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in ...
https://doi.org/10.1145/3434391

Learning from Crowds by Modeling Common Confusions

Conference Wang, Hongning ; Chu, Zhendong ; Ma, Jing

Crowdsourcing provides a practical way to obtain large amounts of labeled data at a low cost. However, the annotation quality of annotators varies considerably, which imposes new challenges in learning a high-quality model from the crowdsourced annotations. In this work, we provide a new perspective to decompose annotation noise into common noise and individual noise and differentiate the source of confusion based on instance difficulty and annotator expertise on a per-instance-annotator basis. We realize this new crowdsourcing model by an end-to-end learning solution with two types of noise adaptation layers: one is shared across annotators to capture their commonly shared ...
