DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable and efficient learning from crowds with Gaussian processes

Abstract

Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. Finally, they stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.
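
To illustrate the Fourier-feature idea the abstract mentions, here is a minimal sketch of how random Fourier features can approximate a squared exponential (SE) kernel. It is a generic NumPy illustration and not the authors' implementation: the function names, the lengthscale, the number of features, and the toy data are all assumptions made for this example, and it covers neither the crowdsourcing model nor the variational inference.

```python
import numpy as np


def se_kernel(X, Y, lengthscale=1.0):
    """Exact SE kernel: k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def random_fourier_features(X, n_features, lengthscale=1.0, seed=0):
    """Map X (n x d) to Z (n x n_features) so that Z @ Z.T approximates the SE kernel.

    Hypothetical helper for illustration. Frequencies are drawn from the
    spectral density of the SE kernel, a Gaussian with std 1 / lengthscale.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


# Toy data (assumed for the example): 50 points in 3 dimensions.
data_rng = np.random.default_rng(123)
X = data_rng.normal(size=(50, 3))

K_exact = se_kernel(X, X, lengthscale=1.5)
Z = random_fourier_features(X, n_features=5000, lengthscale=1.5, seed=0)
K_approx = Z @ Z.T

max_abs_error = np.max(np.abs(K_exact - K_approx))
print("max abs error between exact and approximate kernel:", max_abs_error)
```

The practical point of such a feature map is that a model can work with the n x D matrix Z instead of the n x n kernel matrix, so the cost grows linearly with the number of training instances rather than cubically. The approximation error shrinks as the number of features D grows.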

Authors:
Morales-Álvarez, Pablo [1]; Ruiz, Pablo [2]; Santos-Rodríguez, Raúl [3]; Molina, Rafael [1]; Katsaggelos, Aggelos K. [2]
  1. Univ. of Granada (Spain)
  2. Northwestern Univ., Evanston, IL (United States)
  3. University of Bristol (United Kingdom)
Publication Date:
2019-01-02
Research Org.:
Northwestern Univ., Evanston, IL (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1801110
Alternate Identifier(s):
OSTI ID: 1547904
Grant/Contract Number:  
NA0002520
Resource Type:
Accepted Manuscript
Journal Name:
Information Fusion
Additional Journal Information:
Journal Volume: 52; Journal Issue: C; Journal ID: ISSN 1566-2535
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
96 KNOWLEDGE MANAGEMENT AND PRESERVATION; computer science; scalable crowdsourcing; classification; Gaussian processes; Fourier features; Bayesian modeling; variational inference

Citation Formats

Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, and Katsaggelos, Aggelos K. Scalable and efficient learning from crowds with Gaussian processes. United States: N. p., 2019. Web. doi:10.1016/j.inffus.2018.12.008.
Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, & Katsaggelos, Aggelos K. Scalable and efficient learning from crowds with Gaussian processes. United States. https://doi.org/10.1016/j.inffus.2018.12.008
Morales-Álvarez, Pablo, Ruiz, Pablo, Santos-Rodríguez, Raúl, Molina, Rafael, and Katsaggelos, Aggelos K. Wed . "Scalable and efficient learning from crowds with Gaussian processes". United States. https://doi.org/10.1016/j.inffus.2018.12.008. https://www.osti.gov/servlets/purl/1801110.
@article{osti_1801110,
title = {Scalable and efficient learning from crowds with Gaussian processes},
author = {Morales-Álvarez, Pablo and Ruiz, Pablo and Santos-Rodríguez, Raúl and Molina, Rafael and Katsaggelos, Aggelos K.},
abstractNote = {Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. Finally, they stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.},
doi = {10.1016/j.inffus.2018.12.008},
journal = {Information Fusion},
number = {C},
volume = {52},
place = {United States},
year = {2019},
month = {1}
}

Citation Metrics:
Cited by: 11 works
Citation information provided by
Web of Science
