DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

This content will become publicly available on July 19, 2020

Title: Scalable Bayesian Nonparametric Clustering and Classification

Abstract

We develop a scalable multi-step Monte Carlo algorithm for inference under a large class of nonparametric Bayesian models for clustering and classification. Each step is “embarrassingly parallel” and can be implemented using the same Markov chain Monte Carlo sampler. The simplicity and generality of our approach make inference for a wide range of Bayesian nonparametric mixture models applicable to large datasets. Specifically, we apply the approach to inference under a product partition model with regression on covariates. We show results for inference with two motivating datasets: a large set of electronic health records (EHR) and a bank telemarketing dataset. We find interesting clusters and competitive classification performance relative to widely used competing classifiers. Supplementary materials for this article are available online.
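
The abstract does not spell out the individual steps, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a much simpler model than the product partition model named above: one-dimensional Gaussian data with known variance, a zero-mean Gaussian prior on cluster means, and a Dirichlet process prior on the partition. All function names and parameters (`log_gain`, `gibbs_cluster`, `summarize`, `multi_step_cluster`, `sigma2`, `tau2`, `alpha`) are hypothetical. Data are split into shards, each shard is clustered independently, and the resulting local clusters are then clustered again by the same sampler, with each local cluster treated as a frozen block described by its count and sum.

```python
import math
import random


def log_gain(n, s, n_i, s_i, sigma2=1.0, tau2=10.0):
    """Log marginal-likelihood gain, up to a term that does not depend on the
    cluster, from adding a block (count n_i, sum s_i) to a cluster (n, s).
    Assumes x ~ N(mu, sigma2) with prior mu ~ N(0, tau2)."""
    def f(m, t):
        prec = 1.0 / tau2 + m / sigma2
        return -0.5 * math.log(tau2 * prec) + (t / sigma2) ** 2 / (2.0 * prec)
    return f(n + n_i, s + s_i) - f(n, s)


def gibbs_cluster(items, alpha=1.0, sweeps=30, seed=0):
    """Collapsed Gibbs sampler over a partition of `items`.
    Each item is a (count, sum) block; a single observation x is (1, x)."""
    rng = random.Random(seed)
    labels = list(range(len(items)))            # start with every item alone
    stats = {k: [n, s] for k, (n, s) in enumerate(items)}
    next_label = len(items)
    for _ in range(sweeps):
        for i, (n_i, s_i) in enumerate(items):
            old = labels[i]                     # remove item i from its cluster
            stats[old][0] -= n_i
            stats[old][1] -= s_i
            if stats[old][0] == 0:
                del stats[old]
            keys = list(stats)
            # Dirichlet process weight for moving a block of size n_i into a
            # cluster of size n_k: Gamma(n_k + n_i) / Gamma(n_k).
            logw = [math.lgamma(stats[k][0] + n_i) - math.lgamma(stats[k][0])
                    + log_gain(stats[k][0], stats[k][1], n_i, s_i)
                    for k in keys]
            # Weight for opening a new cluster: alpha * Gamma(n_i).
            keys.append(next_label)
            logw.append(math.log(alpha) + math.lgamma(n_i)
                        + log_gain(0, 0.0, n_i, s_i))
            top = max(logw)
            weights = [math.exp(w - top) for w in logw]
            new = rng.choices(keys, weights=weights)[0]
            if new == next_label:
                stats[new] = [0, 0.0]
                next_label += 1
            stats[new][0] += n_i
            stats[new][1] += s_i
            labels[i] = new
    return labels


def summarize(items, labels):
    """Collapse items sharing a label into one (count, sum) block each."""
    agg = {}
    for (n, s), k in zip(items, labels):
        block = agg.setdefault(k, [0, 0.0])
        block[0] += n
        block[1] += s
    return [tuple(block) for block in agg.values()]


def multi_step_cluster(data, n_shards=4, seed=0):
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    shards = [data[j::n_shards] for j in range(n_shards)]
    local = []
    # Step 1: shards are independent, so this loop could be handed to a
    # process pool; it is kept sequential here for simplicity.
    for j, shard in enumerate(shards):
        items = [(1, x) for x in shard]
        local.extend(summarize(items, gibbs_cluster(items, seed=seed + j)))
    # Step 2: the same sampler, now applied to the local cluster summaries.
    return summarize(local, gibbs_cluster(local, seed=seed))
```

Calling `multi_step_cluster(data)` on a list of floats returns one `(count, sum)` pair per global cluster. Because the second step only moves whole blocks, it works from summaries and never revisits individual observations.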

Authors:
Ni, Yang [1]; Müller, Peter [2]; Diesendruck, Maurice [3]; Williamson, Sinead [4]; Zhu, Yitan [5]; Ji, Yuan [6]
  1. Texas A&M Univ., College Station, TX (United States). Dept. of Statistics; Univ. of Texas, Austin, TX (United States). Dept. of Statistics and Data Sciences
  2. Univ. of Texas, Austin, TX (United States). Dept. of Mathematics
  3. Univ. of Texas, Austin, TX (United States). Dept. of Statistics and Data Sciences
  4. Univ. of Texas, Austin, TX (United States). Dept. of Information, Risk, and Operations Management
  5. NorthShore Univ. HealthSystem, Evanston, IL (United States). Program for Computational Genomics and Medicine
  6. Univ. of Chicago, IL (United States). Dept. of Public Health Sciences
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
National Institutes of Health (NIH)
OSTI Identifier:
1566123
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Computational and Graphical Statistics
Additional Journal Information:
Journal ID: ISSN 1061-8600
Publisher:
Taylor & Francis
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Electronic health records; non-conjugate models; parallel computing; product partition models

Citation Formats

Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, and Ji, Yuan. Scalable Bayesian Nonparametric Clustering and Classification. United States: N. p., 2019. Web. doi:10.1080/10618600.2019.1624366.
Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, & Ji, Yuan. Scalable Bayesian Nonparametric Clustering and Classification. United States. doi:10.1080/10618600.2019.1624366.
Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, and Ji, Yuan. "Scalable Bayesian Nonparametric Clustering and Classification". United States. doi:10.1080/10618600.2019.1624366.
@article{osti_1566123,
title = {Scalable Bayesian Nonparametric Clustering and Classification},
author = {Ni, Yang and Müller, Peter and Diesendruck, Maurice and Williamson, Sinead and Zhu, Yitan and Ji, Yuan},
doi = {10.1080/10618600.2019.1624366},
journal = {Journal of Computational and Graphical Statistics},
place = {United States},
year = {2019},
month = {7}
}

Citation Metrics:
Cited by: 1 work
Citation information provided by Web of Science

