Scalable Bayesian Nonparametric Clustering and Classification
Abstract
We develop a scalable multi-step Monte Carlo algorithm for inference under a large class of nonparametric Bayesian models for clustering and classification. Each step is “embarrassingly parallel” and can be implemented using the same Markov chain Monte Carlo sampler. The simplicity and generality of our approach makes inference for a wide range of Bayesian nonparametric mixture models applicable to large datasets. Specifically, we apply the approach to inference under a product partition model with regression on covariates. We show results for inference with two motivating data sets: a large set of electronic health records (EHR) and a bank telemarketing dataset. We find interesting clusters and competitive classification performance relative to other widely used competing classifiers. Supplementary materials for this article are available online.
- Authors:
- Ni, Yang; Müller, Peter; Diesendruck, Maurice; Williamson, Sinead; Zhu, Yitan; Ji, Yuan
- Texas A & M Univ., College Station, TX (United States). Dept. of Statistics; Univ. of Texas, Austin, TX (United States). Dept. of Statistics and Data Sciences
- Univ. of Texas, Austin, TX (United States). Dept. of Mathematics
- Univ. of Texas, Austin, TX (United States). Dept. of Statistics and Data Sciences
- Univ. of Texas, Austin, TX (United States). Dept. of Information, Risk, and Operations Management
- NorthShore Univ. HealthSystem, Evanston, IL (United States). Program for Computational Genomics and Medicine
- Univ. of Chicago, IL (United States). Dept. of Public Health Sciences
- Publication Date:
- Research Org.:
- Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Org.:
- National Institutes of Health (NIH)
- OSTI Identifier:
- 1566123
- Grant/Contract Number:
- AC02-06CH11357
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Journal of Computational and Graphical Statistics
- Additional Journal Information:
- Journal Volume: none; Journal Issue: none; Journal ID: ISSN 1061-8600
- Publisher:
- Taylor & Francis
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Electronic health records; non-conjugate models; parallel computing; product partition models
Citation Formats
Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, and Ji, Yuan. Scalable Bayesian Nonparametric Clustering and Classification. United States: N. p., 2019. Web. doi:10.1080/10618600.2019.1624366.
Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, & Ji, Yuan. Scalable Bayesian Nonparametric Clustering and Classification. United States. doi:https://doi.org/10.1080/10618600.2019.1624366
Ni, Yang, Müller, Peter, Diesendruck, Maurice, Williamson, Sinead, Zhu, Yitan, and Ji, Yuan. 2019. "Scalable Bayesian Nonparametric Clustering and Classification". United States. doi:https://doi.org/10.1080/10618600.2019.1624366. https://www.osti.gov/servlets/purl/1566123.
@article{osti_1566123,
title = {Scalable Bayesian Nonparametric Clustering and Classification},
author = {Ni, Yang and Müller, Peter and Diesendruck, Maurice and Williamson, Sinead and Zhu, Yitan and Ji, Yuan},
abstractNote = {We develop a scalable multi-step Monte Carlo algorithm for inference under a large class of nonparametric Bayesian models for clustering and classification. Each step is “embarrassingly parallel” and can be implemented using the same Markov chain Monte Carlo sampler. The simplicity and generality of our approach makes inference for a wide range of Bayesian nonparametric mixture models applicable to large datasets. Specifically, we apply the approach to inference under a product partition model with regression on covariates. We show results for inference with two motivating data sets: a large set of electronic health records (EHR) and a bank telemarketing dataset. We find interesting clusters and competitive classification performance relative to other widely used competing classifiers. Supplementary materials for this article are available online.},
doi = {10.1080/10618600.2019.1624366},
journal = {Journal of Computational and Graphical Statistics},
place = {United States},
year = {2019},
month = {7}
}
Works referenced in this record:
MapReduce: simplified data processing on large clusters
journal, January 2008
- Dean, Jeffrey; Ghemawat, Sanjay; Mehta, Brijesh
- Communications of the ACM, Vol. 51, Issue 1
Comparing clusterings—an information based distance
journal, May 2007
- Meilă, Marina
- Journal of Multivariate Analysis, Vol. 98, Issue 5
Bayes and big data: the consensus Monte Carlo algorithm
journal, February 2016
- Scott, Steven L.; Blocker, Alexander W.; Bonassi, Fernando V.
- International Journal of Management Science and Engineering Management, Vol. 11, Issue 2
Two-Stage Metropolis-Hastings for Tall Data
journal, March 2018
- Payne, Richard D.; Mallick, Bani K.
- Journal of Classification, Vol. 35, Issue 1
Bayesian density estimation and model selection using nonparametric hierarchical mixtures
journal, April 2010
- Argiento, Raffaele; Guglielmi, Alessandra; Pievatolo, Antonio
- Computational Statistics & Data Analysis, Vol. 54, Issue 4
Multivariate mixtures of normals with unknown number of components
journal, January 2006
- Dellaportas, Petros; Papageorgiou, Ioulia
- Statistics and Computing, Vol. 16, Issue 1
Speeding Up MCMC by Efficient Data Subsampling
journal, July 2018
- Quiroz, Matias; Kohn, Robert; Villani, Mattias
- Journal of the American Statistical Association, Vol. 114, Issue 526
Partition models
journal, January 1990
- Hartigan, J. A.
- Communications in Statistics - Theory and Methods, Vol. 19, Issue 8
Data clustering: 50 years beyond K-means
journal, June 2010
- Jain, Anil K.
- Pattern Recognition Letters, Vol. 31, Issue 8
A data-driven approach to predict the success of bank telemarketing
journal, June 2014
- Moro, Sérgio; Cortez, Paulo; Rita, Paulo
- Decision Support Systems, Vol. 62
Markov Chain Sampling Methods for Dirichlet Process Mixture Models
journal, June 2000
- Neal, Radford M.
- Journal of Computational and Graphical Statistics, Vol. 9, Issue 2
BART: Bayesian additive regression trees
journal, March 2010
- Chipman, Hugh A.; George, Edward I.; McCulloch, Robert E.
- The Annals of Applied Statistics, Vol. 4, Issue 1
MCMC for Normalized Random Measure Mixture Models
journal, August 2013
- Favaro, Stefano; Teh, Yee Whye
- Statistical Science, Vol. 28, Issue 3
Hierarchical Mixture Modeling With Normalized Inverse-Gaussian Priors
journal, December 2005
- Lijoi, Antonio; Mena, Ramsés H.; Prünster, Igor
- Journal of the American Statistical Association, Vol. 100, Issue 472
Bayesian nonparametric classification for spectroscopy data
journal, October 2014
- Gutiérrez, Luis; Gutiérrez-Peña, Eduardo; Mena, Ramsés H.
- Computational Statistics & Data Analysis, Vol. 78
Modeling with Normalized Random Measure Mixture Models
journal, August 2013
- Barrios, Ernesto; Lijoi, Antonio; Nieto-Barajas, Luis E.
- Statistical Science, Vol. 28, Issue 3
Fitting semiparametric random effects models to large data sets
journal, December 2006
- Pennell, M. L.; Dunson, D. B.
- Biostatistics, Vol. 8, Issue 4
Controlling the reinforcement in Bayesian non-parametric mixture models
journal, September 2007
- Lijoi, Antonio; Mena, Ramsés H.; Prünster, Igor
- Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 69, Issue 4
Optimal Bayesian estimators for latent variable cluster models
journal, October 2017
- Rastelli, Riccardo; Friel, Nial
- Statistics and Computing, Vol. 28, Issue 6
Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process?
journal, February 2015
- De Blasi, Pierpaolo; Favaro, Stefano; Lijoi, Antonio
- IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, Issue 2
On a Class of Bayesian Nonparametric Estimates: I. Density Estimates
journal, March 1984
- Lo, Albert Y.
- The Annals of Statistics, Vol. 12, Issue 1
Heterogeneous reciprocal graphical models: hRGM
journal, October 2017
- Ni, Yang; Müller, Peter; Zhu, Yitan
- Biometrics, Vol. 74, Issue 2
Bayesian nonparametric clustering for large data sets
journal, February 2018
- Zuanetti, Daiane Aparecida; Müller, Peter; Zhu, Yitan
- Statistics and Computing, Vol. 29, Issue 2
Sparse covariance estimation in heterogeneous samples
journal, January 2011
- Rodríguez, Abel; Lenkoski, Alex; Dobra, Adrian
- Electronic Journal of Statistics, Vol. 5
Estimating Normal Means with a Dirichlet Process Prior
journal, March 1994
- Escobar, Michael D.
- Journal of the American Statistical Association, Vol. 89, Issue 425
The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator
journal, April 1997
- Pitman, Jim; Yor, Marc
- The Annals of Probability, Vol. 25, Issue 2
Defining Predictive Probability Functions for Species Sampling Models
journal, May 2013
- Lee, Jaeyong; Quintana, Fernando A.; Müller, Peter
- Statistical Science, Vol. 28, Issue 2
On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion)
journal, November 1997
- Richardson, Sylvia; Green, Peter J.
- Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 59, Issue 4
Fast Bayesian Inference in Dirichlet Process Mixture Models
journal, January 2011
- Wang, Lianming; Dunson, David B.
- Journal of Computational and Graphical Statistics, Vol. 20, Issue 1
Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)
journal, June 2018
- Wade, Sara; Ghahramani, Zoubin
- Bayesian Analysis, Vol. 13, Issue 2
A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis
journal, September 2014
- Fahad, Adil; Alshatri, Najlaa; Tari, Zahir
- IEEE Transactions on Emerging Topics in Computing, Vol. 2, Issue 3
Estimating Mixture of Dirichlet Process Models
journal, June 1998
- MacEachern, Steven N.; Müller, Peter
- Journal of Computational and Graphical Statistics, Vol. 7, Issue 2
Quantum Support Vector Machine for Big Data Classification
journal, September 2014
- Rebentrost, Patrick; Mohseni, Masoud; Lloyd, Seth
- Physical Review Letters, Vol. 113, Issue 13
Support-vector networks
journal, September 1995
- Cortes, Corinna; Vapnik, Vladimir
- Machine Learning, Vol. 20, Issue 3
Bayesian Model-Based Clustering Procedures
journal, September 2007
- Lau, John W.; Green, Peter J.
- Journal of Computational and Graphical Statistics, Vol. 16, Issue 3
Piecewise Approximate Bayesian Computation: fast inference for discretely observed Markov models using a factorised posterior distribution
journal, November 2013
- White, S. R.; Kypraios, T.; Preston, S. P.
- Statistics and Computing, Vol. 25, Issue 2
Algorithm AS 136: A K-Means Clustering Algorithm
journal, January 1979
- Hartigan, J. A.; Wong, M. A.
- Applied Statistics, Vol. 28, Issue 1
Regularization Paths for Generalized Linear Models via Coordinate Descent
journal, January 2010
- Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert
- Journal of Statistical Software, Vol. 33, Issue 1
A Product Partition Model With Regression on Covariates
journal, January 2011
- Müller, Peter; Quintana, Fernando; Rosner, Gary L.
- Journal of Computational and Graphical Statistics, Vol. 20, Issue 1
Semiparametric Bayesian classification with longitudinal markers
journal, March 2007
- Cruz-Mesía, Rolando De la; Quintana, Fernando A.; Müller, Peter
- Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 56, Issue 2
A “Density-Based” Algorithm for Cluster Analysis Using Species Sampling Gaussian Mixture Models
journal, October 2014
- Argiento, Raffaele; Cremaschi, Andrea; Guglielmi, Alessandra
- Journal of Computational and Graphical Statistics, Vol. 23, Issue 4
The Representation of Partition Structures
journal, October 1978
- Kingman, J. F. C.
- Journal of the London Mathematical Society, Vol. s2-18, Issue 2
Sampling the Dirichlet Mixture Model with Slices
journal, January 2007
- Walker, Stephen G.
- Communications in Statistics - Simulation and Computation, Vol. 36, Issue 1
Identifying Mixtures of Mixtures Using Bayesian Estimation
journal, October 2016
- Malsiner-Walli, Gertraud; Frühwirth-Schnatter, Sylvia; Grün, Bettina
- Journal of Computational and Graphical Statistics, Vol. 26, Issue 2
A scalable bootstrap for massive data
journal, March 2014
- Kleiner, Ariel; Talwalkar, Ameet; Sarkar, Purnamrita
- Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 76, Issue 4
Works referencing / citing this record:
Bayesian Double Feature Allocation for Phenotyping With Electronic Health Records
journal, December 2019
- Ni, Yang; Müller, Peter; Ji, Yuan
- Journal of the American Statistical Association