DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Distance Metrics and Clustering Methods for Mixed-type Data

Journal Article · International Statistical Review
DOI: https://doi.org/10.1111/insr.12274 · OSTI ID:1459931

Despite the abundance of clustering techniques and algorithms, clustering mixed interval (continuous) and categorical (nominal and/or ordinal) scale data remains a challenging problem. To identify the most effective approaches for clustering mixed-type data, we use both theoretical and empirical analyses to present a critical review of the strengths and weaknesses of the methods identified in the literature. Guidelines on which approaches to use under different scenarios are provided, along with potential directions for future research.

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1459931
Report Number(s):
SAND--2018-7091J; 665360; Journal ID: ISSN 0306-7734
Journal Information:
International Statistical Review, Vol. 87, Issue 1; ISSN 0306-7734
Country of Publication:
United States
Language:
English

Cited By (1)

Distance‐based clustering of mixed data
  • van de Velden, Michel; Iodice D'Enza, Alfonso; Markos, Angelos
  • Wiley Interdisciplinary Reviews: Computational Statistics, Vol. 11, Issue 3 https://doi.org/10.1002/wics.1456
journal November 2018