DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Distance Metrics and Clustering Methods for Mixed-type Data

Abstract

In spite of the abundance of clustering techniques and algorithms, clustering mixed interval (continuous) and categorical (nominal and/or ordinal) scale data remains a challenging problem. To identify the most effective approaches for clustering mixed-type data, we use both theoretical and empirical analyses to present a critical review of the strengths and weaknesses of the methods identified in the literature. Guidelines on which approaches to use under different scenarios are provided, along with potential directions for future research.

Authors:
Foss, Alexander H. [1]; Markatou, Marianthi [1]; Ray, Bonnie [2]
  1. Univ. at Buffalo, Buffalo, NY (United States)
  2. Arenadotio, New York, NY (United States)
Publication Date:
2018-06-21
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1459931
Report Number(s):
SAND-2018-7091J
Journal ID: ISSN 0306-7734; 665360
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
International Statistical Review
Additional Journal Information:
Journal Volume: 87; Journal Issue: 1; Journal ID: ISSN 0306-7734
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; discretisation; dummy coding; Gower's distance; k-mean clustering; machine learning; Mahalanobis distance; mixture model; multivariate data analysis; unsupervised learning

Citation Formats

Foss, Alexander H., Markatou, Marianthi, and Ray, Bonnie. Distance Metrics and Clustering Methods for Mixed-type Data. United States: N. p., 2018. Web. doi:10.1111/insr.12274.
Foss, Alexander H., Markatou, Marianthi, & Ray, Bonnie. Distance Metrics and Clustering Methods for Mixed-type Data. United States. https://doi.org/10.1111/insr.12274
Foss, Alexander H., Markatou, Marianthi, and Ray, Bonnie. 2018. "Distance Metrics and Clustering Methods for Mixed-type Data". United States. https://doi.org/10.1111/insr.12274. https://www.osti.gov/servlets/purl/1459931.
@article{osti_1459931,
title = {Distance Metrics and Clustering Methods for Mixed-type Data},
author = {Foss, Alexander H. and Markatou, Marianthi and Ray, Bonnie},
abstractNote = {In spite of the abundance of clustering techniques and algorithms, clustering mixed interval (continuous) and categorical (nominal and/or ordinal) scale data remains a challenging problem. To identify the most effective approaches for clustering mixed-type data, we use both theoretical and empirical analyses to present a critical review of the strengths and weaknesses of the methods identified in the literature. Guidelines on which approaches to use under different scenarios are provided, along with potential directions for future research.},
doi = {10.1111/insr.12274},
journal = {International Statistical Review},
number = 1,
volume = 87,
place = {United States},
year = {2018},
month = {6}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 24 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: Simulation results: performance of k-modes clustering and latent class analysis (LCA) for various quantile splits of the data. A mixed-type data set was generated with two underlying clusters, and the interval scale variable was discretised using a quantile split with the number of bins shown along the x-axis. Performance measured by adjusted Rand index (ARI) is shown on the y-axis. Standard error of the mean ARI was less than 0.01 in all conditions. Maximum-likelihood estimation using the expectation-maximization algorithm with a Gaussian-multinomial mixture model with no discretisation of the interval scale variable performed with a mean ARI of 0.99 (standard error of 0.0004).
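
As an illustration of two ingredients named in the Figure 1 caption, the following Python sketch shows one simple way to discretise an interval scale variable with a quantile split and to score a clustering against reference labels with the adjusted Rand index. It is not the authors' simulation code: the function names `quantile_split` and `adjusted_rand_index` are hypothetical, and the rank-based binning rule (which can place tied values in different bins) is an assumption made for brevity.

```python
from collections import Counter
from math import comb


def quantile_split(values, n_bins):
    """Assign each value to one of n_bins roughly equal-count bins.

    Assumption: bins are formed from ranks, so tied values may fall
    into different bins. Returns a list of bin indices in 0..n_bins-1.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points.

    Requires at least two points. Equals 1.0 for identical partitions
    (up to relabeling) and is near 0 for unrelated partitions.
    """
    n = len(labels_a)
    # Pairs of points that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of points that share a cluster in each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

In a study like the one the caption describes, the binned variable would be passed, together with the categorical variables, to a categorical clustering method, and the resulting labels compared with the true cluster labels using the index above.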

Works referenced in this record:

Model-based clustering using copulas with applications
journal, July 2015


Contributions to the Mathematical Theory of Evolution
journal, January 1894

  • Pearson, K.
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 185, Issue 0
  • DOI: 10.1098/rsta.1894.0003

Distance functions for categorical and mixed variables
journal, May 2008


Comparing clusterings—an information based distance
journal, May 2007


On the null distribution of distance between two groups, using mixed continuous and categorical variables
journal, December 1984


poLCA: An R Package for Polytomous Variable Latent Class Analysis
journal, January 2011

  • Linzer, Drew A.; Lewis, Jeffrey B.
  • Journal of Statistical Software, Vol. 42, Issue 10
  • DOI: 10.18637/jss.v042.i10

Scalable algorithms for clustering large datasets with mixed type attributes
journal, January 2005

  • He, Zengyou; Xu, Xiaofei; Deng, Shengchun
  • International Journal of Intelligent Systems, Vol. 20, Issue 10
  • DOI: 10.1002/int.20108

Model-Based Gaussian and Non-Gaussian Clustering
journal, September 1993

  • Banfield, Jeffrey D.; Raftery, Adrian E.
  • Biometrics, Vol. 49, Issue 3
  • DOI: 10.2307/2532201

Estimating the Cluster Tree of a Density by Analyzing the Minimal Spanning Tree of a Sample
journal, May 2003


Cluster Validation by Prediction Strength
journal, September 2005

  • Tibshirani, Robert; Walther, Guenther
  • Journal of Computational and Graphical Statistics, Vol. 14, Issue 3
  • DOI: 10.1198/106186005X59243

Robust mixture modeling using multivariate skew t distributions
journal, May 2009


Model-Based Clustering
journal, October 2016


Maximum Likelihood from Incomplete Data Via the EM Algorithm
journal, September 1977

  • Dempster, A. P.; Laird, N. M.; Rubin, D. B.
  • Journal of the Royal Statistical Society: Series B (Methodological), Vol. 39, Issue 1
  • DOI: 10.1111/j.2517-6161.1977.tb01600.x

Clustering mixed data
journal, May 2011

  • Hunt, Lynette; Jorgensen, Murray
  • Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 1, Issue 4
  • DOI: 10.1002/widm.33

kamila: Clustering Mixed-Type Data in R and Hadoop
journal, January 2018

  • Foss, Alexander H.; Markatou, Marianthi
  • Journal of Statistical Software, Vol. 83, Issue 13
  • DOI: 10.18637/jss.v083.i13

One-Sample Likelihood Ratio Tests for Mixed Data
journal, January 2007


On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions
journal, January 1983

  • Chang, Wei-Chien
  • Applied Statistics, Vol. 32, Issue 3
  • DOI: 10.2307/2347949

A latent variables approach for clustering mixed binary and continuous variables within a Gaussian mixture model
journal, November 2011


A finite mixture model for the clustering of mixed-mode data
journal, April 1988


A Gamma mixture model better accounts for among site rate heterogeneity
journal, September 2005


Mixtures of Shifted Asymmetric Laplace Distributions
journal, June 2014

  • Franczak, Brian C.; Browne, Ryan P.; McNicholas, Paul D.
  • IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, Issue 6
  • DOI: 10.1109/TPAMI.2013.216

Model-based clustering with non-elliptically contoured distributions
journal, June 2008


The Elements of Statistical Learning
book, January 2009


Distance between populations using mixed continuous and categorical variables
journal, January 1983


Quadratic distances on probabilities: A unified foundation
journal, April 2008


Model-based clustering, classification, and discriminant analysis of data with mixed type
journal, November 2012

  • Browne, Ryan P.; McNicholas, Paul D.
  • Journal of Statistical Planning and Inference, Vol. 142, Issue 11
  • DOI: 10.1016/j.jspi.2012.05.001

Statistical Modelling of Data on Teaching Styles
journal, January 1981

  • Aitkin, Murray; Anderson, Dorothy; Hinde, John
  • Journal of the Royal Statistical Society. Series A (General), Vol. 144, Issue 4
  • DOI: 10.2307/2981826

Estimating the number of clusters in a data set via the gap statistic
journal, May 2001

  • Tibshirani, Robert; Walther, Guenther; Hastie, Trevor
  • Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 63, Issue 2, p. 411-423
  • DOI: 10.1111/1467-9868.00293

A new look at the statistical model identification
journal, December 1974


Top 10 algorithms in data mining
journal, December 2007

  • Wu, Xindong; Kumar, Vipin; Ross Quinlan, J.
  • Knowledge and Information Systems, Vol. 14, Issue 1
  • DOI: 10.1007/s10115-007-0114-2

A Population Background for Nonparametric Density-Based Clustering
journal, November 2015


Enhancing the selection of a model-based clustering with external categorical variables
journal, June 2014

  • Baudry, Jean-Patrick; Cardoso, Margarida; Celeux, Gilles
  • Advances in Data Analysis and Classification, Vol. 9, Issue 2
  • DOI: 10.1007/s11634-014-0177-3

Order selection in finite mixture models: complete or observed likelihood information criteria?
journal, June 2015

  • Hui, Francis K. C.; Warton, David I.; Foster, Scott D.
  • Biometrika, Vol. 102, Issue 3
  • DOI: 10.1093/biomet/asv027

A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets
journal, May 2011


What are the true clusters?
journal, October 2015


Stability-Based Validation of Clustering Solutions
journal, June 2004


Parsimonious skew mixture models for model-based clustering and classification
journal, March 2014


Exploratory Latent Structure Analysis Using Both Identifiable and Unidentifiable Models
journal, August 1974


Mean shift: a robust approach toward feature space analysis
journal, May 2002

  • Comaniciu, D.; Meer, P.
  • IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, Issue 5
  • DOI: 10.1109/34.1000236

Algorithm AS 136: A K-Means Clustering Algorithm
journal, January 1979

  • Hartigan, J. A.; Wong, M. A.
  • Applied Statistics, Vol. 28, Issue 1
  • DOI: 10.2307/2346830

Clustering via Nonparametric Density Estimation: The R Package pdfCluster
journal, January 2014

  • Azzalini, Adelchi; Menardi, Giovanna
  • Journal of Statistical Software, Vol. 57, Issue 11
  • DOI: 10.18637/jss.v057.i11

Mixtures of Continuous and Categorical Variables in Discriminant Analysis
journal, September 1980


The clustering of mixed-mode data: a comparison of possible approaches
journal, January 1990


A robust and scalable clustering algorithm for mixed type attributes in large database environment
conference, January 2001

  • Chiu, Tom; Fang, DongPing; Chen, John
  • Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '01
  • DOI: 10.1145/502512.502549

Multivariate Correlation Models with Mixed Discrete and Continuous Variables
journal, June 1961


Classification with discrete and continuous variables via general mixed-data models
journal, May 2011


Resampling Method for Unsupervised Estimation of Cluster Validity
journal, November 2001


Estimating the Dimension of a Model
journal, March 1978


Gifi Methods for Optimal Scaling in R: The Package homals
journal, January 2009

  • Leeuw, Jan de; Mair, Patrick
  • Journal of Statistical Software, Vol. 31, Issue 4
  • DOI: 10.18637/jss.v031.i04

Challenges of Big Data analysis
journal, February 2014

  • Fan, Jianqing; Han, Fang; Liu, Han
  • National Science Review, Vol. 1, Issue 2
  • DOI: 10.1093/nsr/nwt032

Extending mixtures of factor models using the restricted multivariate skew-normal distribution
journal, January 2016


Resampling approach for cluster model selection
journal, March 2011


Data clustering: 50 years beyond K-means
journal, June 2010


A General Coefficient of Similarity and Some of Its Properties
journal, December 1971


Mixtures of skew-t factor analyzers
journal, September 2014

  • Murray, Paula M.; Browne, Ryan P.; McNicholas, Paul D.
  • Computational Statistics & Data Analysis, Vol. 77
  • DOI: 10.1016/j.csda.2014.03.012

Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
journal, June 2015


Maximum likelihood estimation for multivariate skew normal mixture models
journal, February 2009


Assessing a mixture model for clustering with the integrated completed likelihood
journal, July 2000

  • Biernacki, C.; Celeux, G.; Govaert, G.
  • IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, Issue 7
  • DOI: 10.1109/34.865189

Model-Based Clustering, Discriminant Analysis, and Density Estimation
journal, June 2002

  • Fraley, Chris; Raftery, Adrian E.
  • Journal of the American Statistical Association, Vol. 97, Issue 458
  • DOI: 10.1198/016214502760047131

Comparing partitions
journal, December 1985

  • Hubert, Lawrence; Arabie, Phipps
  • Journal of Classification, Vol. 2, Issue 1
  • DOI: 10.1007/BF01908075

Clustering via nonparametric density estimation
journal, February 2007


Comparing clusterings: an axiomatic view
conference, January 2005

  • Meilă, Marina
  • Proceedings of the 22nd international conference on Machine learning - ICML '05
  • DOI: 10.1145/1102351.1102424

A semiparametric method for clustering mixed data
journal, July 2016


Variable selection for model-based clustering using the integrated complete-data likelihood
journal, May 2016


How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis
journal, August 1998


Clustering Methods Based on Likelihood Ratio Criteria
journal, June 1971

  • Scott, A. J.; Symons, M. J.
  • Biometrics, Vol. 27, Issue 2
  • DOI: 10.2307/2529003

Finite mixtures of multivariate skew t-distributions: some recent and new results
journal, October 2012


Multivariate Tests for Clusters
journal, September 1979


FlexMix Version 2: Finite Mixtures with Concomitant Variables and Varying and Constant Parameters
journal, January 2008

  • Grün, Bettina; Leisch, Friedrich
  • Journal of Statistical Software, Vol. 28, Issue 4
  • DOI: 10.18637/jss.v028.i04

How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification
journal, April 2013

  • Hennig, Christian; Liao, Tim F.
  • Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 62, Issue 3
  • DOI: 10.1111/j.1467-9876.2012.01066.x

Identifiability of parameters in latent structure models with many observed variables
journal, December 2009

  • Allman, Elizabeth S.; Matias, Catherine; Rhodes, John A.
  • The Annals of Statistics, Vol. 37, Issue 6A
  • DOI: 10.1214/09-AOS689

An examination of procedures for determining the number of clusters in a data set
journal, June 1985

  • Milligan, Glenn W.; Cooper, Martha C.
  • Psychometrika, Vol. 50, Issue 2
  • DOI: 10.1007/BF02294245

A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis
journal, October 1986


On Information and Sufficiency
journal, March 1951

  • Kullback, S.; Leibler, R. A.
  • The Annals of Mathematical Statistics, Vol. 22, Issue 1
  • DOI: 10.1214/aoms/1177729694

The location model for mixtures of categorical and continuous variables
journal, January 1993


Model selection for mixture-based clustering for ordinal data
journal, December 2016

  • Fernández, D.; Arnold, R.
  • Australian & New Zealand Journal of Statistics, Vol. 58, Issue 4
  • DOI: 10.1111/anzs.12179

An EM algorithm for a mixture model of count data
journal, May 1993


Applications of beta-mixture models in bioinformatics
journal, February 2005


Simultaneous model-based clustering and visualization in the Fisher discriminative subspace
journal, April 2011


Mixture separation for mixed-mode data
journal, March 1996

  • Lawrence, C. J.; Krzanowski, W. J.
  • Statistics and Computing, Vol. 6, Issue 1
  • DOI: 10.1007/BF00161577

A mixture of generalized hyperbolic distributions
journal, February 2015

  • Browne, Ryan P.; McNicholas, Paul D.
  • Canadian Journal of Statistics, Vol. 43, Issue 2
  • DOI: 10.1002/cjs.11246

Pattern Clustering by Multivariate Mixture Analysis
journal, April 1970


Determining the number of clusters using information entropy for mixed data
journal, June 2012


Advances in multidimensional integration
journal, December 2002


Generalization of the Mahalanobis Distance in the Mixed Case
journal, May 1995


Estimating the Mahalanobis Distance from Mixed Continuous and Discrete Data
journal, June 2000


Linear Fuzzy Clustering of Mixed Databases Based on Cluster-wise Optimal Scaling of Categorical Variables
conference, June 2007

  • Honda, Katsuhiro; Uesugi, Ryo; Ichihashi, Hidetomo
  • 2007 IEEE International Fuzzy Systems Conference
  • DOI: 10.1109/fuzzy.2007.4295398

Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms
journal, January 2010

  • Maitra, Ranjan; Melnykov, Volodymyr
  • Journal of Computational and Graphical Statistics, Vol. 19, Issue 2
  • DOI: 10.1198/jcgs.2009.08054

A mixture of generalized latent variable models for mixed mode and heterogeneous data
journal, November 2011

  • Cai, Jing-Heng; Song, Xin-Yuan; Lam, Kwok-Hap
  • Computational Statistics & Data Analysis, Vol. 55, Issue 11
  • DOI: 10.1016/j.csda.2011.05.011

Model based clustering for mixed data: clustMD
journal, February 2016

  • McParland, Damien; Gormley, Isobel Claire
  • Advances in Data Analysis and Classification, Vol. 10, Issue 2
  • DOI: 10.1007/s11634-016-0238-x

Exploratory latent structure analysis using both identifiable and unidentifiable models
journal, January 1974


A generalized Mahalanobis distance for mixed data
journal, January 2005


Friendship stability in adolescence is associated with ventral striatum responses to vicarious rewards
journal, January 2021

  • Schreuders, Elisabeth; Braams, Barbara R.; Crone, Eveline A.
  • Nature Communications, Vol. 12, Issue 1
  • DOI: 10.1038/s41467-020-20042-1

Shoot flammability of vascular plants is phylogenetically conserved and related to habitat fire-proneness and growth form
journal, April 2020


Machine learning-based prediction of glioma margin from 5-ALA induced PpIX fluorescence spectroscopy
journal, January 2020


Skillful statistical models to predict seasonal wind speed and solar radiation in a Yangtze River estuary case study
journal, May 2020


Optimization of probiotic therapeutics using machine learning in an artificial human gastrointestinal tract
journal, January 2021


Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
text, January 2015

  • Hennig, C.; Lin, Chien-Ju
  • Apollo - University of Cambridge Repository
  • DOI: 10.17863/cam.30531

The Elements of Statistical Learning
book, January 2001


Quadratic distances on probabilities: A unified foundation
text, January 2008


Mixtures of Shifted Asymmetric Laplace Distributions
text, January 2012


Mixtures of Skew-t Factor Analyzers
text, January 2013


Extending mixtures of factor models using the restricted multivariate skew-normal distribution
preprint, January 2013


Works referencing / citing this record:

Distance‐based clustering of mixed data
journal, November 2018

  • van de Velden, Michel; Iodice D'Enza, Alfonso; Markos, Angelos
  • Wiley Interdisciplinary Reviews: Computational Statistics, Vol. 11, Issue 3
  • DOI: 10.1002/wics.1456