OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Abstract

© 2019 Association for Computing Machinery. The goal of this work is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were investigated previously on single variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, together with dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that the ability to overcome missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the usage of t-distributed stochastic neighborhood embedding to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
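
The role of the missing value substitution cost mentioned in the abstract can be illustrated with a minimal sketch of an optimal matching ("edit") distance for a single categorical sequence. This is not the authors' implementation: the function name `om_distance`, the `MISSING` marker, and all default costs are hypothetical choices made for illustration only. Two aligned missing states are assumed to cost nothing, which is one way leading gaps can make otherwise unrelated sequences look similar.

```python
from typing import Hashable, Optional, Sequence

# Hypothetical marker for a missing state (an assumption, not from the paper).
MISSING = None


def om_distance(
    a: Sequence[Optional[Hashable]],
    b: Sequence[Optional[Hashable]],
    sub_cost: float = 2.0,      # assumed cost of substituting two observed states
    indel_cost: float = 1.0,    # assumed cost of one insertion or deletion
    missing_cost: float = 1.0,  # assumed cost of substituting missing <-> observed
) -> float:
    """Optimal matching distance computed by dynamic programming."""
    n, m = len(a), len(b)
    # prev[j] holds the cost of aligning the first i-1 states of a with b[:j].
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                cost = 0.0  # identical states, including missing vs. missing
            elif x is MISSING or y is MISSING:
                cost = missing_cost
            else:
                cost = sub_cost
            curr[j] = min(
                prev[j] + indel_cost,      # delete a state from a
                curr[j - 1] + indel_cost,  # insert a state from b
                prev[j - 1] + cost,        # substitute (or match)
            )
        prev = curr
    return prev[m]


# Toy sequences: "S" and "M" are arbitrary illustrative state labels.
with_gaps = [MISSING, MISSING, "S", "M"]
complete = ["S", "S", "S", "M"]
d_free = om_distance(with_gaps, complete, missing_cost=0.0)
d_penalized = om_distance(with_gaps, complete, missing_cost=1.0)
```

In this sketch, raising `missing_cost` increases the distance between a sequence with leading gaps and a complete one, which is the kind of parameter the abstract describes tuning. A joint (multivariate) measure could, under the same assumptions, combine such per-variable distances, but how the paper weights them is not stated in this record.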

Authors:
Lazar, Alina [1]; Jin, Ling [2]; Spurlock, C. Anna [2]; Wu, Kesheng [2]; Sim, Alex [2]; Todd, Annika [2]
  1. Youngstown State University, Youngstown, OH
  2. Lawrence Berkeley National Laboratory, Berkeley, CA
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1526589
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
Journal of Data and Information Quality
Additional Journal Information:
Journal Volume: 11; Journal Issue: 2; Journal ID: ISSN 1936-1955
Country of Publication:
United States
Language:
English

Citation Formats

Lazar, Alina, Jin, Ling, Spurlock, C. Anna, Wu, Kesheng, Sim, Alex, and Todd, Annika. Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization. United States: N. p., 2019. Web. doi:10.1145/3301294.
Lazar, Alina, Jin, Ling, Spurlock, C. Anna, Wu, Kesheng, Sim, Alex, & Todd, Annika. Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization. United States. doi:10.1145/3301294.
Lazar, Alina, Jin, Ling, Spurlock, C. Anna, Wu, Kesheng, Sim, Alex, and Todd, Annika. Wed . "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization". United States. doi:10.1145/3301294.
@article{osti_1526589,
title = {Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization},
author = {Lazar, Alina and Jin, Ling and Spurlock, C. Anna and Wu, Kesheng and Sim, Alex and Todd, Annika},
abstractNote = {© 2019 Association for Computing Machinery. The goal of this work is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were investigated previously on single variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, together with dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that the ability to overcome missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the usage of t-distributed stochastic neighborhood embedding to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.},
doi = {10.1145/3301294},
journal = {Journal of Data and Information Quality},
issn = {1936-1955},
number = 2,
volume = 11,
place = {United States},
year = {2019},
month = {3}
}

Works referenced in this record:

Optimal Matching Methods for Historical Sequences
journal, January 1986

  • Abbott, Andrew; Forrest, John
  • Journal of Interdisciplinary History, Vol. 16, Issue 3
  • DOI: 10.2307/204500

Standardization of pathways to adulthood? an analysis of Dutch cohorts born between 1850 and 1900
journal, November 2010

  • Bras, Hilde; Liefbroer, Aart C.; Elzinga, Cees H.
  • Demography, Vol. 47, Issue 4
  • DOI: 10.1007/BF03213737

A review on time series data mining
journal, February 2011


Multichannel Sequence Analysis Applied to Social Science Data
journal, July 2010


Mixed Methods Research: A Research Paradigm Whose Time Has Come
journal, October 2004


Sociology and political arithmetic: some principles of a new policy science
journal, March 2004


Clustering of time series data—a survey
journal, November 2005


A general method applicable to the search for similarities in the amino acid sequence of two proteins
journal, March 1970


Joint Sequence Analysis: Association and Clustering
journal, January 2015


Holistic trajectories: a study of combined employment, housing and family careers by using multiple-sequence analysis
journal, January 2007


Migration and reproduction in an urbanizing context. Family life courses in 19th century Antwerp and Geneva
journal, April 2013

  • Schumacher, Reto; Matthijs, Koen; Moreels, Sarah
  • Revue Quetelet/Quetelet Journal, Vol. 1, Issue 1
  • DOI: 10.14428/rqj2013.01.01.03

Residential Trajectories: Using Optimal Alignment to Reveal The Structure of Residential Mobility
journal, May 2004


What matters in differences between life trajectories: a comparative review of sequence dissimilarity measures
journal, July 2015

  • Studer, Matthias; Ritschard, Gilbert
  • Journal of the Royal Statistical Society: Series A (Statistics in Society), Vol. 179, Issue 2
  • DOI: 10.1111/rssa.12125

The String-to-String Correction Problem
journal, January 1974


The de-standardization of the life course: Are men and women equal?
journal, March 2009


Policy and sociology
journal, March 2004