OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach

Abstract

Clustering is one of the most common statistical procedures and a standard tool for pattern discovery and dimension reduction. Most often the objects to be clustered are described by a set of measurements or observables, e.g. the coordinates of vectors or the attributes of people. In many cases, however, the available observations appear in the form of links or connections (e.g. communication or transaction networks). Such data contain valuable information that can in general be exploited to discover groups and better understand the structure of the dataset. Since several of these links are missing in most real-world datasets, it is also useful to develop procedures that can predict those unobserved connections. In this report we address the problem of unsupervised group discovery in relational datasets. A fundamental issue in all clustering problems is that the actual number of clusters is unknown a priori. In most cases this is addressed by running the model several times, assuming a different number of clusters each time, and selecting the value that provides the best fit based on some criterion (i.e. the Bayes factor in the case of Bayesian techniques). It would be preferable to develop techniques in which the number of clusters is essentially learned from the data along with the rest of the model parameters. For that purpose, we adopt a nonparametric Bayesian framework, which provides a very flexible modeling environment in which the size of the model, i.e. the number of clusters, can adapt to the available data and readily accommodate outliers. The latter is particularly important since several groups of interest might consist of a small number of members and would most likely be smeared out by traditional modeling techniques.
Finally, the proposed framework combines all the advantages of standard Bayesian techniques, such as integration of prior knowledge in a principled manner, seamless accommodation of missing data, and quantification of confidence in the output. In the first section of this report, we review the Infinite Relational Model (IRM), which serves as the basis for further developments. The IRM assumes that each object belongs to a single group. In subsequent sections we discuss two mixed-membership models, i.e. models which can account for the fact that an object can belong to several groups simultaneously, in which case we are also interested in the degree of membership. For that purpose it is perhaps more natural to speak of identities rather than groups. In particular, we assume that each object has an unknown identity which can consist of one or more components. The terms groups and identities are therefore considered equivalent in subsequent sections. A sub-section is also devoted to variational techniques, which have the potential to accelerate the inference process. Finally, we discuss possible extensions to dynamic settings in which the available data include timestamps and the goal is to find how group sizes and group memberships evolve in time. Even though the majority of the presentation is restricted to objects of a single type (domain) and pairwise, binary links of a single type, it is shown that the proposed framework can be extended to links of various types between several domains.
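
As an illustration of how the number of groups can be learned rather than fixed in advance, the sketch below samples from the generative process commonly associated with the IRM: a Chinese restaurant process prior over partitions of the objects, and a Beta-distributed link probability for each ordered pair of groups. This is an assumption about the standard construction, not the report's own code or notation; the function names, the concentration parameter `alpha`, and the Beta hyperparameters `beta_a` and `beta_b` are hypothetical.

```python
import random


def sample_crp_partition(n_objects, alpha, rng):
    """Draw group labels for n_objects from a Chinese restaurant process.

    alpha is an assumed concentration parameter: larger values make
    new groups more likely, so the number of groups is not fixed.
    """
    labels = []
    counts = []  # counts[k] = number of objects already in group k
    for _ in range(n_objects):
        # Existing group k has weight counts[k]; a new group has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new group
        else:
            counts[k] += 1
        labels.append(k)
    return labels


def sample_irm_links(labels, beta_a, beta_b, rng):
    """Draw a binary link matrix given group labels.

    Each ordered pair of groups gets its own link probability,
    drawn from a Beta(beta_a, beta_b) prior (assumed hyperparameters).
    """
    n_groups = max(labels) + 1
    eta = [[rng.betavariate(beta_a, beta_b) for _ in range(n_groups)]
           for _ in range(n_groups)]
    n = len(labels)
    links = [[1 if rng.random() < eta[labels[i]][labels[j]] else 0
              for j in range(n)]
             for i in range(n)]
    return links, eta


rng = random.Random(0)
labels = sample_crp_partition(20, alpha=1.0, rng=rng)
links, eta = sample_irm_links(labels, beta_a=0.5, beta_b=0.5, rng=rng)
print("number of groups drawn:", max(labels) + 1)
```

The sketch only draws synthetic data from the prior. Inference, i.e. recovering the labels from an observed link matrix with some links missing, would require a sampler or a variational scheme and is not shown here.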

Authors:
Koutsourelakis, P.
Publication Date:
2007-05-03
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
908093
Report Number(s):
UCRL-TR-230743
TRN: US200722%%419
DOE Contract Number:
W-7405-ENG-48
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMMUNICATIONS; DIMENSIONS; FORECASTING; VECTORS; STATISTICS

Citation Formats

Koutsourelakis, P. Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach. United States: N. p., 2007. Web. doi:10.2172/908093.
Koutsourelakis, P. Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach. United States. doi:10.2172/908093.
Koutsourelakis, P. 2007. "Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach". United States. doi:10.2172/908093. https://www.osti.gov/servlets/purl/908093.
@article{osti_908093,
title = {Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach},
author = {Koutsourelakis, P},
doi = {10.2172/908093},
place = {United States},
year = {2007},
month = {5}
}

  • We introduce a conservative test for quantifying the consistency of two or more datasets. The test is based on the Bayesian answer to the question, "How much more probable is it that all my data were generated from the same model system than if each dataset were generated from an independent set of model parameters?" The behavior of the resulting odds ratio is demonstrated with a two-dimensional toy example. We make explicit the connection between evidence ratios and the differences in peak chi-squared values, the latter of which is more widely used and more cheaply calculated. Calculating evidence ratios for three cosmological datasets (recent CMB data (WMAP, ACBAR, CBI, VSA), SDSS and the most recent SNe Type Ia data) we find that concordance is favored and the tightening of constraints on cosmological parameters is indeed justified.
  • The challenges facing the Department of Homeland Security (DHS) require not only multi-dimensional, but also multi-scale data analysis. In particular, the ability to seamlessly move from summary information, such as trends, into detailed analysis of individual entities, while critical for law enforcement, typically requires manually transferring information among multiple tools. Such time-consuming and error-prone processes significantly hamper the analysts' ability to quickly explore data and identify threats. As part of a DHS Science and Technology effort, we have been developing and deploying for Immigration and Customs Enforcement the CubeLink system integrating information between relational data cubes and link analytical semantic graphs. In this paper we describe CubeLink in terms of the underlying components, their integration, and the formal mapping from multidimensional data analysis into link analysis. In so doing, we provide a formal basis for one particular form of automatic schema-ontology mapping from OLAP data cubes to semantic graph databases, and point the way towards future "intelligent" OLAP data cubes equipped with meta-data about their dimensional typing.
  • Abstract not provided.
  • The work of Currin et al. and others in developing "fast predictive approximations" of computer models is extended for the case in which derivatives of the output variable of interest with respect to input variables are available. In addition to describing the calculations required for the Bayesian analysis, the issue of experimental design is also discussed, and an algorithm is described for constructing "maximin distance" designs. An example is given based on a demonstration model of eight inputs and one output, in which predictions based on a maximin design, a Latin hypercube design, and two "compromise" designs are evaluated and compared. 12 refs., 2 figs., 6 tabs.
  • We introduce a PET reconstruction algorithm following a nonparametric Bayesian (NPB) approach. In contrast with Expectation Maximization (EM), the proposed technique does not rely on any space discretization. Namely, the activity distribution (the normalized emission intensity of the spatial Poisson process) is considered as a spatial probability density, and observations are the projections of random emissions whose distribution has to be estimated. This approach is nonparametric in the sense that the quantity of interest belongs to the set of probability measures on R^k (for reconstruction in k dimensions), and it is Bayesian in the sense that we define a prior directly on this spatial measure. In this context, we propose to model the nonparametric probability density as an infinite mixture of multivariate normal distributions. As a prior for this mixture we consider a Dirichlet Process Mixture (DPM) with a Normal-Inverse Wishart (NIW) model as base distribution of the Dirichlet Process. As in EM-family reconstruction, we use a data augmentation scheme where the hidden variables are the emission locations for each observed line of response in the continuous object space. Thanks to the data augmentation, we propose a Markov Chain Monte Carlo (MCMC) algorithm (Gibbs sampler) which is able to generate draws from the posterior distribution of the spatial intensity. One difference from EM is that one step of the Gibbs sampler corresponds to the generation of emission locations, while only the expected number of emissions per pixel/voxel is used in EM. Another key difference is that the estimated spatial intensity is a continuous function, so there is no need to compute a projection matrix. Finally, draws from the intensity posterior distribution allow the estimation of posterior functionals such as the variance or confidence intervals. Results are presented for simulated data based on a 2D brain phantom and compared to Bayesian MAP-EM.