Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach
Abstract
Clustering is one of the most common statistical procedures and a standard tool for pattern discovery and dimension reduction. Most often the objects to be clustered are described by a set of measurements or observables, e.g. the coordinates of vectors or the attributes of people. In many cases, however, the available observations appear in the form of links or connections (e.g. communication or transaction networks). These data contain valuable information that can be exploited to discover groups and better understand the structure of the dataset. Since several of these links are missing in most real-world datasets, it is also useful to develop procedures that can predict those unobserved connections. In this report we address the problem of unsupervised group discovery in relational datasets. A fundamental issue in all clustering problems is that the actual number of clusters is unknown a priori. In most cases this is addressed by running the model several times, assuming a different number of clusters each time, and selecting the value that provides the best fit based on some criterion (i.e. the Bayes factor in the case of Bayesian techniques). It is easily understood that it would be preferable to develop …
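To make the idea of an unknown, data-driven number of clusters concrete, the following is a minimal sketch of sampling a partition from a Chinese restaurant process (CRP), a prior over partitions commonly used in nonparametric Bayesian clustering. The abstract does not name the prior used in the report, so the function name `sample_crp`, the concentration parameter `alpha`, and the choice of the CRP itself are assumptions made for illustration only.

```python
import random


def sample_crp(n_objects, alpha, seed=None):
    """Draw one partition of n_objects from a CRP with concentration alpha.

    Illustrative sketch only; not taken from the report.
    Returns (assignments, sizes): the cluster index of each object and
    the number of objects in each cluster.
    """
    if n_objects < 0 or alpha <= 0:
        raise ValueError("n_objects must be >= 0 and alpha must be > 0")
    rng = random.Random(seed)
    assignments = []
    sizes = []
    for _ in range(n_objects):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, sizes


if __name__ == "__main__":
    labels, cluster_sizes = sample_crp(20, alpha=1.0, seed=0)
    print(labels)
    print(cluster_sizes)
```

The number of clusters is not fixed in advance: it is simply the length of `sizes` after sampling, and larger values of `alpha` tend to produce more clusters, including small ones.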
 Authors:
 Koutsourelakis, P.
 Publication Date:
 2007-05-03
 Research Org.:
 Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
 Sponsoring Org.:
 USDOE
 OSTI Identifier:
 908093
 Report Number(s):
 UCRL-TR-230743
TRN: US200722%%419
 DOE Contract Number:
 W-7405-ENG-48
 Resource Type:
 Technical Report
 Country of Publication:
 United States
 Language:
 English
 Subject:
 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMMUNICATIONS; DIMENSIONS; FORECASTING; VECTORS; STATISTICS
Citation Formats
Koutsourelakis, P. Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach. United States: N. p., 2007. Web. doi:10.2172/908093.
Koutsourelakis, P. (2007). Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach. United States. doi:10.2172/908093.
Koutsourelakis, P. 2007. "Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach". United States. doi:10.2172/908093. https://www.osti.gov/servlets/purl/908093.
@article{osti_908093,
title = {Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach},
author = {Koutsourelakis, P},
abstractNote = {Clustering is one of the most common statistical procedures and a standard tool for pattern discovery and dimension reduction. Most often the objects to be clustered are described by a set of measurements or observables, e.g. the coordinates of vectors or the attributes of people. In many cases, however, the available observations appear in the form of links or connections (e.g. communication or transaction networks). These data contain valuable information that can be exploited to discover groups and better understand the structure of the dataset. Since several of these links are missing in most real-world datasets, it is also useful to develop procedures that can predict those unobserved connections. In this report we address the problem of unsupervised group discovery in relational datasets. A fundamental issue in all clustering problems is that the actual number of clusters is unknown a priori. In most cases this is addressed by running the model several times, assuming a different number of clusters each time, and selecting the value that provides the best fit based on some criterion (i.e. the Bayes factor in the case of Bayesian techniques). It is easily understood that it would be preferable to develop techniques in which the number of clusters is essentially learned from the data along with the rest of the model parameters. For that purpose, we adopt a nonparametric Bayesian framework which provides a very flexible modeling environment in which the size of the model, i.e. the number of clusters, can adapt to the available data and readily accommodate outliers. The latter is particularly important since several groups of interest might consist of a small number of members and would most likely be smeared out by traditional modeling techniques.
Finally, the proposed framework combines all the advantages of standard Bayesian techniques, such as integration of prior knowledge in a principled manner, seamless accommodation of missing data, quantification of confidence in the output, etc. In the first section of this report, we review the Infinite Relational Model (IRM), which serves as the basis for further developments. The IRM assumes that each object belongs to a single group. In subsequent sections we discuss two mixed-membership models, i.e. models which can account for the fact that an object can belong to several groups simultaneously, in which case we are also interested in the degree of membership. For that purpose it is perhaps more natural to talk in terms of identities rather than groups. In particular, we assume that each object has an unknown identity which can consist of one or more components. The terms groups and identities are therefore considered equivalent in subsequent sections. A subsection is also devoted to variational techniques, which have the potential of accelerating the inference process. Finally, we discuss possible extensions to dynamic settings in which the available data include timestamps and the goal is to find how group sizes and group memberships evolve in time. Even though the majority of the presentation is restricted to objects of a single type (domain) and pairwise, binary links of a single type, it is shown that the proposed framework can be extended to links of various types between several domains.},
doi = {10.2172/908093},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2007},
month = {May}
}

We introduce a conservative test for quantifying the consistency of two or more datasets. The test is based on the Bayesian answer to the question, "How much more probable is it that all my data were generated from the same model system than if each dataset were generated from an independent set of model parameters?" The behavior of the resulting odds ratio is demonstrated with a two-dimensional toy example. We make explicit the connection between evidence ratios and the differences in peak chi-squared values, the latter of which is more widely used and more cheaply calculated. Calculating evidence ratios for three …

Hybrid Multidimensional Relational and Link Analytical Knowledge Discovery for Law Enforcement
The challenges facing the Department of Homeland Security (DHS) require not only multidimensional, but also multiscale data analysis. In particular, the ability to seamlessly move from summary information, such as trends, into detailed analysis of individual entities, while critical for law enforcement, typically requires manually transferring information among multiple tools. Such time-consuming and error-prone processes significantly hamper the analysts' ability to quickly explore data and identify threats. As part of a DHS Science and Technology effort, we have been developing and deploying for Immigration and Customs Enforcement the CubeLink system integrating information between relational data cubes and link analytical …

Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction
The work of Currin et al. and others in developing "fast predictive approximations" of computer models is extended to the case in which derivatives of the output variable of interest with respect to input variables are available. In addition to describing the calculations required for the Bayesian analysis, the issue of experimental design is also discussed, and an algorithm is described for constructing "maximin distance" designs. An example is given based on a demonstration model of eight inputs and one output, in which predictions based on a maximin design, a Latin hypercube design, and two "compromise" designs are evaluated and …
A Nonparametric Bayesian Approach For Emission Tomography Reconstruction
We introduce a PET reconstruction algorithm following a nonparametric Bayesian (NPB) approach. In contrast with Expectation Maximization (EM), the proposed technique does not rely on any space discretization. Namely, the activity distribution (the normalized emission intensity of the spatial Poisson process) is considered as a spatial probability density, and the observations are the projections of random emissions whose distribution has to be estimated. This approach is nonparametric in the sense that the quantity of interest belongs to the set of probability measures on R^k (for reconstruction in k dimensions), and it is Bayesian in the sense that we define a prior directly on this …