Unsupervised Group Discovery and Link Prediction in Relational Datasets: a nonparametric Bayesian approach

Koutsourelakis, P

doi:10.2172/908093

Technical Report · Thu May 03 00:00:00 EDT 2007

DOI: https://doi.org/10.2172/908093 · OSTI ID: 908093

Clustering is one of the most common statistical procedures and a standard tool for pattern discovery and dimension reduction. Most often, the objects to be clustered are described by a set of measurements or observables, e.g., the coordinates of vectors or the attributes of people. In many cases, however, the available observations appear in the form of links or connections (e.g., communication or transaction networks). Such data contain valuable information that can be exploited to discover groups and better understand the structure of the dataset. Since several of these links are missing in most real-world datasets, it is also useful to develop procedures that can predict those unobserved connections. In this report we address the problem of unsupervised group discovery in relational datasets. A fundamental issue in all clustering problems is that the actual number of clusters is unknown a priori. In most cases this is addressed by running the model several times, assuming a different number of clusters each time, and selecting the value that provides the best fit based on some criterion (e.g., the Bayes factor in the case of Bayesian techniques). It would clearly be preferable to develop techniques in which the number of clusters is learned from the data along with the rest of the model parameters. For that purpose, we adopt a nonparametric Bayesian framework, which provides a very flexible modeling environment in which the size of the model, i.e., the number of clusters, can adapt to the available data and readily accommodate outliers. The latter is particularly important since several groups of interest might consist of a small number of members and would most likely be smeared out by traditional modeling techniques.
Finally, the proposed framework retains all the advantages of standard Bayesian techniques, such as integration of prior knowledge in a principled manner, seamless accommodation of missing data, and quantification of confidence in the output. In the first section of this report, we review the Infinite Relational Model (IRM), which serves as the basis for further developments. The IRM assumes that each object belongs to a single group. In subsequent sections we discuss two mixed-membership models, i.e., models which can account for the fact that an object can belong to several groups simultaneously, in which case we are also interested in the degree of membership. For that purpose it is perhaps more natural to speak of identities rather than groups. In particular, we assume that each object has an unknown identity which can consist of one or more components. The terms groups and identities are therefore treated as equivalent in subsequent sections. A sub-section is also devoted to variational techniques, which have the potential to accelerate the inference process. Finally, we discuss possible extensions to dynamic settings in which the available data include timestamps and the goal is to find how group sizes and group memberships evolve in time. Even though the majority of the presentation is restricted to objects of a single type (domain) and pairwise, binary links of a single type, it is shown that the proposed framework can be extended to links of various types between several domains.
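As an illustration of how the number of groups can be left open rather than fixed in advance, the following is a minimal generative sketch in Python. It rests on assumptions that the abstract does not state: group assignments are drawn from a Chinese restaurant process with concentration `alpha`, and each ordered pair of groups has a link probability drawn from a Beta(a, b) prior. The function names `sample_crp_assignments` and `sample_links` are hypothetical and are not taken from the report; the sketch covers only the single-membership, binary-link case and performs no inference.

```python
import random


def sample_crp_assignments(n_objects, alpha, rng):
    """Draw one group label per object from a Chinese restaurant process.

    Assumption for illustration: alpha > 0 is the concentration parameter.
    The number of groups is not an input; it grows as objects are added.
    """
    assignments = []
    counts = []  # counts[k] = number of objects currently in group k
    for _ in range(n_objects):
        # An existing group k is chosen with weight counts[k];
        # a brand-new group is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


def sample_links(assignments, n_groups, rng, a=1.0, b=1.0):
    """Draw binary, directed links given group labels.

    Assumption for illustration: the probability of a link from an object
    in group g to an object in group h is eta[g][h] ~ Beta(a, b).
    """
    eta = [[rng.betavariate(a, b) for _ in range(n_groups)]
           for _ in range(n_groups)]
    n = len(assignments)
    links = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-links in this sketch
                p = eta[assignments[i]][assignments[j]]
                links[i][j] = 1 if rng.random() < p else 0
    return links


rng = random.Random(0)
assignments, counts = sample_crp_assignments(30, alpha=1.5, rng=rng)
links = sample_links(assignments, len(counts), rng)
print("number of groups drawn:", len(counts))
print("group sizes:", counts)
```

Small groups arise naturally under this prior because a new group can be opened at any step, which is in keeping with the abstract's remark about accommodating groups with few members.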

Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: W-7405-ENG-48

OSTI ID: 908093

Report Number(s): UCRL-TR-230743; TRN: US200722%%419

Country of Publication: United States

Language: English

Similar Records

Measurement of the electroweak top quark production cross section and the CKM matrix element Vtb with the D0 experiment

Thesis/Dissertation · Mon Jun 29 00:00:00 EDT 2009 · OSTI ID:908093

Kirsch, Matthias

A Survey of Probabilistic Models for Relational Data

Technical Report · Fri Oct 13 00:00:00 EDT 2006 · OSTI ID:908093

Koutsourelakis, P S

Search for the Standard Model Higgs boson produced in association with a W Boson in the isolated-track charged-lepton channel using the Collider Detector at Fermilab

Thesis/Dissertation · Mon Aug 01 00:00:00 EDT 2011 · OSTI ID:908093

Buzatu, Adrian

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
COMMUNICATIONS
DIMENSIONS
FORECASTING
VECTORS
STATISTICS
