Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Dudu Lazarov, Gil David, Amir Averbuch
School of Computer Science, Tel-Aviv University
Tel-Aviv 69978, Israel
Finding useful related patterns in a dataset is an important task in many
applications. In particular, many algorithms need the ability to separate a
given dataset into a small number of clusters. Each cluster represents a
subset of data points from the dataset that are considered similar. In some cases,
it is also necessary to distinguish data points that are not part of a pattern from the
other data points.
This paper introduces a new data clustering method named smart-sample and
compares its performance to several clustering methodologies. We show that smart-sample
successfully clusters large high-dimensional datasets. In addition, smart-sample
outperforms the other methodologies in terms of running time.
A variation of the smart-sample algorithm, which guarantees efficiency in terms of
I/O, is also presented. We describe how to achieve an approximation of the in-memory
smart-sample algorithm using a constant number of scans with a single sort operation
on the disk.


Source: Averbuch, Amir - School of Computer Science, Tel Aviv University


Collections: Computer Technologies and Information Sciences