Summary: Exploiting Parallelism to Support Scalable Hierarchical
Clustering
Rebecca Cathey, Eric C. Jensen, Steven M. Beitzel, Ophir Frieder, David Grossman
Information Retrieval Laboratory
Department of Computer Science
Illinois Institute of Technology
10 W. 31st Street
Chicago, IL 60616
{cathey,jensen,beitzel,frieder,grossman}@ir.iit.edu
Abstract
A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is proposed
to enable scaling the document clustering problem to large collections. Using standard message passing operations
reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using
a subset of a standard TREC test collection, our parallel hierarchical clustering algorithm is shown to be scalable in
terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the
expected $O(n^2/p)$ time on $p$ processors, rather than the worst-case $O(n^3/p)$ time. Furthermore, the $O(n^2/p)$ memory
complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning