Exploiting Parallelism to Support Scalable Hierarchical Clustering
Rebecca Cathey, Eric C. Jensen, Steven M. Beitzel, Ophir Frieder, David Grossman
Information Retrieval Laboratory
Department of Computer Science
Illinois Institute of Technology
10 W. 31st Street
Chicago, IL 60616
{cathey,jensen,beitzel,frieder,grossman}@ir.iit.edu
Abstract
A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is
proposed to enable scaling the document clustering problem to large collections. Using standard message passing
operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments
using a subset of a standard TREC test collection, our parallel hierarchical clustering algorithm is shown to be
scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs
close to the expected $O(n^2/p)$ time on $p$ processors, rather than the worst-case $O(n^3/p)$ time. Furthermore,
the $O(n^2/p)$ memory complexity per node allows larger collections to be clustered as the number of nodes
increases. While partitioning
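
The idea of row-partitioned agglomerative clustering can be illustrated with a small single-process sketch. It is not the authors' implementation: the function names (`partition_rows`, `local_best`, `partitioned_hac`), the contiguous row partitioning, and the UPGMA-style average-linkage update are all assumptions made for illustration, and the "nodes" are simulated in a loop rather than run over message passing. Each simulated node scans only its own block of rows of the similarity matrix, which is the part that, in a truly distributed version, would let a node store roughly n/p rows of n entries each.

```python
import numpy as np

def partition_rows(n, p):
    """Split row indices 0..n-1 into p nearly equal contiguous blocks (assumed scheme)."""
    return np.array_split(np.arange(n), p)

def local_best(sim, rows, active):
    """Return the most similar active pair (similarity, i, j) with i in this node's rows."""
    best = (-np.inf, -1, -1)
    for i in rows:
        if not active[i]:
            continue
        for j in range(sim.shape[0]):
            if j != i and active[j] and sim[i, j] > best[0]:
                best = (sim[i, j], int(i), int(j))
    return best

def partitioned_hac(sim, p, k):
    """Merge clusters by average linkage until k remain, using p simulated nodes."""
    sim = np.array(sim, dtype=float)  # work on a copy
    n = sim.shape[0]
    active = [True] * n
    size = [1] * n
    members = {i: [i] for i in range(n)}
    blocks = partition_rows(n, p)
    merges = []
    remaining = n
    while remaining > k:
        # Each node proposes its best local pair; in a distributed setting
        # this list would be combined with a max-reduction across nodes.
        candidates = [local_best(sim, rows, active) for rows in blocks]
        s, i, j = max(candidates, key=lambda c: c[0])
        # Average-linkage update: size-weighted mean of the two merged rows.
        for m in range(n):
            if active[m] and m != i and m != j:
                new = (size[i] * sim[i, m] + size[j] * sim[j, m]) / (size[i] + size[j])
                sim[i, m] = sim[m, i] = new
        size[i] += size[j]
        members[i] += members.pop(j)
        active[j] = False
        merges.append((i, j, s))
        remaining -= 1
    return [sorted(c) for c in members.values()], merges

# Toy similarity matrix (made-up values): documents 0,1 and 2,3 form tight pairs.
toy_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters, merges = partitioned_hac(toy_sim, p=2, k=2)
```

Because the global choice is a maximum over per-node maxima, the merge sequence should not depend on the number of simulated nodes, apart from how exact ties are broken. The sketch keeps the whole matrix in one process for simplicity, so it shows the division of work rather than the memory savings.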

  

Source: Argamon, Shlomo - Department of Computer Science, Illinois Institute of Technology

 

Collections: Computer Technologies and Information Sciences