U.S. Department of Energy
Office of Scientific and Technical Information

Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

Conference
OSTI ID:986781

Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through a simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrate the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups, combined with the scalability potential and accelerator-based parallelization, are unique in the domain of document-based data mining, to the best of our knowledge.
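
The quadratic cost mentioned in the abstract can be illustrated with a minimal sketch of one flocking iteration. This is not the paper's implementation: the function names, the 2-D positions, the cosine similarity measure, and the `radius`, `sim_threshold`, and `step` parameters are all assumptions chosen for illustration. Each document is treated as an agent that moves toward nearby similar documents and away from nearby dissimilar ones, and every agent examines every other agent, which gives n(n-1) pairwise checks per iteration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flocking_step(positions, features, radius=1.0, sim_threshold=0.5, step=0.1):
    """One illustrative flocking iteration (hypothetical rules, not the paper's).

    positions: list of (x, y) tuples, one per document
    features:  list of feature vectors, one per document
    Returns the updated positions and the number of pairwise checks performed.
    """
    n = len(positions)
    new_positions = []
    comparisons = 0
    for i in range(n):
        xi, yi = positions[i]
        dx = dy = 0.0
        # All-pairs scan: this inner loop is the source of the O(n^2) cost.
        for j in range(n):
            if i == j:
                continue
            comparisons += 1
            xj, yj = positions[j]
            ox, oy = xj - xi, yj - yi
            dist = math.hypot(ox, oy)
            if dist == 0.0 or dist > radius:
                continue  # only neighbors within the radius influence agent i
            if cosine_similarity(features[i], features[j]) >= sim_threshold:
                dx += ox  # similar document: move toward it
                dy += oy
            else:
                dx -= ox  # dissimilar document: move away from it
                dy -= oy
        new_positions.append((xi + step * dx, yi + step * dy))
    return new_positions, comparisons
```

Because the update for each agent depends only on the previous positions, the outer loop is independent across agents. That independence is what makes this kind of computation a natural candidate for data-parallel hardware such as GPUs, although the sketch says nothing about how the paper distributes work across nodes.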

Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
ORNL LDRD Seed-Money
DOE Contract Number:
AC05-00OR22725
OSTI ID:
986781
Country of Publication:
United States
Language:
English

Similar Records

Flocking-based Document Clustering on the Graphics Processing Unit
Conference · Mon Dec 31 23:00:00 EST 2007 · OSTI ID:932628

A Flocking Based algorithm for Document Clustering Analysis
Journal Article · Sat Dec 31 23:00:00 EST 2005 · Journal of System Architecture · OSTI ID:1003223

Graphics Processing Unit Enhanced Parallel Document Flocking Clustering
Conference · Thu Dec 31 23:00:00 EST 2009 · OSTI ID:986787