Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
Ruoming Jin, Ge Yang, and Gagan Agrawal, Member, IEEE Computer Society
Abstract: With recent technological advances, shared memory parallel machines have become more scalable, and offer large main
memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we
focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data
mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike
previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of
popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We
show how our runtime system can apply any of the techniques we have developed starting from a common specification of the
algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have
experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree
construction. The main results from our experiments are as follows. 1) Among full replication, optimized full locking, and cache-
sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset
parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved
for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is
within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be
crucial for achieving high performance.
Index Terms: Shared memory parallelization, programming interfaces, association mining, clustering, decision tree construction.
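
The abstract names its parallelization techniques without defining them. As a rough illustration only, the sketch below shows one plausible reading of three of them, applied to a shared array of counters of the kind association mining updates. It assumes that full replication gives each thread a private copy that is merged at the end, that full locking uses one lock per element, and that fixed locking shares a small fixed pool of locks. The function names, the element-to-lock mapping, and the counter example are hypothetical and are not taken from the paper's interface. Optimized full locking and cache-sensitive locking are left out because they appear to concern memory layout, which this sketch does not model.

```python
import threading


def _run_threads(worker, num_threads):
    """Start one thread per data chunk and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def full_replication(chunks, num_keys):
    """Each thread updates a private copy; the copies are merged afterwards."""
    copies = [[0] * num_keys for _ in chunks]

    def worker(tid):
        local = copies[tid]
        for key in chunks[tid]:
            local[key] += 1  # private copy, so no lock is needed

    _run_threads(worker, len(chunks))
    # Global reduction: sum the per-thread copies element by element.
    return [sum(copy[k] for copy in copies) for k in range(num_keys)]


def fixed_locking(chunks, num_keys, num_locks=2):
    """One shared copy, protected by a fixed pool of locks (assumed reading)."""
    shared = [0] * num_keys
    locks = [threading.Lock() for _ in range(num_locks)]

    def worker(tid):
        for key in chunks[tid]:
            # Assumed mapping: element k is guarded by lock k mod num_locks.
            with locks[key % num_locks]:
                shared[key] += 1

    _run_threads(worker, len(chunks))
    return shared


def full_locking(chunks, num_keys):
    """One shared copy with one lock per element (assumed reading)."""
    return fixed_locking(chunks, num_keys, num_locks=num_keys)


if __name__ == "__main__":
    # Three threads, each counting occurrences of item ids 0..3 in its chunk.
    chunks = [[0, 1, 1, 2], [2, 2, 3], [0, 3, 3, 3]]
    for strategy in (full_replication, full_locking, fixed_locking):
        print(strategy.__name__, strategy(chunks, 4))
```

All three functions return the same counts for the sample input, [2, 2, 3, 4]. They differ only in how they trade memory for synchronization: replication uses one copy per thread and no locks, while the locking variants keep a single copy and pay for lock acquisition on every update.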


Source: Agrawal, Gagan - Department of Computer Science and Engineering, Ohio State University


Collections: Computer Technologies and Information Sciences