Summary: Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
Ruoming Jin, Ge Yang, and Gagan Agrawal, Member, IEEE Computer Society
Abstract--Decision tree construction is a well-studied data mining problem. In this paper, we focus on shared memory parallelization
of decision tree construction. In our previous work, we have developed a middleware and a set of parallelization techniques applicable
to a variety of data mining algorithms. The specific techniques we have developed include full replication, optimized full locking, and
cache-sensitive locking. This paper reports on using our framework and these techniques for developing a shared memory parallel
implementation of the RainForest approach originally proposed by Gehrke et al. Our work has led to two important observations.
First, we are able to parallelize a decision tree construction algorithm in a way that is very similar to the parallelization of association
mining and clustering algorithms. Second, our experiments show that applying a combination of techniques results in the best
performance. Specifically, using replication for all attributes at upper levels of the tree and for categorical attributes at all levels, and
locking for continuous attributes at deeper levels resulted in the highest speedups.
Index Terms--Shared memory parallelization, programming interfaces, association mining, clustering, decision tree construction.
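
To make the distinction between replication and locking concrete, the following is a minimal sketch, not the middleware described in this paper. It assumes the shared state is a table of (attribute value, class label) counts, the kind of sufficient statistic used when building a decision tree. The function names `count_full_replication` and `count_full_locking` are hypothetical, and the sketch does not model the memory layout of locks that distinguishes optimized full locking from cache-sensitive locking.

```python
import threading
from collections import Counter


def count_full_replication(records, num_threads):
    """Replication sketch: each thread updates a private copy of the
    count table, and the copies are merged after all threads finish."""
    replicas = [Counter() for _ in range(num_threads)]

    def worker(tid):
        local = replicas[tid]
        # Each thread processes a disjoint, strided slice of the records.
        for attr_value, class_label in records[tid::num_threads]:
            local[(attr_value, class_label)] += 1  # private copy, no lock

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge step: sum the per-thread copies into one table.
    merged = Counter()
    for replica in replicas:
        merged.update(replica)
    return merged


def count_full_locking(records, num_threads, keys):
    """Locking sketch: all threads share one count table, and every
    element of the table is protected by its own lock."""
    shared = {key: 0 for key in keys}
    locks = {key: threading.Lock() for key in keys}

    def worker(tid):
        for attr_value, class_label in records[tid::num_threads]:
            key = (attr_value, class_label)
            with locks[key]:  # serialize updates to this element only
                shared[key] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Both functions produce the same counts. The trade-off is that replication pays for extra memory and a merge step, while locking pays for synchronization on every update. This is one plausible reading of why the abstract reports the best results from mixing the two, depending on attribute type and tree depth.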
With the availability of large data sets in application
areas like bioinformatics, medical informatics, scientific
data analysis, financial analysis, telecommunications,
retailing, and marketing, it is becoming increasingly