U.S. Department of Energy
Office of Scientific and Technical Information

Fast and Accurate Support Vector Machines on Large Scale Systems

Conference

The Support Vector Machine (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm that has become ubiquitous, largely due to its high accuracy and insensitivity to dimensionality. The objective of SVM is to find an optimal boundary (also known as a hyperplane) that separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset to find the boundary, which costs performance. In this paper, we propose a novel distributed memory algorithm that eliminates the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, ranging from early (aggressive) to late (conservative) elimination of samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively, potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures so that the proposed algorithm maintains its accuracy. We consider the trade-offs between single and multiple synchronization using an in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm, the de facto sequential SVM software, which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets, such as the UCI HIGGS Boson dataset and the Offending URL dataset.
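The idea of shrinking can be illustrated with a small, sequential sketch. It is not the paper's distributed MPI algorithm: it assumes a linear kernel with no bias term, uses dual coordinate descent as the solver, and runs on made-up grid data. The names `train_linear_svm`, `shrink_after`, `margin_violators`, and `make_grid_data` are hypothetical and chosen only for this example. A sample is dropped from the active set once it has sat at a zero multiplier with a satisfied margin for `shrink_after` consecutive passes, so a small value plays the role of early (aggressive) elimination and a large value the role of late (conservative) elimination.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def make_grid_data(k=10, step=0.5):
    """Toy separable data (an assumption for this sketch): class +1 on a grid
    in the positive quadrant, class -1 mirrored through the origin."""
    X, y = [], []
    for a in range(1, k + 1):
        for b in range(1, k + 1):
            X.append([a * step, b * step])
            y.append(1)
            X.append([-a * step, -b * step])
            y.append(-1)
    return X, y


def train_linear_svm(X, y, C=10.0, epochs=50, shrink_after=None, seed=0):
    """Dual coordinate descent for a linear SVM without a bias term.

    shrink_after: consecutive passes a sample must stay at alpha == 0 with a
    satisfied margin before it is eliminated; None disables shrinking.
    Returns (w, alpha, visits), where visits counts gradient evaluations.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    sq_norm = [dot(x, x) for x in X]  # assumes no all-zero sample
    idle = [0] * n                    # consecutive passes without contribution
    active = list(range(n))
    visits = 0
    for _ in range(epochs):
        rng.shuffle(active)
        keep = []
        for i in active:
            visits += 1
            g = y[i] * dot(w, X[i]) - 1.0  # gradient of the dual objective
            if alpha[i] == 0.0 and g >= 0.0:
                # Sample currently does not contribute to the boundary.
                idle[i] += 1
                if shrink_after is None or idle[i] < shrink_after:
                    keep.append(i)
                continue  # either kept untouched or eliminated (shrunk)
            idle[i] = 0
            keep.append(i)
            new_alpha = min(max(alpha[i] - g / sq_norm[i], 0.0), C)
            step = (new_alpha - alpha[i]) * y[i]
            alpha[i] = new_alpha
            if step != 0.0:
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
        active = keep
    return w, alpha, visits


def margin_violators(X, y, w, tol=1e-6):
    """Indices of samples whose margin y * (w . x) falls below 1 - tol."""
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if yi * dot(w, xi) < 1.0 - tol]


if __name__ == "__main__":
    X, y = make_grid_data()
    for k in (None, 10, 2):
        w, alpha, visits = train_linear_svm(X, y, shrink_after=k)
        n_sv = sum(1 for a in alpha if a > 0.0)
        print(f"shrink_after={k}: visits={visits}, nonzero alphas={n_sv}, "
              f"violators={len(margin_violators(X, y, w))}")
```

In this sketch, `margin_violators` stands in for the accuracy safeguard: if it reports eliminated samples after training, those samples were shrunk too early and would have to be returned to the active set before continuing. How the paper's algorithm detects and repairs such cases across processes is not shown here.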

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1236333
Report Number(s):
PNNL-SA-110940
Country of Publication:
United States
Language:
English

Similar Records

Machine Learning Toolkit for Extreme Scale
Software · Mon Mar 31 00:00:00 EDT 2014 · OSTI ID:1231737

Machine Learning Toolkit for Extreme Scale
Software · Sat Mar 29 20:00:00 EDT 2014 · OSTI ID:code-2638

Randomized Sampling for Large Data Applications of SVM
Conference · Sat Dec 31 23:00:00 EST 2011 · OSTI ID:1059336
