OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Parallel k-means++ for Multiple Shared-Memory Architectures

Abstract

In recent years, k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high-performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition, we present a visual approach for showing which platform performs k-means++ fastest for varying data sizes.
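For readers unfamiliar with the seeding step the abstract refers to, the following is a minimal sketch of the standard sequential k-means++ seeding (D-squared sampling) that the paper parallelizes exactly. It is an illustration only, not the authors' implementation, and every name in it is invented for the example; the per-point distance update is the natural target for the shared-memory parallelism the abstract describes.

// Minimal sketch of standard k-means++ seeding (Arthur & Vassilvitskii, 2007).
// Illustrative only; not the code from this paper. The distance-update loop
// below is the hot spot that a shared-memory parallelization targets.
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Point = std::vector<double>;

static double sqDist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Choose k seed centers from `data` by D^2 sampling; returns their indices.
std::vector<std::size_t> kmeansppSeed(const std::vector<Point>& data,
                                      std::size_t k, std::mt19937& rng) {
    const std::size_t n = data.size();
    std::vector<std::size_t> centers;
    centers.reserve(k);

    // First center: chosen uniformly at random.
    std::uniform_int_distribution<std::size_t> uniform(0, n - 1);
    centers.push_back(uniform(rng));

    // D2[i] = squared distance from point i to its nearest chosen center.
    std::vector<double> D2(n, std::numeric_limits<double>::max());

    while (centers.size() < k) {
        const Point& c = data[centers.back()];
        // Each iteration is independent, so this loop parallelizes directly
        // (e.g., with OpenMP; the pragma is ignored when OpenMP is disabled).
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            const double d = sqDist(data[i], c);
            if (d < D2[i]) D2[i] = d;
        }
        // Draw the next center with probability proportional to D2[i];
        // already-chosen points have D2 = 0 and are never redrawn.
        std::discrete_distribution<std::size_t> pick(D2.begin(), D2.end());
        centers.push_back(pick(rng));
    }
    return centers;
}

One plausible further step, not shown in this sketch, is parallelizing the weighted draw itself with a prefix sum over the D2 array; that is the kind of consideration a GPU or Cray XMT implementation would face.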

Authors:
Mackey, Patrick S.; Lewis, Robert R.
Publication Date:
September 2016
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1334876
Report Number(s):
PNNL-SA-116273
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: 45th International Conference on Parallel Processing (ICPP 2016), August 16-19, 2016, Philadelphia, PA, 93-102
Country of Publication:
United States
Language:
English
Subject:
data clustering; GPGPU; multithreaded architecture; OpenMP

Citation Formats

Mackey, Patrick S., and Lewis, Robert R. Parallel k-means++ for Multiple Shared-Memory Architectures. United States: N. p., 2016. Web. doi:10.1109/ICPP.2016.18.
Mackey, Patrick S., & Lewis, Robert R. Parallel k-means++ for Multiple Shared-Memory Architectures. United States. doi:10.1109/ICPP.2016.18.
Mackey, Patrick S., and Lewis, Robert R. 2016. "Parallel k-means++ for Multiple Shared-Memory Architectures". United States. doi:10.1109/ICPP.2016.18.
@inproceedings{osti_1334876,
title = {Parallel k-means++ for Multiple Shared-Memory Architectures},
author = {Mackey, Patrick S. and Lewis, Robert R.},
abstractNote = {In recent years, k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high-performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition, we present a visual approach for showing which platform performs k-means++ fastest for varying data sizes.},
booktitle = {45th International Conference on Parallel Processing (ICPP 2016)},
pages = {93-102},
doi = {10.1109/ICPP.2016.18},
place = {United States},
year = {2016},
month = {9}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records
  • In this paper we examine the use of a shared memory programming model to address the problem of portability of application codes between distributed memory and shared memory architectures. We do this with an extension of the Parallel C Preprocessor. The extension, borrowed from Split-C and AC, uses type qualifiers instead of storage class modifiers to declare variables that are shared among processors. The type qualifier declaration supports an abstract shared memory facility on distributed memory machines while making direct use of hardware support on shared memory architectures. Our benchmarking study spans a wide range of shared memory and distributed memory platforms. Benchmarks include Gaussian elimination with back substitution, a two-dimensional fast Fourier transform, and a matrix-matrix multiply. We find that the type-qualifier-based shared memory programming model is capable of efficiently spanning both distributed memory and shared memory architectures. Although the resulting shared memory programming model is portable, it does not remove the need to arrange for overlapped or blocked remote memory references on platforms that require these tuning measures in order to obtain good performance.
  • An efficient three-dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between the two differing architectures are made.
  • String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications, string matching requires a combination of (sometimes all of) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. Many software-based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software-based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs, and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores), and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.
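As a point of reference for the record above, the following is a minimal sequential sketch of the Aho-Corasick automaton (a pattern trie with breadth-first-built failure links). It is illustrative only and does not reflect the optimized multicore, Cray XMT, or GPU implementations that paper benchmarks; all names are invented for the example.

// Minimal sequential Aho-Corasick sketch: a trie over the patterns plus
// failure links built breadth-first. Illustrative only; all names invented.
#include <cstddef>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// One automaton state: outgoing trie edges, failure link, matched pattern ids.
struct State {
    std::map<char, int> next;
    int fail = 0;
    std::vector<int> out;
};

struct AhoCorasick {
    std::vector<State> st{1};  // state 0 is the root

    // Insert one pattern into the trie.
    void add(const std::string& pat, int id) {
        int cur = 0;
        for (char c : pat) {
            if (!st[cur].next.count(c)) {
                st[cur].next[c] = static_cast<int>(st.size());
                st.emplace_back();
            }
            cur = st[cur].next[c];
        }
        st[cur].out.push_back(id);
    }

    // Build failure links breadth-first; children of the root fail to the root.
    void build() {
        std::queue<int> q;
        for (const auto& kv : st[0].next) q.push(kv.second);
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            for (const auto& kv : st[u].next) {
                const char c = kv.first;
                const int v = kv.second;
                int f = st[u].fail;
                while (f != 0 && !st[f].next.count(c)) f = st[f].fail;
                st[v].fail = st[f].next.count(c) ? st[f].next[c] : 0;
                // Inherit matches visible through the failure link.
                const std::vector<int>& inh = st[st[v].fail].out;
                st[v].out.insert(st[v].out.end(), inh.begin(), inh.end());
                q.push(v);
            }
        }
    }

    // Scan the text once, reporting every (pattern id, end index) match.
    void search(const std::string& text) const {
        int cur = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            const char c = text[i];
            while (cur != 0 && !st[cur].next.count(c)) cur = st[cur].fail;
            if (st[cur].next.count(c)) cur = st[cur].next.at(c);
            for (int id : st[cur].out)
                std::cout << "pattern " << id << " ends at index " << i << "\n";
        }
    }
};

int main() {
    AhoCorasick ac;
    ac.add("he", 0);
    ac.add("she", 1);
    ac.add("his", 2);
    ac.build();
    ac.search("ushers");  // reports "she" and "he" ending at index 3
}

The sequential automaton is a useful mental model for why the algorithm suits multithreaded hardware: each text character advances the state machine with a handful of memory lookups, so independent text chunks or streams can be scanned in parallel.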