OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Parallel k-means Clustering of Geospatial Data Sets Using Manycore CPU Architectures

Abstract

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of weather, climate, ecological, and other geoscientific data sets fused from disparate sources. Many of the standard tools used on individual workstations are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. Here, we describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatial and geospatiotemporal data, and discuss algorithmic modifications and code optimizations we have made to enable it to effectively use parallel machines based on novel CPU architectures (such as the Intel “Knights Landing” Xeon Phi and Skylake Xeon processors) with many cores and hardware threads, and employing significant single instruction, multiple data (SIMD) parallelism. We outline some applications of the code in ecology and climate science contexts and present a detailed discussion of the performance of the code for one such application, LiDAR-derived vertical vegetation structure classification.

Authors:
 Mills, Richard T. [1]; Sripathi, Vamsi [2]; Kumar, Jitendra [1]; Sreepathi, Sarat [1]; Hoffman, Forrest M. [1]; Hargrove, William Walter [3]
  1. ORNL
  2. Intel Corporation
  3. United States Department of Agriculture (USDA), United States Forest Service (USFS)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1491319
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Data Mining Workshops, Singapore, 11/17/2018 - 11/20/2018
Country of Publication:
United States
Language:
English

Citation Formats

Mills, Richard T., Sripathi, Vamsi, Kumar, Jitendra, Sreepathi, Sarat, Hoffman, Forrest M., and Hargrove, William Walter. Parallel k-means Clustering of Geospatial Data Sets Using Manycore CPU Architectures. United States: N. p., 2018. Web.
Mills, Richard T., Sripathi, Vamsi, Kumar, Jitendra, Sreepathi, Sarat, Hoffman, Forrest M., & Hargrove, William Walter. Parallel k-means Clustering of Geospatial Data Sets Using Manycore CPU Architectures. United States.
Mills, Richard T., Sripathi, Vamsi, Kumar, Jitendra, Sreepathi, Sarat, Hoffman, Forrest M., and Hargrove, William Walter. Thu . "Parallel k-means Clustering of Geospatial Data Sets Using Manycore CPU Architectures". United States. https://www.osti.gov/servlets/purl/1491319.
@article{osti_1491319,
title = {Parallel k-means Clustering of Geospatial Data Sets Using Manycore CPU Architectures},
author = {Mills, Richard T. and Sripathi, Vamsi and Kumar, Jitendra and Sreepathi, Sarat and Hoffman, Forrest M. and Hargrove, William Walter},
abstractNote = {The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of weather, climate, ecological, and other geoscientific data sets fused from disparate sources. Many of the standard tools used on individual workstations are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. Here, we describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatial and geospatiotemporal data, and discuss algorithmic modifications and code optimizations we have made to enable it to effectively use parallel machines based on novel CPU architectures (such as the Intel “Knights Landing” Xeon Phi and Skylake Xeon processors) with many cores and hardware threads, and employing significant single instruction, multiple data (SIMD) parallelism. We outline some applications of the code in ecology and climate science contexts and present a detailed discussion of the performance of the code for one such application, LiDAR-derived vertical vegetation structure classification.},
place = {United States},
year = {2018},
month = {11}
}
