skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data

Abstract

We develop a parallel EM algorithm for multivariate Gaussian mixture models and use it to perform model-based clustering of a large climate data set. Three variants of the EM algorithm are reformulated in parallel and a new variant that is faster is presented. All are implemented using the single program, multiple data (SPMD) programming model, which is able to take advantage of the combined collective memory of large distributed computer architectures to process larger data sets. Displays of the estimated mixture model rather than the data allow us to explore multivariate relationships in a way that scales to arbitrary size data. We study the performance of our methodology on simulated data and apply our methodology to a high resolution climate dataset produced by the community atmosphere model (CAM5). This article has supplementary material online.

Authors:
 [1];  [1];  [1];  [2];  [2]
  1. ORNL
  2. Lawrence Berkeley National Laboratory (LBNL)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1110857
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: Technometrics; Journal Volume: 55; Journal Issue: 4
Country of Publication:
United States
Language:
English

Citation Formats

Chen, Wei-Chen, Ostrouchov, George, Pugmire, Dave, Prabhat,, and Wehner, Michael. A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data. United States: N. p., 2013. Web. doi:10.1080/00401706.2013.826146.
Chen, Wei-Chen, Ostrouchov, George, Pugmire, Dave, Prabhat,, & Wehner, Michael. A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data. United States. doi:10.1080/00401706.2013.826146.
Chen, Wei-Chen, Ostrouchov, George, Pugmire, Dave, Prabhat,, and Wehner, Michael. Tue . "A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data". United States. doi:10.1080/00401706.2013.826146.
@article{osti_1110857,
title = {A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data},
author = {Chen, Wei-Chen and Ostrouchov, George and Pugmire, Dave and Prabhat, and Wehner, Michael},
abstractNote = {We develop a parallel EM algorithm for multivariate Gaussian mixture models and use it to perform model-based clustering of a large climate data set. Three variants of the EM algorithm are reformulated in parallel and a new variant that is faster is presented. All are implemented using the single program, multiple data (SPMD) programming model, which is able to take advantage of the combined collective memory of large distributed computer architectures to process larger data sets. Displays of the estimated mixture model rather than the data allow us to explore multivariate relationships in a way that scales to arbitrary size data. We study the performance of our methodology on simulated data and apply our methodology to a high resolution climate dataset produced by the community atmosphere model (CAM5). This article has supplementary material online.},
doi = {10.1080/00401706.2013.826146},
journal = {Technometrics},
number = 4,
volume = 55,
place = {United States},
year = {Tue Jan 01 00:00:00 EST 2013},
month = {Tue Jan 01 00:00:00 EST 2013}
}
  • This work was originally motivated the need to identify and track blob filaments, a feature often associated with instability in magnetically confined fusion plasma. Understanding and mitigating such instability would improve fusion reactors and make fusion a truly inexhaustible source of clean energy. Similar spatio-temporal features are important in many other applications, for example, ignition kernels in combustion and tumor cells in a medical image. This work presents an approach for extracting these features by dividing the overall task into three steps: local identification of feature cells, grouping feature cells into extended feature, and tracking movement of feature through overlappingmore » in space. Through our extensive work in parallelization, we demonstrate that this approach can effectively make use of a large number of compute nodes to detect and track blobs in fusion plasma. On a set of 4.3TB data, we observed linear speedup on 1024 processes and completed blob detection in less than three milliseconds using Edison, a Cray XC30 system at NERSC.« less
  • Cited by 1
  • A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offer unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements has led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like themore » Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.« less
  • We propose a model-based approach for clustering time series regression data in an unsupervised machine learning framework to identify groups under the assumption that each mixture component follows a Gaussian autoregressive regression model of order p. Given the number of groups, the traditional maximum likelihood approach of estimating the parameters using the expectation-maximization (EM) algorithm can be employed, although it is computationally demanding. The somewhat fast tune to the EM folk song provided by the Alternating Expectation Conditional Maximization (AECM) algorithm can alleviate the problem to some extent. In this article, we develop an alternative partial expectation conditional maximization algorithmmore » (APECM) that uses an additional data augmentation storage step to efficiently implement AECM for finite mixture models. Results on our simulation experiments show improved performance in both fewer numbers of iterations and computation time. The methodology is applied to the problem of clustering mutual funds data on the basis of their average annual per cent returns and in the presence of economic indicators.« less