OSTI.GOV - U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Distributed OpenCL Framework using Redundant Computation and Data Replication

Abstract

Applications written solely in OpenCL or CUDA cannot execute on a cluster as a whole. Most previous approaches that extend these programming models to clusters are based on a common idea: designating a centralized host node and coordinating the other nodes with the host for computation. However, the centralized host node is a serious performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework called SnuCL-D for large-scale clusters. SnuCL-D's remote device virtualization provides an OpenCL application with an illusion that all compute devices in a cluster are confined in a single node. To reduce the amount of control-message and data communication between nodes, SnuCL-D replicates the OpenCL host program execution and data in each node. We also propose a new OpenCL host API function and a queueing optimization technique that significantly reduce the overhead incurred by the previous centralized approaches. To show the effectiveness of SnuCL-D, we evaluate SnuCL-D with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster and a medium-scale GPU cluster.
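To make the host-side perspective concrete, below is a minimal sketch, not taken from the paper, of ordinary OpenCL host code that enumerates every visible compute device. Under SnuCL-D's remote device virtualization, this same unmodified program would list all compute devices in the cluster as if they were attached to the local node. Every call is standard OpenCL; the fixed array bounds are arbitrary choices for the sketch, and nothing here is SnuCL-D-specific.

    /* Minimal sketch: enumerate all OpenCL platforms and devices.
     * Under SnuCL-D, remote cluster devices would appear here exactly
     * like local ones; no source changes are required. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS)
            return 1;
        if (num_platforms > 8) num_platforms = 8;

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[64];
            cl_uint num_devices = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                               64, devices, &num_devices) != CL_SUCCESS)
                continue;
            if (num_devices > 64) num_devices = 64;

            for (cl_uint d = 0; d < num_devices; ++d) {
                char name[256] = {0};
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(name), name, NULL);
                printf("platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }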

Authors:
 Kim, Junghyun [1]; Jo, Gangwon [1]; Jung, Jaehoon [1]; Lee, Jaejin [1]
  1. Seoul National University, Korea
Publication Date:
2016
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1295154
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: ACM SIGPLAN Conference on Programming Language Design and Implementation, Santa Barbara, CA, USA, June 13-17, 2016
Country of Publication:
United States
Language:
English

Citation Formats

Kim, Junghyun, Jo, Gangwon, Jung, Jaehoon, and Lee, Jaejin. A Distributed OpenCL Framework using Redundant Computation and Data Replication. United States: N. p., 2016. Web.
Kim, Junghyun, Jo, Gangwon, Jung, Jaehoon, & Lee, Jaejin. A Distributed OpenCL Framework using Redundant Computation and Data Replication. United States.
Kim, Junghyun, Jo, Gangwon, Jung, Jaehoon, and Lee, Jaejin. 2016. "A Distributed OpenCL Framework using Redundant Computation and Data Replication". United States. https://www.osti.gov/servlets/purl/1295154.
@article{osti_1295154,
title = {A Distributed OpenCL Framework using Redundant Computation and Data Replication},
author = {Kim, Junghyun and Jo, Gangwon and Jung, Jaehoon and Lee, Jaejin},
abstractNote = {Applications written solely in OpenCL or CUDA cannot execute on a cluster as a whole. Most previous approaches that extend these programming models to clusters are based on a common idea: designating a centralized host node and coordinating the other nodes with the host for computation. However, the centralized host node is a serious performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework called SnuCL-D for large-scale clusters. SnuCL-D's remote device virtualization provides an OpenCL application with an illusion that all compute devices in a cluster are confined in a single node. To reduce the amount of control-message and data communication between nodes, SnuCL-D replicates the OpenCL host program execution and data in each node. We also propose a new OpenCL host API function and a queueing optimization technique that significantly reduce the overhead incurred by the previous centralized approaches. To show the effectiveness of SnuCL-D, we evaluate SnuCL-D with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster and a medium-scale GPU cluster.},
place = {United States},
year = {2016}
}

Other availability:
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar records:
  • As the core count of HPC machines continues to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed. (An illustrative sketch of rank-level redundancy in MPI appears after this list.)
  • PVM is an inexpensive but extremely effective tool which allows a researcher to use workstations as nodes in a parallel processing environment to perform large-scale computations. The numerical approximation and visualization of seismic waves propagating in the earth strains today's largest supercomputers. The authors present timings and visualization for large earth models run on a ring of IBM RS/6000s which illustrate PVM's capability of handling large-scale problems.
  • Global Arrays (GA) is a software system from Pacific Northwest National Laboratory that provides an efficient, portable, and parallel shared-memory programming interface for manipulating distributed dense arrays. Using a combination of GA and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Scalability studies will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
  • The authors present a visualization tool for the monitoring and debugging of codes run in a parallel and distributed computing environment, called Lilith Lights. This tool can be used both for debugging parallel codes and for resource management of clusters. It was developed under Lilith, a framework for creating scalable software tools for distributed computing. The use of Lilith provides scalable, non-invasive debugging, as opposed to other commonly used software debugging and visualization tools. Furthermore, by implementing the visualization tool in software rather than in hardware (as available on some MPPs), Lilith Lights is easily transferable to other machines and well adapted for use on distributed clusters of machines. The information provided in a clustered environment can further be used for resource management of the cluster. In this paper, the authors introduce Lilith Lights, discuss its use on the Computational Plant cluster at Sandia National Laboratories, show its design and development under the Lilith framework, and present metrics for resource use and performance.
  • The current trend towards multicore/manycore and accelerated architectures presents challenges both in portability and in the choices that developers must make about how to use the resources these architectures provide. This paper explores some of the possibilities enabled by the Open Computing Language (OpenCL) and proposes a programming model that allows developers and scientists to more fully subscribe hybrid compute nodes while, at the same time, reducing the impact of system failure.
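
For context on the first item above, here is an illustrative sketch, not taken from any of the cited works, of one way rank-level redundancy can be expressed in plain MPI: the world communicator is split so that each primary rank has a replica executing the same logical rank. The communicator name app_comm and the explicit split are assumptions for illustration only; the system described above intercepts MPI calls transparently rather than requiring application changes like these.

    /* Illustrative only: explicit primary/replica rank pairing via
     * MPI_Comm_split. Run with an even number of ranks, e.g.
     *   mpirun -np 8 ./redundant */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        if (world_size < 2 || world_size % 2 != 0) {
            if (world_rank == 0)
                fprintf(stderr, "run with an even number of ranks\n");
            MPI_Finalize();
            return 1;
        }

        /* Ranks [0, half) are primaries; ranks [half, size) are replicas.
         * A primary and its replica share the same logical rank. */
        int half = world_size / 2;
        int logical = world_rank % half;
        int is_replica = world_rank >= half;

        /* Primaries and replicas each get their own communicator,
         * ordered by logical rank, on which the application would run. */
        MPI_Comm app_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_replica, logical, &app_comm);

        int app_rank;
        MPI_Comm_rank(app_comm, &app_rank);
        printf("world rank %d executes logical rank %d as %s\n",
               world_rank, app_rank, is_replica ? "replica" : "primary");

        MPI_Comm_free(&app_comm);
        MPI_Finalize();
        return 0;
    }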