Title: Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA

Abstract

Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.

Authors:
Jin, Zheming; Johnson, Iris; Finkel, Hal
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
Argonne National Laboratory - Argonne Leadership Computing Facility
OSTI Identifier:
1481854
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 03/21/18 - 03/23/18, Cambridge, UK
Country of Publication:
United States
Language:
English
Subject:
Base64 Encoding; FPGA; OpenCL

Citation Formats

Jin, Zheming, Johnson, Iris, and Finkel, Hal. Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA. United States: N. p., 2018. Web. doi:10.1109/PDP2018.2018.00046.
Jin, Zheming, Johnson, Iris, & Finkel, Hal. Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA. United States. doi:10.1109/PDP2018.2018.00046.
Jin, Zheming, Johnson, Iris, and Finkel, Hal. Mon . "Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA". United States. doi:10.1109/PDP2018.2018.00046.
@article{osti_1481854,
title = {Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA},
author = {Jin, Zheming and Johnson, Iris and Finkel, Hal},
abstractNote = {Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.},
doi = {10.1109/PDP2018.2018.00046},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {1}
}

Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
