Exploring Integer Sum Reduction using Atomics on Intel CPU
Atomic functions are useful for updating a shared variable from multiple threads, implementing barrier synchronizations, constructing complex data structures, and building high-level frameworks. In this paper, we focus on the evaluation and analysis of integer sum reduction, a common data-parallel primitive. We convert the sequential reduction into parallel OpenCL implementations on the CPU. We also develop three micro-kernels, which allow us to understand the relationships between kernel performance and the operations involved in reduction. The micro-kernel results show that kernel performance improves linearly as the work-group size increases, and that there is a sweet spot in the trade-off between work-group size and barrier-synchronization overhead. The performance of the atomics over local memory is not sensitive to the work-group size. The sum reduction kernel with vectorized memory accesses outperforms the baseline kernel for a wide range of work-group sizes; however, the vectorization efficiency shrinks as the work-group size grows. We also find that the vendor’s default OpenCL kernel optimization does not improve kernel performance. On average, disabling the optimization reduces the execution time of the kernel with vectorized memory accesses by 15%. We attribute the performance drop to the fact that the default kernel optimizations instantiate a large number of atomics over global memory when implicitly vectorizing the kernel computation.
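As a rough illustration of the approach described in the abstract, the following OpenCL C sketch shows a baseline integer sum reduction: each work-item adds its element to a work-group accumulator with an atomic over local memory, and one work-item per group commits the partial sum with an atomic over global memory. The kernel name and signature are hypothetical, not the authors' code.

```c
// Hypothetical baseline sum-reduction kernel (illustrative sketch only).
// Assumes the host zero-initializes *result before launch.
__kernel void sum_reduce(__global const int *input,
                         __global int *result,
                         const int n)
{
    __local int local_sum;

    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // One work-item per group initializes the local accumulator.
    if (lid == 0)
        local_sum = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Atomic over local memory: every work-item contributes its element.
    if (gid < n)
        atomic_add(&local_sum, input[gid]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Atomic over global memory: a single update per work-group.
    if (lid == 0)
        atomic_add(result, local_sum);
}
```

A vectorized variant (also hypothetical) in the spirit of the kernel with vectorized memory accesses loads an int4 per work-item and pre-reduces it before touching the local accumulator, so the per-element cost of the local atomic is amortized over four integers.

```c
// Hypothetical vectorized variant: n4 is the number of int4 elements.
__kernel void sum_reduce_vec4(__global const int4 *input,
                              __global int *result,
                              const int n4)
{
    __local int local_sum;

    int gid = get_global_id(0);
    int lid = get_local_id(0);

    if (lid == 0)
        local_sum = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < n4) {
        int4 v = input[gid];
        // Pre-reduce the vector lanes, then issue one local atomic.
        atomic_add(&local_sum, v.x + v.y + v.z + v.w);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0)
        atomic_add(result, local_sum);
}
```

Restricting the global atomic to one per work-group reflects the abstract's observation that a large number of atomics over global memory hurts performance.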
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Office of Science
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1515074
- Country of Publication: United States
- Language: English