The recommended work-group size for kernels is a multiple of 16, which equals the SIMD width for the float and int data types. The automatic vectorization module packs the work-items into SIMD packets of 16 items (for double as well), and processes the rest (the “tail”) of the work-group in a scalar way. In other words, a work-group with the size of 2*SIMD_WIDTH executes faster than one with the size of 2*SIMD_WIDTH-1.
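The packet and tail split described above can be sketched in plain C. This is only an arithmetic model, not the vectorizer's actual logic: the names wg_split, split_work_group, and step_count are hypothetical, and the step count assumes one step per 16-item packet plus one step per scalar tail item.

```c
#include <stddef.h>

#define SIMD_WIDTH 16  /* SIMD width stated in the text for float and int */

/* Hypothetical helper type: how one work-group splits into vector and scalar work. */
typedef struct {
    size_t packets;  /* number of full 16-item SIMD packets */
    size_t tail;     /* leftover work-items processed one at a time */
} wg_split;

static wg_split split_work_group(size_t wg_size)
{
    wg_split s;
    s.packets = wg_size / SIMD_WIDTH;
    s.tail    = wg_size % SIMD_WIDTH;
    return s;
}

/* Rough model of the work per work-group: one step per packet plus one step
   per tail item. Real costs depend on the kernel and the compiler. */
static size_t step_count(size_t wg_size)
{
    wg_split s = split_work_group(wg_size);
    return s.packets + s.tail;
}
```

Under this model, a work-group of 2*SIMD_WIDTH items needs two vector steps, while one of 2*SIMD_WIDTH-1 items needs one vector step plus 15 scalar steps, which is why the slightly smaller work-group can be the slower one.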
For example, a work-group of 64 elements is assigned to one hardware thread. The thread iterates over work-items in a loop of 4 iterations, with 16-wide vector instructions within each iteration. In some cases, the compiler may decide to loop (unroll) by 32 elements instead to expose more instruction-level parallelism.
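As a rough illustration of that loop structure, the following C sketch emulates one hardware thread walking a work-group. It is a scalar emulation under stated assumptions: the inner loop stands in for a single 16-wide vector instruction, the doubling operation is a made-up kernel body, and run_work_group is a hypothetical name, not part of any OpenCL runtime.

```c
#include <stddef.h>

#define SIMD_WIDTH 16  /* SIMD width stated in the text for float and int */

/* Processes one work-group and returns the number of vector iterations.
   Each pass of the outer loop models one 16-wide vector iteration. */
static size_t run_work_group(const float *in, float *out, size_t wg_size)
{
    size_t iterations = 0;
    size_t base = 0;

    /* Vector part: full packets of SIMD_WIDTH work-items. */
    for (; base + SIMD_WIDTH <= wg_size; base += SIMD_WIDTH) {
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            out[base + lane] = 2.0f * in[base + lane];  /* hypothetical kernel body */
        ++iterations;
    }

    /* Scalar tail: only runs when wg_size is not a multiple of SIMD_WIDTH. */
    for (; base < wg_size; ++base)
        out[base] = 2.0f * in[base];

    return iterations;
}
```

For a work-group of 64 items the outer loop runs 4 times and the tail loop does not run at all.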
Let the OpenCL™ implementation automatically determine the optimal work-group size for a given kernel and global work size. To do so, pass NULL as the pointer to the work-group size (the local_work_size argument) when calling clEnqueueNDRangeKernel. Consider experimenting with setting the work-group size explicitly in the following cases:
When experimenting with the work-group size, you need to be aware of the following:
Query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo, and set the work-group size accordingly.
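One possible way to turn the queried multiple into an explicit work-group size is sketched below in plain C. The helper pick_work_group_size and its parameters are hypothetical. It assumes the OpenCL 1.x rule that the global work size must be evenly divisible by the work-group size, and it takes the preferred multiple and the maximum work-group size as inputs that the caller has already obtained from clGetKernelWorkGroupInfo (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE).

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest work-group size that
   - is a multiple of preferred_multiple,
   - does not exceed max_wg_size, and
   - divides global_size evenly.
   Returns 0 when no such size exists; in that case the caller can pass
   NULL as local_work_size and let the implementation decide. */
static size_t pick_work_group_size(size_t global_size,
                                   size_t preferred_multiple,
                                   size_t max_wg_size)
{
    if (preferred_multiple == 0)
        return 0;

    /* Start from the largest multiple that fits and walk downwards. */
    size_t candidate = (max_wg_size / preferred_multiple) * preferred_multiple;
    for (; candidate >= preferred_multiple; candidate -= preferred_multiple) {
        if (global_size % candidate == 0)
            return candidate;
    }
    return 0;
}
```

A non-zero result would be passed as the local_work_size argument of clEnqueueNDRangeKernel. Whether the largest valid size is also the fastest is not guaranteed, so it is a starting point for experiments rather than a final answer.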