Work-Group Size Considerations

The recommended work-group size for kernels is a multiple of 4, 8, or 16, depending on the Single Instruction Multiple Data (SIMD) width that the CPU supports for the float and int data types. The automatic vectorization module packs the work-items into SIMD packets of 4, 8, or 16 items (for double as well) and processes the rest (the “tail”) of the work-group in a scalar way. In other words, a work-group of size 2*SIMD_WIDTH executes faster than a work-group of size 2*SIMD_WIDTH-1, because the smaller work-group leaves a tail of SIMD_WIDTH-1 work-items that are processed one at a time.
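The packet and tail arithmetic described above can be sketched in plain C. This is only an illustrative model: the type wg_split and the functions split_work_group and total_steps are hypothetical names introduced here, not part of the OpenCL API, and the "step" count is a simplification that treats one SIMD packet and one scalar work-item as one step each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how a work-group divides under vectorization. */
typedef struct {
    size_t vector_packets; /* full SIMD packets of simd_width work-items */
    size_t scalar_tail;    /* leftover work-items processed one at a time */
} wg_split;

/* Split a work-group into full SIMD packets plus a scalar tail. */
wg_split split_work_group(size_t wg_size, size_t simd_width)
{
    assert(simd_width > 0); /* assumption: caller passes 4, 8, or 16 */
    wg_split s;
    s.vector_packets = wg_size / simd_width;
    s.scalar_tail = wg_size % simd_width;
    return s;
}

/* Simplified cost model: one step per packet, one step per tail item. */
size_t total_steps(size_t wg_size, size_t simd_width)
{
    wg_split s = split_work_group(wg_size, simd_width);
    return s.vector_packets + s.scalar_tail;
}
```

Under this model with SIMD_WIDTH of 8, a work-group of 16 splits into 2 packets with no tail, while a work-group of 15 splits into 1 packet plus 7 scalar work-items. The smaller work-group therefore takes more steps, which matches the 2*SIMD_WIDTH versus 2*SIMD_WIDTH-1 comparison in the text.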

For example, a work-group of 64 elements is assigned to one hardware thread. The thread iterates over the work-items in a loop of 4 iterations, using 16-wide vector instructions within each iteration. In some cases, the compiler may decide to unroll the loop so that each iteration covers 32 elements instead, which exposes more instruction-level parallelism.
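The loop structure in this example can be modeled in plain C as follows. This is a sketch of the idea only, not the code the OpenCL compiler actually generates: run_work_group is a hypothetical function, the kernel body is assumed to be a simple element-wise addition, and the inner lane loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Model of one hardware thread executing a whole work-group.
 * Returns the number of vector iterations that were executed. */
size_t run_work_group(const float *a, const float *b, float *out,
                      size_t wg_size, size_t simd_width)
{
    assert(simd_width > 0); /* assumption: 4, 8, 16, or an unrolled multiple */
    size_t iterations = 0;
    size_t i = 0;

    /* Vector part: each iteration handles simd_width work-items at once. */
    for (; i + simd_width <= wg_size; i += simd_width) {
        for (size_t lane = 0; lane < simd_width; ++lane)
            out[i + lane] = a[i + lane] + b[i + lane];
        ++iterations;
    }

    /* Scalar tail: remaining work-items, one at a time. */
    for (; i < wg_size; ++i)
        out[i] = a[i] + b[i];

    return iterations;
}
```

Calling run_work_group with wg_size of 64 and simd_width of 16 performs 4 vector iterations. Passing 32 as the width models the unrolled case, where the same 64 elements are covered in 2 iterations.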

It is recommended to let the OpenCL™ implementation automatically determine the optimal work-group size for a kernel: pass NULL as the local_work_size argument (the pointer to the work-group size) when calling clEnqueueNDRangeKernel.

If you want to experiment with work-group size, you need to consider the following: