Work-Group Size Considerations for Intel® Xeon Phi™ Coprocessors

The recommended work-group size for kernels is a multiple of 16, which equals the SIMD width for the float and int data types. The automatic vectorization module packs the work-items into SIMD packets of 16 items (for double as well) and processes the rest (the “tail”) of the work-group in a scalar way. In other words, a work-group of size 2*SIMD_WIDTH executes faster than one of size 2*SIMD_WIDTH-1.
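The packet-and-tail split described above is plain integer arithmetic. The following C sketch is only an illustration: the helper names full_packets, scalar_tail, and round_up_to_simd are hypothetical and not part of any OpenCL API, and SIMD_WIDTH is fixed at 16 as stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 16 work-items per SIMD packet, as stated in the text. */
#define SIMD_WIDTH 16

/* Hypothetical helper: number of full 16-wide packets in a work-group. */
size_t full_packets(size_t work_group_size)
{
    return work_group_size / SIMD_WIDTH;
}

/* Hypothetical helper: work-items left for scalar "tail" processing. */
size_t scalar_tail(size_t work_group_size)
{
    return work_group_size % SIMD_WIDTH;
}

/* Hypothetical helper: smallest multiple of SIMD_WIDTH that is >= n. */
size_t round_up_to_simd(size_t n)
{
    /* Guard against overflow in the addition below. */
    assert(n <= (size_t)-1 - (SIMD_WIDTH - 1));
    return (n + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;
}
```

Under these assumptions, a work-group of 32 items splits into two full packets with no tail, while a work-group of 31 items splits into one full packet plus 15 items processed one at a time. Rounding a candidate size up with round_up_to_simd is one possible way to avoid the tail, provided the kernel tolerates the extra work-items.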

For example, a work-group of 64 elements is assigned to one hardware thread. The thread iterates over the work-items in a loop of 4 iterations, with 16-wide vector instructions within each iteration. In some cases, the compiler may unroll the loop to process 32 elements per iteration instead, which exposes more instruction-level parallelism.
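The loop structure from this example can be emulated in plain C. This is a sketch only, not the code the compiler actually generates: the function run_work_group and the element-wise addition it performs are hypothetical, and the inner loop merely stands in for a single 16-wide vector instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 16 float lanes per vector instruction, as stated in the text. */
#define SIMD_WIDTH 16

/* Hypothetical emulation of one work-group running on one hardware thread.
 * Computes c[i] = a[i] + b[i] for every work-item and returns the number
 * of vector iterations the outer loop executed. */
size_t run_work_group(const float *a, const float *b, float *c,
                      size_t work_group_size)
{
    /* This sketch handles only sizes with no scalar tail. */
    assert(work_group_size % SIMD_WIDTH == 0);

    size_t iterations = 0;
    for (size_t base = 0; base < work_group_size; base += SIMD_WIDTH) {
        /* Stand-in for one 16-wide vector add covering 16 work-items. */
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane) {
            c[base + lane] = a[base + lane] + b[base + lane];
        }
        ++iterations;
    }
    return iterations;
}
```

Calling run_work_group with a work-group size of 64 runs the outer loop 4 times, matching the example above. Unrolling by 32 would correspond to placing two such vector operations in each iteration of the outer loop.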

Let the OpenCL™ implementation automatically determine the optimal work-group size for a given kernel and global work size. To do so, pass NULL as the local_work_size argument (the pointer to the work-group size) when calling clEnqueueNDRangeKernel. Consider experimenting with setting the work-group size explicitly in the following cases:

When experimenting with the work-group size, you need to be aware of the following:

See Also

Threading: Achieving Parallelism Between Work-Groups
Vectorization: SIMD Processing Within a Work-group