Because work-groups are independent, different (hardware) threads execute
them concurrently. The number of work-groups should therefore be no less
than the number of compute units. To query the exact number of compute
units, call clGetDeviceInfo with the CL_DEVICE_MAX_COMPUTE_UNITS parameter.
A larger number of work-groups gives the scheduler more flexibility,
especially for kernels that perform only a small amount of computation.
For such kernels, the number of work-groups can be up to 5-10 times the
number of compute units.
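The sizing rule above can be sketched as a small host-side helper. This is an illustration only: suggested_work_groups, global_size_for, and the oversubscription factor are hypothetical names, not part of the OpenCL API, and the compute-unit count is passed in as a plain number rather than queried, so the snippet builds without an OpenCL SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): choose how many work-groups
 * to launch. In a real host program, compute_units would come from
 *   clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
 *                   sizeof(cl_uint), &compute_units, NULL);
 * oversubscription is an assumed tuning factor: 1 gives one work-group
 * per compute unit; 5 to 10 is the range the text suggests for kernels
 * with little computation. */
size_t suggested_work_groups(unsigned compute_units,
                             unsigned oversubscription)
{
    assert(compute_units > 0 && oversubscription > 0);
    return (size_t)compute_units * oversubscription;
}

/* One-dimensional global work size (in work-items) that yields exactly
 * work_groups groups of local_size work-items each. */
size_t global_size_for(size_t work_groups, size_t local_size)
{
    assert(local_size > 0);
    return work_groups * local_size;
}
```

For instance, with 8 compute units, a factor of 5, and a local size of 64, the helpers give 40 work-groups and a global size of 2560 work-items. The right factor depends on the kernel and should be found by measurement.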
Conversely, when the number of work-groups is small relative to, for
example, the value of CL_DEVICE_MAX_COMPUTE_UNITS, even a small change in
the number of work-groups can change performance significantly.
For example, if the number of work-groups equals CL_DEVICE_MAX_COMPUTE_UNITS,
each compute unit processes exactly one work-group, so under ideal
conditions all threads finish at the same time. Now suppose the work-group
size changes so that CL_DEVICE_MAX_COMPUTE_UNITS+1 work-groups are executed
instead. In that case, one thread does twice as much work as the others,
which might double the overall execution time, although some inherent
thread divergence might hide the effect. The negative effect of such
“outstanding” work-groups becomes less pronounced as the number of
work-groups grows, because the imbalance shrinks at the same pace.
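The arithmetic behind this example can be sketched with a simplified model. It assumes that every work-group takes the same time, that each compute unit runs one work-group at a time, and that scheduling costs nothing; real devices will deviate from this. The function names are hypothetical.

```c
#include <assert.h>

/* Number of sequential "rounds" needed when compute_units units each
 * run one work-group at a time: ceil(work_groups / compute_units). */
unsigned rounds_needed(unsigned work_groups, unsigned compute_units)
{
    assert(compute_units > 0);
    return (work_groups + compute_units - 1) / compute_units;
}

/* Modeled run time divided by the run time of a perfectly balanced
 * split. 1.0 means no imbalance; values close to 2.0 mean that one
 * leftover work-group nearly doubles the execution time. */
double imbalance_ratio(unsigned work_groups, unsigned compute_units)
{
    unsigned rounds;
    assert(work_groups > 0);
    rounds = rounds_needed(work_groups, compute_units);
    return (double)rounds * compute_units / work_groups;
}
```

Under this model with 8 compute units, 8 work-groups give a ratio of 1.0, 9 work-groups give 16/9 (about 1.78), and 81 work-groups give 88/81 (about 1.09), which illustrates how the penalty of one extra work-group fades as the total grows.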
Keep the number of work-groups greater than, or at least equal to, the number of compute units.
To achieve better performance and parallelism between work-groups, ensure that executing a work-group takes more than 100,000 clock cycles. With shorter work-groups, switching overhead accounts for a larger share of the run time relative to actual work.