Threading: Achieving Parallelism Between Work-Groups

Because work-groups are independent, different hardware threads can execute them concurrently. The number of work-groups should therefore be at least the number of compute units. To query the exact number of compute units, call clGetDeviceInfo with the CL_DEVICE_MAX_COMPUTE_UNITS parameter. A larger number of work-groups gives the scheduler more flexibility, especially for kernels that do little computation; for such kernels, the number of work-groups can be up to 5-10 times the number of compute units.
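One way to apply this guideline is to derive the work-group size from a target number of work-groups. The following is a minimal sketch, not a prescribed method. The function name pick_local_size and all of its parameters are hypothetical. It assumes a one-dimensional range, a uniform work-group size that divides the global size evenly (as OpenCL 1.x requires), a compute_units value obtained from clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS, and a max_local_size limit obtained, for example, from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <stddef.h>

/*
 * Hypothetical helper: choose the largest work-group size that still yields
 * at least compute_units * groups_per_unit work-groups.
 *
 * global_size     total number of work-items (one dimension)
 * compute_units   value reported for CL_DEVICE_MAX_COMPUTE_UNITS
 * groups_per_unit desired work-groups per compute unit (1 is the minimum;
 *                 up to 5-10 for kernels with little computation)
 * max_local_size  upper limit on the work-group size for this kernel
 *
 * Returns 1 when no larger size reaches the target; the caller then gets
 * as many work-groups as the global size allows.
 */
size_t pick_local_size(size_t global_size, size_t compute_units,
                       size_t groups_per_unit, size_t max_local_size)
{
    if (compute_units == 0)
        compute_units = 1;
    if (groups_per_unit == 0)
        groups_per_unit = 1;

    size_t target_groups = compute_units * groups_per_unit;

    /* Walk down from the largest allowed size to the first one that divides
       the global size evenly and produces enough work-groups. */
    for (size_t local = max_local_size; local > 1; --local) {
        if (global_size % local == 0 && global_size / local >= target_groups)
            return local;
    }
    return 1;
}
```

With 1024 work-items, 8 compute units, 4 work-groups per unit, and a limit of 256, this sketch returns 32, which gives 32 work-groups. Whether that size is actually the fastest depends on the kernel, so treat the result as a starting point for measurement.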

Conversely, when the number of work-groups is small relative to, for example, the value of CL_DEVICE_MAX_COMPUTE_UNITS, even a small change in the number of work-groups can change performance significantly.

For example, if you run a number of work-groups equal to CL_DEVICE_MAX_COMPUTE_UNITS, each compute unit processes exactly one work-group, so under ideal conditions all threads finish at the same time. Now consider the case where the work-group size is changed so that CL_DEVICE_MAX_COMPUTE_UNITS+1 work-groups are executed instead. In this case, one thread does about twice as much work as the others, which might double the overall execution time. Some inherent thread divergence might hide the effect. The negative effect of such “extra” work-groups becomes less pronounced as the number of work-groups grows, because the imbalance decreases at the same pace.
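The arithmetic behind this example can be sketched with a simple model. The model is an idealization, not a description of the actual scheduler: it assumes that every work-group costs the same, that each compute unit runs one work-group at a time, and that the total amount of work is fixed. The function names scheduling_rounds and imbalance_factor are hypothetical.

```c
#include <stddef.h>

/* Number of back-to-back rounds needed when each compute unit runs one
   work-group at a time: ceil(work_groups / compute_units). */
size_t scheduling_rounds(size_t work_groups, size_t compute_units)
{
    if (compute_units == 0)
        return 0;
    return (work_groups + compute_units - 1) / compute_units;
}

/* Estimated slowdown relative to a perfectly balanced split of the same
   total work: rounds / (work_groups / compute_units).
   1.0 means no imbalance. */
double imbalance_factor(size_t work_groups, size_t compute_units)
{
    if (work_groups == 0 || compute_units == 0)
        return 1.0;
    double ideal_rounds = (double)work_groups / (double)compute_units;
    return (double)scheduling_rounds(work_groups, compute_units) / ideal_rounds;
}
```

Under these assumptions, 8 work-groups on 8 compute units give a factor of 1.0, 9 work-groups give about 1.78, and 81 work-groups give about 1.09, which illustrates how the penalty for one extra work-group shrinks as the total count grows.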

Keep the number of work-groups greater than, or at least equal to, the number of compute units.

To achieve better performance and parallelism between work-groups, ensure that executing a work-group takes more than 100,000 clock cycles. With shorter work-groups, switching overhead makes up a larger share of the total time relative to actual work.
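As a rough aid for applying this threshold, the following sketch estimates how many work-items a work-group needs in order to exceed 100,000 cycles. It rests on assumptions that do not come from this guide: that you have a measured or estimated per-work-item cost in cycles, and that the work-items of a group run back to back on one hardware thread without vectorization. The name min_items_per_group is hypothetical.

```c
#include <stddef.h>

/* Threshold from the guideline above: a work-group should take more than
   this many clock cycles. */
#define MIN_WORK_GROUP_CYCLES 100000u

/*
 * Hypothetical helper: smallest number of work-items per work-group whose
 * combined cost exceeds MIN_WORK_GROUP_CYCLES, given an estimated cost per
 * work-item. Returns 0 when the cost is unknown (cycles_per_item == 0).
 */
size_t min_items_per_group(size_t cycles_per_item)
{
    if (cycles_per_item == 0)
        return 0;
    /* items * cycles_per_item must be strictly greater than the threshold. */
    return MIN_WORK_GROUP_CYCLES / cycles_per_item + 1;
}
```

For instance, at an estimated 100 cycles per work-item the sketch returns 1001. Because the per-item cost is only an estimate, profiling remains the reliable way to confirm that a work-group runs long enough.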