Utilizing Software Prefetching

Compiler inserts prefetch instructions into the resulting assembly based on kernel source code analysis. This capability of the compiler is called automatic prefetching. Automatic prefetching works for memory access for the variables in the global and constant address spaces. It is limited to the following scenarios of well-detected memory access pattern, when it is generally beneficial to prefetch data in advance:

The prefetch distance is computed by the compiler, based on the amount of work inside the loop. To benefit from the automatic compiler-generated prefetches, consider the following:

Consider the following examples of automatic software prefetching:

__kernel void my_kernel1 (__global int* a, __global int* b, __global int* c,
                         __global int* d, __global int* e, __global int* f, int m)
{
size_t id = get_global_id(0);
a[id] = b[id] + c[id]; /* a, b and c can be prefetched (if level 1 or more aggressive prefetching level is enabled, below)*/
if (a[id] > 0) //condition doesn’t block the prefetching for the next lines
{
c[id] = d[id]; /* d can be prefetched (if level 2 or more aggressive prefetching level      is enabled, below)*/
d[id] = m; // m is not prefetched
}
e[a[id]] = f[b[id]]; // e and f can be prefetched (if level 3 or more aggressive prefetching level is enabled, below)
}
__kernel void my_kernel2 (__global int* a, __global int* b, __global int* c,
                         __global int* e, __global int* f, int m) 
{
size_t id = get_global_id(0);
for (int i=0; i < m; i++) {
sum += a[id]* b[i] + c[i]; /* b and c can be prefetched (if level 1 or more aggressive prefetching level is enabled, below), a is not prefetched*/
sum += e[id] * f[b[i]]; // e and f are not prefetched
} 
d[id] = sum;
}

Automatic prefetching availability differs according to the level:

The default level is two, which provides the best performance in most cases. In some cases, level three may provide performance gains. In other cases, for example, when the code is not memory-bound but computational intensive, it might be beneficial to disable auto prefetching by setting it to level zero, or restricting it to the obvious cases by setting it to level one. See the above examples for accesses and approximate associated level.

Auto prefetching level is set for each OpenCL program separately by use of the options argument of the clBuildProgram() API call. For example, the following call disables generating auto-prefetching for the program pointed by p_program:

err = clBuildProgram(p_program, 0, NULL, "-auto-prefetch-level=0”,…);

If you identify the accesses that suffer from cache misses despite the automatic prefetching, you can add prefetches to your kernel code using the prefetch() built-ins. Consider using the Intel® VTune™ Amplifier XE for such purpose.

Adding “explicit” or manual prefetches using the prefetch() built-ins can be especially useful in the following cases:

Also take into consideration that: