Compiler inserts prefetch instructions into the resulting assembly based on kernel source code analysis. This capability of the compiler is called automatic prefetching. Automatic prefetching works for memory access for the variables in the global and constant address spaces. It is limited to the following scenarios of well-detected memory access pattern, when it is generally beneficial to prefetch data in advance:
id
, id+1
, id+2
, and so on work-items in the following example:
__kernel void my_kernel0 (__global int* a, __global int* b, __global int* c) { size_t id = get_global_id(0); a[id] = b[id] + c[id]; }
The prefetch distance is computed by the compiler, based on the amount of work inside the loop. To benefit from the automatic compiler-generated prefetches, consider the following:
Consider the following examples of automatic software prefetching:
__kernel void my_kernel1 (__global int* a, __global int* b, __global int* c, __global int* d, __global int* e, __global int* f, int m) { size_t id = get_global_id(0); a[id] = b[id] + c[id]; /* a, b and c can be prefetched (if level 1 or more aggressive prefetching level is enabled, below)*/ if (a[id] > 0) //condition doesn’t block the prefetching for the next lines { c[id] = d[id]; /* d can be prefetched (if level 2 or more aggressive prefetching level is enabled, below)*/ d[id] = m; // m is not prefetched } e[a[id]] = f[b[id]]; // e and f can be prefetched (if level 3 or more aggressive prefetching level is enabled, below) }
__kernel void my_kernel2 (__global int* a, __global int* b, __global int* c, __global int* e, __global int* f, int m) { size_t id = get_global_id(0); for (int i=0; i < m; i++) { sum += a[id]* b[i] + c[i]; /* b and c can be prefetched (if level 1 or more aggressive prefetching level is enabled, below), a is not prefetched*/ sum += e[id] * f[b[i]]; // e and f are not prefetched }
d[id] = sum;
}
Automatic prefetching availability differs according to the level:
The default level is two, which provides the best performance in most cases. In some cases, level three may provide performance gains. In other cases, for example, when the code is not memory-bound but computational intensive, it might be beneficial to disable auto prefetching by setting it to level zero, or restricting it to the obvious cases by setting it to level one. See the above examples for accesses and approximate associated level.
Auto prefetching level is set for each OpenCL program separately by use of the options
argument of the clBuildProgram()
API call. For example, the following call disables generating auto-prefetching for the program pointed by p_program
:
err = clBuildProgram(p_program, 0, NULL, "-auto-prefetch-level=0”,…);
If you identify the accesses that suffer from cache misses despite the automatic prefetching, you can add prefetches to your kernel code using the prefetch()
built-ins. Consider using the Intel® VTune™ Amplifier XE for such purpose.
Adding “explicit” or manual prefetches using the prefetch()
built-ins can be especially useful in the following cases:
x = a[b[ (int)get_global_id(0)]]in this case you can prefetch
a[b[get_global_id(0)+N]]
, where N
is the number of work-items by which you want to prefetch ahead.if (f[i] != 0) x = a[i];you can again prefetch N work-items ahead without any condition:
prefetch (a[i+N, 4)
Also take into consideration that: