Intel® C++ Compiler 18.0 Developer Guide and Reference
This topic only applies when targeting Intel® Graphics Technology.
Intel® Cilk™ Plus is a deprecated feature in the Intel® C++ Compiler 18.0. An alternative for offloading to the processor graphics is planned for a future release. For more information see Migrate Your Application to use OpenMP* or Intel® TBB Instead of Intel® Cilk™ Plus.
A first-level parallel loop nest, marked with _Thread_group, may appear as the parallel loop in functions declared with the attribute __declspec(gfx_kernel).
Only the following constructs are allowed inside the first-level parallel loop nest (a sketch illustrating them follows below):
SLM data declarations with an optional initializer
Assignments to __thread_group_local data
Second-level parallel loop nests
Calls to the thread barrier intrinsic
Serial code (see the definition of serial code and the associated restrictions below)
The chunk size is guaranteed to be 1 for all dimensions, so each thread group executes exactly one iteration of the first-level loop nest.
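For illustration, here is a minimal sketch of a thread-group kernel; the kernel name, data layout, and tile size are assumptions and not taken from the original examples. It shows each kind of construct allowed inside the first-level nest: an SLM declaration with an initializer, assignments to __thread_group_local data, second-level parallel loop nests, a barrier call, and serial code.

// Minimal sketch of a thread-group kernel; names and sizes are assumptions.
#define N  1024
#define TG 16

__declspec(target(gfx_kernel))
void sketch_kernel(float *data)
{
    // First-level parallel loop nest: one iteration per thread group.
    _Cilk_for _Thread_group (int tg = 0; tg < N; tg += TG) {
        // SLM data declaration with an optional initializer.
        __thread_group_local float tile[TG] = {0};

        // Serial code: uses only local variables and parameters.
        int base = tg;

        // Second-level parallel loop nest: assignment to __thread_group_local data.
        _Cilk_for (int i = 0; i < TG; i++)
            tile[i] = data[base + i];

        // Thread barrier intrinsic: wait until the SLM copy is complete.
        _gfx_gpgpu_thread_barrier();

        // Another second-level parallel loop nest using the cached data.
        _Cilk_for (int i = 0; i < TG; i++)
            data[base + i] = 2.0f * tile[i];
    }
}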
Serial code is any code inside a first-level parallel loop nest that is not syntactically included in any second-level parallel loop nest. For example, lines 4-5 in the following code:
01 __declspec(target(gfx_kernel)) void slm_enabled_kernel(int *data, int param) {
02     _Cilk_for _Thread_group (...) {
03         ...
04         int lo = param/2; //serial code
05         int up = param*2; //serial code
06
07         _Cilk_for (int j = lo; j < up; j++) {
08             ...
09         }
10     }
11 }
Serial code of the kind described is executed by the master thread of each thread group. When a parallel construct is encountered, such as the nested _Cilk_for loop that follows the serial code, the master thread splits execution among the other threads in the group.
Here is an excerpt from matrix multiplication code that uses SLM and illustrates the serial code requirement:
01 _Cilk_for _Thread_group (int tg_y = 0; tg_y < Y; tg_y += SLM_TILE_Y) {
02     _Cilk_for _Thread_group (int tg_x = 0; tg_x < X; tg_x += SLM_TILE_X) {
03         // declare "supertiles" of each matrix to be allocated in SLM
04         __thread_group_local float slm_atile[SLM_TILE_Y][SLM_TILE_K];
05         __thread_group_local float slm_btile[SLM_TILE_K][SLM_TILE_X];
06         __thread_group_local float slm_ctile[SLM_TILE_Y][SLM_TILE_X];
07
08         // initialize the result supertile (in parallel)
09         _Cilk_for (int i = 0; i < SLM_TILE_Y; i++)
10             _Cilk_for (int j = 0; j < SLM_TILE_X; j++)
11                 slm_ctile[i][j] = 0.0;
12
13         // calculate the dot product of current A's supertile row and
14         // B's supertile column:
15         for (int super_k = 0; super_k < K; super_k += SLM_TILE_K) {
16             // Parallel execution
17             // cache A's and B's "supertiles" in SLM (in parallel)
18             slm_atile[:][:] = A[tg_y:SLM_TILE_Y][super_k:SLM_TILE_K];
19             slm_btile[:][:] = B[super_k:SLM_TILE_K][tg_x:SLM_TILE_X];
20
21             // all threads wait till copy is done
22             _gfx_gpgpu_thread_barrier();
23
24             // parallel execution
25             // now multiply the supertiles as usual matrices using tiled
26             // matrix multiplication algorithm (in parallel)
27             _Cilk_for (int t_y = 0; t_y < SLM_TILE_Y; t_y += TILE_Y) {
28                 _Cilk_for (int t_x = 0; t_x < SLM_TILE_X; t_x += TILE_X) {
In this code, lines 1-2 are the first-level parallel loop nest, line 15 is the serial code, and lines 18-19 and 27-28 are the second-level parallel loop nests. The serial code is a loop over supertiles that calculates their dot product. This calculation is done by every thread group, and this loop is not parallelized among the threads within a thread group.
Every thread in a group executes the same serial code. Serial code is not allowed to produce results that differ between threads of the same thread group and that could be visible outside the current thread. The restrictions on serial code are:
Only local variables and formal parameters (for async kernels) or #pragma offload parameters (for offload blocks) can be accessed; for example, access to static variables or to __thread_group_local variables is not allowed.
Note that you can only offload perfect loop nests to the processor graphics; this also applies to two-level parallelism, where the first-level nest must be perfect. This implies that the local variables mentioned above are those declared inside the first-level nest.
Function calls are not allowed.
Memory updates, such as those through pointer parameter dereference, are not allowed.
Local variables used in second-level parallel nests but defined outside the second-level parallel loops are treated as firstprivate. If such a variable is live after the loop nest, that is, its value is used after the nest, no updates of the variable are allowed within the loop nest.
The following restrictions apply to second-level parallel loop nests (see the sketch after this list):
A second-level parallel loop nest can be a perfect _Cilk_for loop nest.
The loops within the nest must be perfectly nested.
Second-level parallel loop nests must be textually included in the first-level nest; they cannot reside in a function called from the first-level nest.
At least one second-level parallel loop nest must be present.
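To make these restrictions concrete, the following sketch uses hypothetical names and sizes that are not part of the original documentation; it marks serial-code statements that are allowed and, in comments, some that are not, together with a textually included second-level nest.

// Sketch of the serial code and second-level nest restrictions; names are assumptions.
#define N  1024
#define TG 16

int g_count;                            // static data, declared only to illustrate a violation

__declspec(target(gfx_kernel))
void restrictions_sketch(float *data, int param)
{
    _Cilk_for _Thread_group (int tg = 0; tg < N; tg += TG) {
        int scale = param * 2;          // OK: serial code reading a parameter
        // g_count++;                   // not allowed: access to a static variable
        // data[tg] = 0.0f;             // not allowed: memory update through a
                                        // pointer parameter in serial code
        // some_function();             // not allowed: function call in serial code

        // 'scale' is defined outside the second-level nest and used inside it,
        // so it is treated as firstprivate; it is only read here.
        _Cilk_for (int i = 0; i < TG; i++)   // second-level nest, textually
            data[tg + i] *= scale;           // included in the first-level nest
    }
}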
The following syntax restrictions and semantics apply to thread group local data:
You must declare the variables as local variables, immediately nested in a first-level parallel loop nest.
__thread_group_local data is always mapped to SLM, so the total size of the data cannot exceed the available SLM.
Lifetime:
__thread_group_local data is allocated upon the start of a thread group, immediately before any of the group’s threads start execution, and de-allocated upon thread group end, immediately after the last thread finishes execution.
Initializers are allowed; the initial values are assigned to the SLM data.
Initializers are executed by the master thread only.
Without an initializer, the initial value is undefined.
Variables that can be declared __thread_group_local are limited to scalars, arrays of scalars, and PODs.
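A short sketch of these declaration and initialization rules follows; the kernel name, sizes, and the POD type are assumptions made for illustration, not the guide's own example.

// Sketch of __thread_group_local declarations and initializers; names are assumptions.
#define N  1024
#define TG 16

struct acc_t { float sum; int count; };             // a POD type

__declspec(target(gfx_kernel))
void slm_data_sketch(float *data)
{
    _Cilk_for _Thread_group (int tg = 0; tg < N; tg += TG) {
        // Declarations are immediately nested in the first-level nest.
        __thread_group_local float tile[TG];         // no initializer: initial
                                                     // value is undefined
        __thread_group_local float scale = 2.0f;     // scalar; initializer is
                                                     // executed by the master
                                                     // thread only
        __thread_group_local struct acc_t acc = {0.0f, 0};  // POD with initializer

        _Cilk_for (int i = 0; i < TG; i++)
            tile[i] = data[tg + i];

        _gfx_gpgpu_thread_barrier();                 // SLM data stays allocated
                                                     // until the thread group ends
        _Cilk_for (int i = 0; i < TG; i++)
            data[tg + i] = scale * tile[i] + acc.sum;
    }
}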
Kind | Example | Restrictions
---|---|---
Variable declaration | | OK. Valid SLM data declaration.
Variable declaration | | Invalid SLM data declaration. Must be nested within the thread group _Cilk_for nest.
Pointer declaration | | OK. Declaration of a pointer allocated in SLM.
Pointer declaration | | OK. Declaration of a pointer to SLM-allocated data.
Class object | | Invalid; may be supported in the future with a language extension.
Return type | | OK. Return type is a pointer.
Return type | | OK. Type qualifiers on an rvalue do not make sense and are allowed but ignored.
Structure field | | Declaration is OK. Not usable in some contexts.
Structure field | | Not allowed. The entire variable must be __thread_group_local.
Structure | | OK. Generally it is expected that SLM data is arrays, but any data is allowed.
Parallel loops | | Pragmatically this will be diagnosed as an error if foo is not inlined, but it is OK from a language perspective.
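For a few of the rows above, the snippets below are assumed illustrations rather than the guide's own examples; they contrast a valid, properly nested SLM declaration with declarations that the table marks as invalid.

// Assumed illustrations of selected declaration kinds from the table above.

// Invalid SLM data declaration: not nested within the thread group _Cilk_for nest.
// __thread_group_local float bad_file_scope[256];

__declspec(target(gfx_kernel))
void decl_kinds_sketch(float *data)
{
    _Cilk_for _Thread_group (int tg = 0; tg < 1024; tg += 16) {
        __thread_group_local float tile[16];   // OK: valid SLM data declaration

        // Not allowed: marking only a field of a variable as __thread_group_local;
        // the entire variable must be __thread_group_local.
        // struct { __thread_group_local float f; int n; } partial;

        _Cilk_for (int i = 0; i < 16; i++)     // second-level nest using the SLM data
            tile[i] = data[tg + i];
    }
}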