Intel® C++ Compiler 18.0 Developer Guide and Reference
This topic only applies when targeting Intel® Graphics Technology.
Intel® Cilk™ Plus is a deprecated feature in the Intel® C++ Compiler 18.0. An alternative for offloading to the processor graphics is planned for a future release. For more information, see Migrate Your Application to use OpenMP* or Intel® TBB Instead of Intel® Cilk™ Plus.
The code inside _Cilk_for loops or _Cilk_for loop nests following #pragma offload target(gfx), and in functions qualified with the target(gfx) or target(gfx_kernel) attribute, is compiled for both the target and the CPU. Functions are qualified with these target attributes using __declspec(target(gfx))) (Windows* and Linux*) or __attribute__((target(gfx))) (Linux* only). Using target(gfx_kernel) also produces both host and target versions, but the target version cannot be called from within an offload region. Instead, it must be passed as an argument to the asynchronous offload API, which is discussed in Asynchronous Offloading.
You can place #pragma offload target(gfx) only before a parallel loop, a perfect parallel loop nest, or an Intel® Cilk™ Plus array notation statement. The parallel loop must be expressed using a _Cilk_for loop.
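For example, a minimal sketch (the function and variable names here are illustrative, not part of the product documentation):
// Qualify a function so that it is compiled for both the CPU and the target.
__declspec(target(gfx)) // on Linux*, use __attribute__((target(gfx)))
float scale_value(float v, float f) {
    return v * f;
}

void scale_all(float* data, int n, float f) {
    // The pragma immediately precedes a parallel _Cilk_for loop; the
    // data clauses used here are described below.
    #pragma offload target(gfx) inout(data: length(n))
    _Cilk_for (int i = 0; i < n; i++) {
        data[i] = scale_value(data[i], f);
    }
}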
#pragma offload can contain the following clauses when programming for Intel® Graphics Technology:
target(gfx) – A required clause for heterogeneous execution of code sections offloaded to the target.
if (condition) – The code will be executed on the target only if the condition is true.
in|out|inout|pin(variable_list: length(length_variable_in_elements))
in, out, or inout – The variables are copied between the CPU and the target memory.
pin – The variables are shared between the CPU and the target.
You must include the length clause for pointers. This clause specifies the size of the data to copy to or from the target, or to share with the target, in elements of the type referenced by the pointer. For pointers to arrays, the size is in elements of the referenced array. A usage sketch appears after this list.
signal(address_expression) – Makes the offload asynchronous: The CPU thread that initiates the offload continues execution without waiting for offload completion. address_expression identifies the offload task so that a CPU thread can use #pragma offload target(gfx) wait(address_expression) to wait for this offload task to complete at any convenient point in the program.
wait(address_expression) – Pauses execution of the CPU thread invoking the wait until all offload tasks associated with the given address, via any prior #pragma offload target(gfx) signal constructs, have completed, and subsequently executes the offload block that follows the pragma.
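For example, a minimal sketch combining the if, in, inout, and length clauses (the function and variable names are illustrative; note that length is given in elements, not bytes):
void saxpy(float* x, float* y, int n, float a, bool do_offload) {
    // x is copied to the target before the loop runs; y is copied in
    // before the loop and copied back out after it.
    #pragma offload target(gfx) if (do_offload) \
        in(x: length(n)) inout(y: length(n))
    _Cilk_for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}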
#pragma offload_wait target(gfx) wait (address_expression) has the same effect as #pragma offload target(gfx) wait(address_expression) except that no offload block can follow.
#pragma offload_transfer enables you to transfer data between the host and target without offloading any computation. This pragma supports the same clauses as #pragma offload. You can use signal and wait clauses to organize asynchronous data transfer.
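For example, a minimal sketch of an asynchronous transfer (the function and variable names are illustrative):
void transfer_input(int* samples, int count) {
    // Start copying the buffer to the target; this CPU thread does not block.
    #pragma offload_transfer target(gfx) in(samples: length(count)) signal(samples)

    // ... unrelated CPU work can execute here ...

    // Block until the transfer associated with samples completes.
    #pragma offload_wait target(gfx) wait(samples)
}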
Using pin substantially reduces the cost of offloading: instead of copying data to or from memory accessible by the target, the pin clause shares the same physical memory area between the host and the target, which is much faster. For kernels that perform substantial work, such as O(N²) operations, on a relatively small data set, this optimization is not important. However, pinning makes the operating system lock the pinned memory pages so that they cannot be swapped out, so excessive pinning may degrade overall system performance.
Although by default the compiler builds an application that runs on both the host CPU and target, you can also compile the same source code to run on just the CPU, using the negative form of the [Q]offload compiler option.
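For example, assuming the conventional negative spellings of the [Q]offload option (check the compiler option reference for the exact form in your version):
icl /Qoffload- sample.cpp (Windows*)
icpc -qno-offload sample.cpp (Linux*)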
unsigned parArrayRHist[256][256],
parArrayGHist[256][256], parArrayBHist[256][256];
#pragma offload target(gfx) if (do_offload) \
pin(inputImage: length(imageSize)) \
out(parArrayRHist, parArrayGHist, parArrayBHist)
_Cilk_for (int ichunk = 0; ichunk < chunkCount; ichunk++) {
…
}
In the example above, the generated CPU code and the runtime do the following:
Determine whether the target is available on the system.
If the target is unavailable or do_offload evaluates to false, the for loop executes on the CPU.
Otherwise, the runtime does the following:
Pin the imageSize * sizeof(inputImage[0]) bytes referenced by the pointer inputImage and organize sharing of that memory with the target, without copying data to or from the target memory.
Create the target memory areas for parArrayRHist, parArrayGHist, and parArrayBHist.
Split the iteration space of the for loop into N chunks, where N is less than or equal to chunkCount. The offload runtime chooses the particular value of N based on factors such as the iteration space configuration (for example, bounds and strides) and a maximum value that can be controlled by environment variables, as described later in this document.
Create a task with N target threads, each assigned with its own iteration space chunk.
Enqueue the task for execution on the target.
Wait for completion of the task’s execution on the target.
Copy parArrayRHist, parArrayGHist, and parArrayBHist from the target memory to the CPU memory, thereby ensuring that the results are immediately visible to all CPU threads.
float (* A)[k] = (float (*)[k])matA;
float (* B)[n] = (float (*)[n])matB;
float (* C)[n] = (float (*)[n])matC;
#pragma offload target(gfx) if (do_offload) \
pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))
_Cilk_for (int r = 0; r < m; r += TILE_m) {
_Cilk_for (int c = 0; c < n; c += TILE_n) {
…
}
}
In the example above:
Using perfectly nested _Cilk_for loops allows the compiler to collapse the nested loops. So the iteration space of the offloaded loop nest is two-dimensional, encompassing both the r and c loops, and each target thread is allotted a two-dimensional iteration space chunk for parallel execution.
Although A, B, and C are defined as pointers to arrays, length is specified in elements of the float arrays referred to by the pointers.
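For illustration only, here is one possible body for the tile computation that the example elides; this sketch assumes TILE_m and TILE_n evenly divide m and n:
#pragma offload target(gfx) if (do_offload) \
pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))
_Cilk_for (int r = 0; r < m; r += TILE_m) {
_Cilk_for (int c = 0; c < n; c += TILE_n) {
    // Compute the TILE_m x TILE_n tile of C starting at (r, c).
    for (int i = r; i < r + TILE_m; i++) {
        for (int j = c; j < c + TILE_n; j++) {
            float sum = 0.0f;
            for (int t = 0; t < k; t++) {
                sum += A[i][t] * B[t][j];
            }
            C[i][j] = sum;
        }
    }
}
}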
In this example, initialization of the in1 array is offloaded to the processor graphics while, in parallel, the host initializes in2. Then in1 and in2 are used in the computation of out, which is performed on the host.
To ensure that the initialization of in1 is complete, the omp taskwait pragma appears before computing out.
When you invoke the compiler, you must include the following compiler options to enable offloading to the processor graphics using OpenMP* syntax: /Qopenmp /Qopenmp-offload=gfx (Windows*) or -qopenmp -qopenmp-offload=gfx (Linux*).
int* in1 = (int*)malloc(SIZE * sizeof(int));
int* in2 = (int*)malloc(SIZE * sizeof(int));
int* out = (int*)malloc(SIZE * sizeof(int));
// Offload the initialization of in1 to the target; nowait makes the
// offload asynchronous, so the host thread continues immediately.
#pragma omp target map(tofrom: in1[0:SIZE]) nowait
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
    in1[i] = 1;
}
// Meanwhile, the host initializes in2.
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
    in2[i] = i;
}
// Wait for the asynchronous offload to complete before using in1.
#pragma omp taskwait
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
    out[i] = in1[i] + in2[i];
}
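Assuming the usual compiler driver names (an assumption; substitute your own driver and source file), the example above could be built as:
icl /Qopenmp /Qopenmp-offload=gfx sample.cpp (Windows*)
icpc -qopenmp -qopenmp-offload=gfx sample.cpp (Linux*)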
This example uses code that is roughly equivalent to the OpenMP example above, but it is written using #pragma offload syntax. To ensure that the initialization of in1 is complete, the offload_wait pragma appears before computing out.
int* in1 = (int*)malloc(SIZE * sizeof(int));
int* in2 = (int*)malloc(SIZE * sizeof(int));
int* out = (int*)malloc(SIZE * sizeof(int));
// Asynchronous offload: signal(in1) lets the host thread continue
// without waiting for the target to finish initializing in1.
#pragma offload target(gfx) pin(in1: length(SIZE)) signal(in1)
_Cilk_for (int i = 0; i < SIZE; i++) {
    in1[i] = SIZE - i;
}
// Meanwhile, the host initializes in2.
_Cilk_for (int i = 0; i < SIZE; i++) {
    in2[i] = i;
}
// Block until the offload task associated with in1 completes.
#pragma offload_wait target(gfx) wait(in1)
_Cilk_for (int i = 0; i < SIZE; i++) {
    out[i] = in1[i] + in2[i];
}