Tutorial 3. Enqueuing Multiple Kernels¶
You may have noticed that the Enqueue function takes a CmTask, which holds an array of kernels, so a single enqueue can launch multiple kernels.
Enqueuing two independent kernels¶
The following code block is extracted from multi_kernels.
In this example, two kernels are launched independently (with no specific execution order). The linear kernel processes the top half of the image, and the sepia kernel processes the bottom half of the image.
First, create the linear kernel. Notice that the thread count and thread space cover only half of the image.
// Creates the linear kernel.
// Param program: CM Program from which the kernel is created.
// Param "linear": The kernel name which should be no more than 256 bytes
// including the null terminator.
CmKernel *kernel_linear = nullptr;
cm_result_check(device->CreateKernel(program, "linear", kernel_linear));
// Each CmKernel can be executed by multiple concurrent threads.
// Here, for the "linear" kernel, each thread works on a block of 8x6 pixels
// (8 wide, 6 high).
// The thread width is equal to the input image width divided by 8.
// The thread height is equal to the input image height divided by 6. This
// kernel processes only half of the image; therefore, the thread height
// is further divided by two.
int thread_width = width/8;
int thread_height = (height/6)/2;
// Creates a CmThreadSpace object.
// There are two usage models for the thread space. One is to define the
// dependency between threads to run in the GPU. The other is to define a
// thread space where each thread can get a pair of coordinates during
// kernel execution. For this example, we use the latter usage model.
CmThreadSpace *thread_space_linear = nullptr;
cm_result_check(device->CreateThreadSpace(thread_width,
                                          thread_height,
                                          thread_space_linear));
// Associates a thread space to this kernel.
cm_result_check(kernel_linear->AssociateThreadSpace(thread_space_linear));
Second, create the sepia kernel. Notice that its thread count and thread space also cover only half of the image, and that the image height is passed into the sepia kernel, which is modified to process the bottom half of the image.
// Creates the second kernel "sepia".
CmKernel *kernel_sepia = nullptr;
cm_result_check(device->CreateKernel(program, "sepia", kernel_sepia));
// For "sepia" kernel, each thread works on a block of 8x8 pixels.
// The thread width is equal to input image width divided by 8.
// The thread height is equal to input image height divided by 8. For this
// kernel only half of the image is processed; therefore, the thread height
// is divided by two.
thread_width = width/8;
thread_height = (height/8)/2;
// Creates thread space for kernel "sepia".
CmThreadSpace *thread_space_sepia = nullptr;
cm_result_check(device->CreateThreadSpace(thread_width,
                                          thread_height,
                                          thread_space_sepia));
// Associates the thread space to kernel "sepia".
cm_result_check(kernel_sepia->AssociateThreadSpace(thread_space_sepia));
Finally, add both kernels to the task and enqueue it.
// Creates a CmTask object.
// The CmTask object is a container for CmKernel pointers. It is used to
// enqueue the kernels for execution.
CmTask *task = nullptr;
cm_result_check(device->CreateTask(task));
// Adds a CmKernel pointer to CmTask.
// This task has two kernels, "linear" and "sepia".
cm_result_check(task->AddKernel(kernel_linear));
cm_result_check(task->AddKernel(kernel_sepia));
// Creates a task queue.
// The CmQueue is an in-order queue. Tasks get executed according to the
// order they are enqueued. The next task does not start execution until the
// current task finishes.
CmQueue *queue = nullptr;
cm_result_check(device->CreateQueue(queue));
// Launches the task on the GPU. Enqueue is a non-blocking call: the
// function returns immediately without waiting for the GPU to start or
// finish executing the task. The runtime queries the hardware status; if
// the hardware is not busy, the runtime submits the task to the
// driver/hardware immediately, otherwise it submits the task later.
// An event, "sync_event", is created to track the status of the task.
CmEvent *sync_event = nullptr;
cm_result_check(queue->Enqueue(task, sync_event));
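Because Enqueue is non-blocking, the host typically waits on the returned event before reading back results. A sketch of that follow-up, assuming the usual CmEvent wait and CmQueue cleanup calls from the CM runtime (this fragment continues the sample above and is not standalone):

```cpp
// Blocks the host thread until the task finishes on the GPU.
cm_result_check(sync_event->WaitForTaskFinished());

// Destroys the event once it is no longer needed.
cm_result_check(queue->DestroyEvent(sync_event));
```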
Enqueuing two kernels with sync¶
The following code block is extracted from BufferTest_EnqueueWithSync.
To force an execution order among multiple kernels in the same task, you need to add synchronization between them.
// Creates a CmTask object.
// The CmTask object is a container for CmKernel pointers. It is used to
// enqueue the kernels for execution.
CmTask *task = nullptr;
cm_result_check(device->CreateTask(task));
for (int i = 0; i < KERNEL_NUM_PER_TASK; i++) {
    // Associates the thread space to this kernel.
    cm_result_check(kernel[i]->AssociateThreadSpace(thread_space));
    // When a CmBuffer is created by the CmDevice, a SurfaceIndex object is
    // created. This object contains a unique index value that is mapped
    // to the CmBuffer.
    // Uses the output CmBuffer of the previous kernel as the input CmBuffer
    // of this kernel.
    SurfaceIndex *input_surface_idx = nullptr;
    SurfaceIndex *output_surface_idx = nullptr;
    if (i == 0) {
        // The first kernel reads the original input CmBuffer.
        cm_result_check(buffer->GetIndex(input_surface_idx));
    } else {
        // Later kernels read the previous kernel's output CmBuffer.
        cm_result_check(output_surface[i - 1]->GetIndex(input_surface_idx));
    }
    // Gets the output CmBuffer index.
    cm_result_check(output_surface[i]->GetIndex(output_surface_idx));
    // Sets the per-kernel arguments.
    // The input CmBuffer index is the first argument of the kernel;
    // the output CmBuffer index is the second argument.
    cm_result_check(kernel[i]->SetKernelArg(0,
                                            sizeof(SurfaceIndex),
                                            input_surface_idx));
    cm_result_check(kernel[i]->SetKernelArg(1,
                                            sizeof(SurfaceIndex),
                                            output_surface_idx));
    // Adds the CmKernel pointer to the CmTask.
    // This task has KERNEL_NUM_PER_TASK kernels.
    cm_result_check(task->AddKernel(kernel[i]));
    // Inserts a synchronization point after each kernel except the last.
    // The next kernel will only start execution after the previous kernel
    // finishes.
    if (i < (KERNEL_NUM_PER_TASK - 1)) {
        cm_result_check(task->AddSync());
    }
}