Tutorial 3. Enqueuing Multiple Kernels

You may have noticed that the Enqueue function takes a task, which holds an array of kernels. This means a single Enqueue call can launch multiple kernels.

Enqueuing two independent kernels

The following code block is extracted from multi_kernels.

In this example, two kernels are launched independently, with no specific execution order between them. The linear kernel processes the top half of the image, and the sepia kernel processes the bottom half.

First, create the linear kernel. Notice that the thread count and thread space cover only half of the image.

  // Creates the linear kernel.
  // Param program: CM Program from which the kernel is created.
  // Param "linear": The kernel name, which should be no more than 256 bytes
  // including the null terminator.
  CmKernel *kernel_linear = nullptr;
  cm_result_check(device->CreateKernel(program, "linear", kernel_linear));

  // Each CmKernel can be executed by multiple concurrent threads.
  // Here, for "linear" kernel, each thread works on a block of 6x8 pixels.
  // The thread width is equal to input image width divided by 8.
  // The thread height is equal to input image height divided by 6. For this
  // kernel only half of the image is processed; therefore, the thread height
  // is divided by two.
  int thread_width  = width/8;
  int thread_height = (height/6)/2;

  // Creates a CmThreadSpace object.
  // There are two usage models for the thread space. One is to define the
  // dependency between threads to run in the GPU. The other is to define a
  // thread space where each thread can get a pair of coordinates during
  // kernel execution. For this example, we use the latter usage model.
  CmThreadSpace *thread_space_linear = nullptr;
  cm_result_check(device->CreateThreadSpace(thread_width,
                                            thread_height,
                                            thread_space_linear));

  // Associates a thread space to this kernel.
  cm_result_check(kernel_linear->AssociateThreadSpace(thread_space_linear));
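
The thread-space arithmetic above can be checked in isolation. The following is a minimal sketch that does not use the CM runtime: `ThreadSpaceDims`, `HalfImageThreadSpace`, and `HalfImageTilesExactly` are hypothetical names introduced here for illustration, and the 640x480 image size in the usage note is an assumed example.

```cpp
#include <cassert>

// Hypothetical helper type (not a CM runtime type): the size of a 2D thread
// space, measured in threads.
struct ThreadSpaceDims {
  int width;
  int height;
};

// Mirrors the arithmetic in the listing above: each thread handles a block of
// block_width x block_height pixels, and only half of the image rows are
// covered, so the thread height is divided by two.
inline ThreadSpaceDims HalfImageThreadSpace(int image_width, int image_height,
                                            int block_width, int block_height) {
  assert(block_width > 0 && block_height > 0);
  ThreadSpaceDims dims;
  dims.width = image_width / block_width;
  dims.height = (image_height / block_height) / 2;
  return dims;
}

// Returns true when half of the image splits into whole blocks, i.e. the
// integer divisions above leave no pixels uncovered.
inline bool HalfImageTilesExactly(int image_width, int image_height,
                                  int block_width, int block_height) {
  return image_width % block_width == 0 &&
         image_height % 2 == 0 &&
         (image_height / 2) % block_height == 0;
}
```

For an assumed 640x480 image, `HalfImageThreadSpace(640, 480, 8, 6)` gives an 80x40 thread space for the linear kernel. Because the divisions are integer divisions, an image whose half-height is not a multiple of the block height would leave some rows unprocessed; a check like `HalfImageTilesExactly` is one way to guard against that.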

Second, create the sepia kernel. Notice that its thread count and thread space also cover only half of the image. The image height is passed into the sepia kernel, which is modified to process the bottom half of the image.

  // Creates the second kernel "sepia".
  CmKernel *kernel_sepia = nullptr;
  cm_result_check(device->CreateKernel(program, "sepia" , kernel_sepia));

  // For "sepia" kernel, each thread works on a block of 8x8 pixels.
  // The thread width is equal to input image width divided by 8.
  // The thread height is equal to input image height divided by 8. For this
  // kernel only half of the image is processed; therefore, the thread height
  // is divided by two.
  thread_width = width/8;
  thread_height = (height/8)/2;

  // Creates thread space for kernel "sepia".
  CmThreadSpace *thread_space_sepia = nullptr;
  cm_result_check(device->CreateThreadSpace(thread_width,
                                            thread_height,
                                            thread_space_sepia));

  // Associates the thread space to kernel "sepia".
  cm_result_check(kernel_sepia->AssociateThreadSpace(thread_space_sepia));
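
The listing does not show how the sepia kernel uses the image height. One plausible scheme, given here as a host-side sketch and as an assumption rather than the sample's actual kernel code, is to offset each thread's starting row by half of the image height. `SepiaBlockFirstRow` and `kSepiaBlockHeight` are hypothetical names.

```cpp
#include <cassert>

// Block height handled by one "sepia" thread, as stated in the listing above.
constexpr int kSepiaBlockHeight = 8;

// Assumed mapping (not taken from the sample's kernel source): thread row
// thread_y of the sepia thread space starts at this pixel row, so that thread
// row 0 begins exactly where the top half of the image ends.
inline int SepiaBlockFirstRow(int image_height, int thread_y) {
  assert(image_height > 0 && thread_y >= 0);
  return image_height / 2 + thread_y * kSepiaBlockHeight;
}
```

With an assumed image height of 480, the sepia thread space has (480/8)/2 = 30 rows. Under this mapping, thread row 0 starts at pixel row 240 and thread row 29 starts at pixel row 472, so its 8-row block ends at the last image row.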

Finally, add both kernels to the task, which serves as the kernel array, and enqueue it.

  // Creates a CmTask object.
  // The CmTask object is a container for CmKernel pointers. It is used to
  // enqueue the kernels for execution.
  CmTask *task = nullptr;
  cm_result_check(device->CreateTask(task));

  // Adds a CmKernel pointer to CmTask.
  // This task has two kernels, "linear" and "sepia".
  cm_result_check(task->AddKernel(kernel_linear));
  cm_result_check(task->AddKernel(kernel_sepia));

  // Creates a task queue.
  // The CmQueue is an in-order queue. Tasks get executed according to the
  // order they are enqueued. The next task does not start execution until the
  // current task finishes.
  CmQueue *queue = nullptr;
  cm_result_check(device->CreateQueue(queue));

  // Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
  // function returns immediately without waiting for the GPU to start or
  // finish execution of the task. The runtime queries the hardware status:
  // if the hardware is not busy, the runtime submits the task to the
  // driver/HW right away; otherwise, it submits the task at a later time.
  // An event, "sync_event", is created to track the status of the task.
  CmEvent *sync_event = nullptr;
  cm_result_check(queue->Enqueue(task, sync_event));

Enqueuing two kernels with sync

The following code block is extracted from BufferTest_EnqueueWithSync.

To force an execution order among the kernels in the kernel array, add a synchronization point between them.

    // Creates a CmTask object.
    // The CmTask object is a container for CmKernel pointers. It is used to
    // enqueue the kernels for execution.
    CmTask *task = nullptr;
    cm_result_check(device->CreateTask(task));

    for (int i = 0; i < KERNEL_NUM_PER_TASK; i++) {
        // Associates a thread space to this kernel.
        cm_result_check(kernel[i]->AssociateThreadSpace(thread_space));

        // When a CmBuffer is created by the CmDevice, a SurfaceIndex object is
        // also created. This object contains a unique index value that is
        // mapped to the CmBuffer.
        // The first kernel reads from the original input CmBuffer; every later
        // kernel uses the output CmBuffer of the previous kernel as its input.
        SurfaceIndex *input_surface_idx = nullptr;
        if (i == 0) {
            cm_result_check(buffer->GetIndex(input_surface_idx));
        } else {
            cm_result_check(output_surface[i - 1]->GetIndex(input_surface_idx));
        }

        // Every kernel writes to its own output CmBuffer.
        SurfaceIndex *output_surface_idx = nullptr;
        cm_result_check(output_surface[i]->GetIndex(output_surface_idx));

        // Sets a per kernel argument.
        // Sets the input CmBuffer index as the first argument of the kernel.
        // Sets the output CmBuffer index as the second argument of the kernel.
        cm_result_check(kernel[i]->SetKernelArg(0,
                                                sizeof(SurfaceIndex),
                                                input_surface_idx));
        cm_result_check(kernel[i]->SetKernelArg(1,
                                                sizeof(SurfaceIndex),
                                                output_surface_idx));

        // Adds a CmKernel pointer to CmTask.
        // This task has 16 kernels.
        cm_result_check(task->AddKernel(kernel[i]));

        // Inserts a synchronization point between two kernels (except after
        // the last one).
        // Each kernel will only be executed after the previous kernel
        // finishes execution.
        if (i < (KERNEL_NUM_PER_TASK - 1)) {
            cm_result_check(task->AddSync());
        }
    }
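
The effect of the sync points can be modeled on the host without the CM runtime. In the sketch below, `TaskModel`, `Stages`, and `BuildChainedTask` are hypothetical names, not runtime APIs. The model assumes that kernels with no sync point between them may run in any order, while a sync point forces every kernel before it to finish first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a task (not the runtime's CmTask): an ordered
// list of kernel ids, each flagged if a sync point follows it.
struct TaskModel {
  struct Entry {
    int kernel_id;
    bool sync_after;
  };
  std::vector<Entry> entries;

  void AddKernel(int kernel_id) { entries.push_back(Entry{kernel_id, false}); }

  // Marks a sync point after the most recently added kernel.
  void AddSync() {
    assert(!entries.empty());
    entries.back().sync_after = true;
  }
};

// Splits the task into stages at each sync point. In this model, kernels in
// the same stage have no ordering guarantee; stages run one after another.
inline std::vector<std::vector<int>> Stages(const TaskModel &task) {
  std::vector<std::vector<int>> stages;
  std::vector<int> current;
  for (std::size_t i = 0; i < task.entries.size(); ++i) {
    current.push_back(task.entries[i].kernel_id);
    if (task.entries[i].sync_after) {
      stages.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) {
    stages.push_back(current);
  }
  return stages;
}

// Mirrors the loop in the listing above: add each kernel, then a sync point
// after every kernel except the last.
inline TaskModel BuildChainedTask(int kernel_count) {
  TaskModel task;
  for (int i = 0; i < kernel_count; ++i) {
    task.AddKernel(i);
    if (i < kernel_count - 1) {
      task.AddSync();
    }
  }
  return task;
}
```

Under this model, `Stages(BuildChainedTask(16))` yields 16 single-kernel stages, so each kernel runs only after its predecessor. A task built with two `AddKernel` calls and no `AddSync`, like the linear and sepia task earlier, yields one stage that contains both kernels.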