Tutorial 6. Shared Local Memory and Thread Group

CM also lets kernels use shared local memory (SLM), a region of memory that is shared among a group of threads. On GEN, SLM is carved out of the level-3 (L3) cache and reconfigured to be 16-way banked. All threads in a group that shares SLM are dispatched to the same half-slice. The maximum SLM size is 64KB.

SLM is useful when a group of threads needs to share data. Because it has more banks than L3 and is controlled by the user program, it can be more efficient than the L3 data cache for scattered reads and writes. The following are the typical steps for using SLM and thread grouping in CM.

The code below is extracted from nbody_SLM_release.

Host Program: CreateThreadGroupSpace

One important note: CreateThreadGroupSpace puts GPU thread dispatching into GPGPU mode, which is different from media-walker mode. The thread-dependence settings, which are tied to the media walker, are therefore not available when thread groups are in use.

  // Each CmKernel can be executed by multiple concurrent threads.
  // Calculates the number of threads to spawn on the GPU for this kernel.
  int threads = num_bodies / BODIES_CHUNK;

  // In this case, we want to maximize the group size to share as much data
  // as possible, so we query the maximum group size that the target machine
  // supports.
  size_t size = 4;
  int max_thread_count_per_thread_group = 0;
  cm_result_check(device->GetCaps(
      CAP_USER_DEFINED_THREAD_COUNT_PER_THREAD_GROUP,
      size,
      &max_thread_count_per_thread_group));
  int group_count = (threads + max_thread_count_per_thread_group - 1) /
      max_thread_count_per_thread_group;
  while (threads % group_count != 0) {
    group_count++;
  }

  // Creates a thread group space.
  // This function creates a thread group space specified by the height and
  // width dimensions of the group space, and the height and width dimensions
  // of the thread space within a group. In GPGPU mode, the host program
  // needs to specify the group space and the thread space within each group.
  // This group and thread space information can be subsequently used to
  // execute a kernel in that space later.
  CmThreadGroupSpace *thread_group_space = nullptr;
  cm_result_check(device->CreateThreadGroupSpace(threads / group_count,
                                                 1,
                                                 group_count,
                                                 1,
                                                 thread_group_space));
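
The group-count arithmetic above can be checked on its own, without a GPU. The following is a minimal sketch that mirrors the same logic in plain C++. GroupLayout and compute_group_layout are hypothetical names introduced here for illustration; they are not part of the CM runtime.

```cpp
#include <cassert>

// Hypothetical helper (not a CM runtime API): mirrors the group-count
// logic used in the host code of this tutorial.
struct GroupLayout {
    int group_count;        // number of thread groups
    int threads_per_group;  // threads in each group
};

GroupLayout compute_group_layout(int threads, int max_threads_per_group) {
    assert(threads > 0 && max_threads_per_group > 0);
    // Smallest number of groups that respects the per-group limit
    // (integer ceiling division).
    int group_count =
        (threads + max_threads_per_group - 1) / max_threads_per_group;
    // Grow the group count until it divides the thread count evenly, so
    // that every group has the same number of threads. The loop always
    // terminates because group_count == threads divides evenly.
    while (threads % group_count != 0) {
        group_count++;
    }
    return {group_count, threads / group_count};
}
```

For example, 128 threads with a limit of 64 threads per group gives 2 groups of 64 threads. With a limit of 56, the ceiling division first yields 3 groups, which does not divide 128, so the loop settles on 4 groups of 32 threads. Increasing the group count can only shrink the group size, so the result never exceeds the limit.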

Host Program: EnqueueWithGroup

  // Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
  // function returns immediately without waiting for the GPU to start or
  // finish execution of the task. The runtime will query the HW status. If
  // the hardware is not busy, the runtime will submit the task to the
  // driver/HW; otherwise, the runtime will submit the task to the driver/HW
  // at another time.
  // An event, "sync_event", is created to track the status of the task.
  CmEvent *sync_event = nullptr;
  cm_result_check(cmd_queue->EnqueueWithGroup(task,
                                              sync_event,
                                              thread_group_space));

Kernel Program

Several built-in functions are worth attention in this program: cm_slm_init, cm_slm_alloc, and cm_slm_load.

extern "C" _GENX_MAIN_ void cmNBody(SurfaceIndex INPOS, SurfaceIndex INVEL,
                                    SurfaceIndex OUTPOS, SurfaceIndex OUTVEL,
                                    float deltaTime, float damping,
                                    float softeningSquared, int numBodies) {

    // Only 4K bodies fit in SLM
    // 1. Foreach 4K bodies - For a total of 16K bodies
    // 2.   LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM
    // 3.   Foreach MB (32 bodies here) - For a total of 4 MBs
    // 4.     READ from Memory: Position of thisThreadBodies
    // 6.     Foreach set of 32 bodies In 4K SLM bodies
    // 7.       READ from SLM: Position of 32 bodies
    // 8.       Compute Interaction between thisThreadBodies and the 32
    //            bodies read from SLM; Compute and update force0, force1,
    //            force2 for forces in 3D
    // 9.     READ from Memory: Velocity of thisThreadBodies
    // 10.     Compute New Velocity and New Position of thisThreadBodies
    // 11.     WRITE to Memory: New Velocity of thisThreadBodies
    // 12.     WRITE to Memory: New Position of thisThreadBodies

    cm_slm_init(SLM_SIZE);
    uint bodiesInSLM = cm_slm_alloc(SLM_SIZE);

    gThreadID = cm_linear_global_id();
    force0 = force1 = force2 = 0.0f;

    // 1. Foreach 4K bodies - For a total of 16K bodies
    for (int iSLM = 0; iSLM < 4; iSLM++) {

        // 2. LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM

        cm_slm_load(
            bodiesInSLM,     // slmBuffer   : SLM buffer
            INPOS,           // memSurfIndex: Memory SurfaceIndex
            iSLM * SLM_SIZE, // memOffset   : Byte-Offset in Memory Surface
            SLM_SIZE         // loadSize    : Bytes to be Loaded from Memory
            );

        // Each thread needs to process 4 Macro-Blocks (MB):
        //            One MB = 32 Bodies; Total 2 Groups with 64 threads/Group
        //            => #Bodies/Thread = TotalNumBodies/TotalNumThreads
        //                              = 16384/128 = 128 = 4 MBs
        //   - Depending on the number of threads, the number of MBs per
        //     threads can be changed by just changing this loop-count
        //   - For optimization purpose, if there are enough GRFs we can
        //     process more MBs per iteration of this loop - in that case
        //     need to change the loop-stride accordingly; If all MBs can
        //     be processed in the GRF, we can eliminate this loop

        for (int iMB = 0; iMB < 4; iMB++) {
            cmk_Nbody_ForEachMB_ForEachSLMBlock(
                INPOS, deltaTime, softeningSquared, BODIES_PER_SLM, bodiesInSLM,
                iMB);
        } // end foreach(MB)
    }     // end foreach(SLM block)

    for (int iMB = 0; iMB < 4; iMB++) {
        cmk_Nbody_OutputVelPos_ForEachMB(INPOS, INVEL, OUTPOS, OUTVEL,
                                         deltaTime, damping, iMB);
    } // end foreach(MB)
}
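
To make the loop structure of the kernel easier to follow, here is a CPU-only sketch in plain C++ that emulates the same tiling idea: stage one tile of bodies in a small buffer (standing in for SLM), let every body interact with that tile, then move on to the next tile. It is not the kernel's implementation. The Body layout, the 16-bytes-per-body figure, and the names accumulate, forces_direct and forces_tiled are assumptions made for illustration. The static_asserts only restate the sizes quoted in the kernel comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sizes quoted in the kernel comments. kBytesPerBody is an assumption
// (four floats per position), inferred from 4K bodies filling 64KB of SLM.
constexpr int kTotalBodies = 16384;
constexpr int kBodiesPerSlm = 4096;
constexpr int kBytesPerBody = 16;
constexpr int kTotalThreads = 2 * 64;  // 2 groups, 64 threads per group
constexpr int kBodiesPerMB = 32;
static_assert(kBodiesPerSlm * kBytesPerBody == 64 * 1024, "one tile fills SLM");
static_assert(kTotalBodies / kBodiesPerSlm == 4, "iSLM loop count");
static_assert(kTotalBodies / kTotalThreads / kBodiesPerMB == 4, "iMB loop count");

struct Body { float x, y, z, mass; };  // assumed layout

// Adds the softened gravitational pull of every body in `sources` on `p`
// to force[0..2]. With softening_squared > 0, a body exerts no force on
// itself and the division is always well defined.
void accumulate(const Body &p, const std::vector<Body> &sources,
                float softening_squared, float *force) {
    for (const Body &q : sources) {
        float dx = q.x - p.x, dy = q.y - p.y, dz = q.z - p.z;
        float dist_sq = dx * dx + dy * dy + dz * dz + softening_squared;
        float inv = 1.0f / std::sqrt(dist_sq);
        float s = q.mass * inv * inv * inv;
        force[0] += dx * s;
        force[1] += dy * s;
        force[2] += dz * s;
    }
}

// Reference: every body reads every other body straight from "memory".
std::vector<float> forces_direct(const std::vector<Body> &bodies,
                                 float softening_squared) {
    std::vector<float> force(bodies.size() * 3, 0.0f);
    for (std::size_t i = 0; i < bodies.size(); ++i)
        accumulate(bodies[i], bodies, softening_squared, &force[3 * i]);
    return force;
}

// Tiled: copy one tile into a staging buffer (the role cm_slm_load plays
// in the kernel), then let all bodies interact with that tile only.
std::vector<float> forces_tiled(const std::vector<Body> &bodies,
                                std::size_t tile_size,
                                float softening_squared) {
    assert(tile_size > 0);
    std::vector<float> force(bodies.size() * 3, 0.0f);
    std::vector<Body> tile;  // stand-in for SLM
    for (std::size_t base = 0; base < bodies.size(); base += tile_size) {
        std::size_t end = std::min(base + tile_size, bodies.size());
        tile.assign(bodies.begin() + static_cast<std::ptrdiff_t>(base),
                    bodies.begin() + static_cast<std::ptrdiff_t>(end));
        for (std::size_t i = 0; i < bodies.size(); ++i)
            accumulate(bodies[i], tile, softening_squared, &force[3 * i]);
    }
    return force;
}
```

Both functions visit the source bodies in the same order, so they should produce matching forces. The tiled version only changes where the source positions are read from, which is the point of staging them in SLM on the GPU.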