Tutorial 6. Shared Local Memory and Thread Group¶
CM also allows users to use shared local memory (SLM) that can be shared among a group of threads. On GEN, SLM is carved out of the level-3 cache and reconfigured to be 16-way banked. A group of threads that share SLM is dispatched to the same half-slice. The maximum size for SLM is 64KB.
SLM is useful when you want data sharing among a group of threads. Because it has more banks than L3 and is controlled by the user program, it can be more efficient than the L3 data cache for scattered reads and writes. The following are the typical steps for using SLM and thread grouping in CM.
The code below is extracted from nbody_SLM_release.
Host Program: CreateThreadGroupSpace¶
One important note: CreateThreadGroupSpace puts GPU thread dispatching into GPGPU mode, which is different from media-walker mode. Therefore the thread-dependency settings, which are associated with the media walker, are not available when thread groups are in use.
// Each CmKernel can be executed by multiple concurrent threads.
// Calculates the number of threads to spawn on the GPU for this kernel.
int threads = num_bodies / BODIES_CHUNK;
// In this case, we want to maximize the group size to get the most
// data-share, so we need to query the maximum group size that target
// machine can support.
size_t size = 4;
int max_thread_count_per_thread_group = 0;
cm_result_check(device->GetCaps(
    CAP_USER_DEFINED_THREAD_COUNT_PER_THREAD_GROUP,
    size,
    &max_thread_count_per_thread_group));
int group_count = (threads + max_thread_count_per_thread_group - 1) /
                  max_thread_count_per_thread_group;
while (threads % group_count != 0) {
  group_count++;
}
// Creates a thread group space.
// This function creates a thread group space specified by the height and
// width dimensions of the group space, and the height and width dimensions
// of the thread space within a group. In the GPGPU mode, the host program
// needs to specify the group space and the thread space within each group.
// This group and thread space information can be subsequently used to
// execute a kernel in that space later.
CmThreadGroupSpace *thread_group_space = nullptr;
cm_result_check(device->CreateThreadGroupSpace(threads / group_count,
                                               1,
                                               group_count,
                                               1,
                                               thread_group_space));
Host Program: EnqueueWithGroup¶
// Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
// function returns immediately without waiting for the GPU to start or
// finish execution of the task. The runtime will query the HW status. If
// the hardware is not busy, the runtime will submit the task to the
// driver/HW; otherwise, the runtime will submit the task to the driver/HW
// at another time.
// An event, "sync_event", is created to track the status of the task.
CmEvent *sync_event = nullptr;
cm_result_check(cmd_queue->EnqueueWithGroup(task,
                                            sync_event,
                                            thread_group_space));
Kernel Program¶
Several built-in functions worth attention in this program are cm_slm_init, cm_slm_alloc, and cm_slm_load.
extern "C" _GENX_MAIN_ void cmNBody(SurfaceIndex INPOS, SurfaceIndex INVEL,
                                    SurfaceIndex OUTPOS, SurfaceIndex OUTVEL,
                                    float deltaTime, float damping,
                                    float softeningSquared, int numBodies) {
  // Only 4K bodies fit in SLM
  // 1. Foreach 4K bodies - For a total of 16K bodies
  // 2. LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM
  // 3. Foreach MB (32 bodies here) - For a total of 4 MBs
  // 4. READ from Memory: Position of thisThreadBodies
  // 6. Foreach set of 32 bodies in 4K SLM bodies
  // 7. READ from SLM: Position of 32 bodies
  // 8. Compute Interaction between thisThreadBodies and the 32
  //    bodies read from SLM; Compute and update force0, force1,
  //    force2 for forces in 3D
  // 9. READ from Memory: Velocity of thisThreadBodies
  // 10. Compute New Velocity and New Position of thisThreadBodies
  // 11. WRITE to Memory: New Velocity of thisThreadBodies
  // 12. WRITE to Memory: New Position of thisThreadBodies
  cm_slm_init(SLM_SIZE);
  uint bodiesInSLM = cm_slm_alloc(SLM_SIZE);
  gThreadID = cm_linear_global_id();
  force0 = force1 = force2 = 0.0f;
  // 1. Foreach 4K bodies - For a total of 16K bodies
  for (int iSLM = 0; iSLM < 4; iSLM++) {
    // 2. LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM
    cm_slm_load(
        bodiesInSLM,      // slmBuffer   : SLM buffer
        INPOS,            // memSurfIndex: Memory SurfaceIndex
        iSLM * SLM_SIZE,  // memOffset   : Byte-Offset in Memory Surface
        SLM_SIZE          // loadSize    : Bytes to be Loaded from Memory
    );
    // Each thread needs to process 4 Macro-Blocks (MB):
    // One MB = 32 Bodies; Total 2 Groups with 64 threads/Group
    // => #Bodies/Thread = TotalNumBodies/TotalNumThreads
    //                   = 16384/128 = 128 = 4 MBs
    // - Depending on the number of threads, the number of MBs per
    //   thread can be changed by just changing this loop-count
    // - For optimization purposes, if there are enough GRFs we can
    //   process more MBs per iteration of this loop - in that case we
    //   need to change the loop-stride accordingly; if all MBs can
    //   be processed in the GRF, we can eliminate this loop
    for (int iMB = 0; iMB < 4; iMB++) {
      cmk_Nbody_ForEachMB_ForEachSLMBlock(
          INPOS, deltaTime, softeningSquared, BODIES_PER_SLM, bodiesInSLM,
          iMB);
    } // end foreach(MB)
  } // end foreach(SLM block)
  for (int iMB = 0; iMB < 4; iMB++) {
    cmk_Nbody_OutputVelPos_ForEachMB(INPOS, INVEL, OUTPOS, OUTVEL,
                                     deltaTime, damping, iMB);
  } // end foreach(MB)
}