Mapping Basic OpenCL™ Concepts to the Intel® Many Integrated Core Architecture

The execution order of work-items within a work-group, as well as the execution order of work-groups, is implementation-specific. When launching a kernel for execution, the host code defines the grid dimensions, that is, the global work size. The host code can also define how the grid is partitioned into work-groups, or leave that choice to the implementation.
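As a rough illustration of how the global work size and the work-group size relate, the sketch below computes the number of work-groups per dimension. The helper name `work_group_grid` and the launch sizes are hypothetical, chosen only for this example. The divisibility check reflects the OpenCL 1.2 rule that an explicitly specified work-group size must evenly divide the global work size.

```python
from math import prod

def work_group_grid(global_size, local_size):
    """Return the number of work-groups along each dimension.

    Assumes, as OpenCL 1.2 requires, that each global dimension is
    evenly divisible by the matching work-group (local) dimension.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes must have the same rank")
    groups = []
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError(f"global size {g} is not divisible by local size {l}")
        groups.append(g // l)
    return tuple(groups)

# Hypothetical 2D launch: 1024 x 512 work-items in 16 x 16 work-groups.
groups = work_group_grid((1024, 512), (16, 16))
total_groups = prod(groups)
print(groups, total_groups)  # (64, 32) 2048
```

In this example the implementation is free to run the 2048 work-groups in any order.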

Intel® Many Integrated Core Architecture (Intel® MIC Architecture) combines many cores onto a single chip. Each core runs four hardware threads. Still, all the cores and threads of one coprocessor constitute a single OpenCL™ device. The individual hardware threads are the OpenCL compute units.

The basic data parallelism of the OpenCL standard enables a kernel to execute concurrently on multiple work-items. The Intel Xeon Phi™ coprocessor, based on the Intel MIC Architecture, provides performance acceleration through a Single Instruction Multiple Data (SIMD) instruction set. Benefiting from automatic vectorization, which processes multiple work-items with SIMD instructions, is key to achieving good performance at the work-group level.

When using SIMD instructions, vector registers store a group of data elements of the same data type, such as float or int. The number of data elements that fit in one register depends on the data type width. For example, the Intel® Xeon® processor (formerly known as Intel® processor code name Skylake) offers a vector register width of 512 bits. Each vector register (zmm) can store sixteen float values, eight double values, or sixteen 32-bit integers, and these are the most natural data types to work with on the Intel Xeon processor. Smaller data types are still processed 16 elements at a time, with some conversions involved.
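The element counts above follow directly from dividing the register width by the element width. A minimal sketch of that arithmetic, where the `lanes` helper and the table of type widths are illustrative names rather than part of any SDK:

```python
VECTOR_WIDTH_BITS = 512  # zmm register width stated in the text

# Element widths in bits for the OpenCL C scalar types mentioned above.
ELEMENT_BITS = {"float": 32, "int": 32, "double": 64}

def lanes(type_name, register_bits=VECTOR_WIDTH_BITS):
    """Number of elements of the given type that fit in one vector register."""
    return register_bits // ELEMENT_BITS[type_name]

for name in ELEMENT_BITS:
    print(name, lanes(name))
# float 16
# int 16
# double 8
```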

To utilize the wide vector registers, the Intel® SDK for OpenCL™ Applications contains an implicit vectorization module, which packs several work-items (from the same work-group) to run simultaneously. Depending on the kernel code, this operation might have some limitations.
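The effect of packing can be pictured with a small conceptual model. The sketch below is not how the vectorization module is implemented. It only assumes a packet of sixteen 32-bit lanes and a hypothetical element-wise addition kernel, and shows that a packed run produces the same results as a per-work-item run while taking fewer steps.

```python
SIMD_WIDTH = 16  # assumed packet size: sixteen 32-bit lanes in a 512-bit register

def scalar_kernel(gid, a, b, out):
    """What the programmer writes: one work-item handles one element."""
    out[gid] = a[gid] + b[gid]

def run_scalar(local_size, a, b):
    """Execute the kernel once per work-item of a single work-group."""
    out = [0.0] * local_size
    for gid in range(local_size):
        scalar_kernel(gid, a, b, out)
    return out

def run_packed(local_size, a, b, width=SIMD_WIDTH):
    """Conceptual model of the packed kernel: each step covers `width` work-items."""
    out = [0.0] * local_size
    steps = 0
    for base in range(0, local_size, width):
        end = min(base + width, local_size)
        # The slice stands in for one SIMD add over adjacent work-items.
        out[base:end] = [x + y for x, y in zip(a[base:end], b[base:end])]
        steps += 1
    return out, steps

# Hypothetical work-group of 64 work-items.
a = [float(i) for i in range(64)]
b = [2.0 * i for i in range(64)]
packed, steps = run_packed(64, a, b)
print(packed == run_scalar(64, a, b), steps)  # True 4
```

A work-group size that is a multiple of the packet width keeps every packet full in this model.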

A work-group is the finest granularity for thread-level parallelism. Different threads pick up different work-groups. Thus, the amount of computation per work-group, coupled with the right work-group size and the resulting number of work-groups available for parallel execution, are critical factors in achieving good scalability on the Intel Xeon processor.
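To see why the number of work-groups matters, consider a simplified estimate of how evenly work-groups spread across hardware threads. The sketch assumes equal-cost work-groups, one work-group per thread at a time, and a hypothetical count of 32 hardware threads. Real scheduling is implementation-specific, so treat the numbers as illustrative only.

```python
from math import ceil

def thread_utilization(num_work_groups, num_threads):
    """Fraction of thread time spent busy under the simplified model.

    Assumes equal-cost work-groups and that each hardware thread
    processes one work-group at a time.
    """
    rounds = ceil(num_work_groups / num_threads)
    return num_work_groups / (rounds * num_threads)

NUM_THREADS = 32  # hypothetical hardware thread count

for groups in (8, 40, 2048):
    print(groups, thread_utilization(groups, NUM_THREADS))
# 8 0.25
# 40 0.625
# 2048 1.0
```

With too few work-groups most threads stay idle, and a count slightly above the thread count leaves a poorly filled final round. Many work-groups relative to the thread count smooth this out, provided each work-group still carries enough computation to offset scheduling overhead.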

See Also

Using Data Parallelism
Developer Guide for Intel® SDK for OpenCL™ Applications
OpenCL™ 1.2 Specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf