Intel® SDK for OpenCL™ Applications includes an automatic vectorization module as part of the OpenCL program build process. Depending on the kernel code, this operation might have some limitations. When it is beneficial performance-wise, the module automatically packs adjacent work-items (from dimension zero of the ND-range) and executes them with SIMD instructions.
When using SIMD instructions, vector registers store a group of data elements of the same data type, such as float or int. The number of data elements that fit in one register depends on the data type width. For example, the Intel® Xeon® processor (formerly code-named Skylake) offers a vector register width of 512 bits. Each vector register (zmm) can store sixteen float values, eight double values, or sixteen 32-bit integers, which makes these the most natural data types to work with on the Intel Xeon processor. Smaller data types are also processed 16 elements at a time, with some conversions.
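The element counts above follow directly from dividing the register width by the data type width. A minimal Python sketch of this arithmetic (the function name is illustrative, not part of any Intel API):

```python
def elements_per_register(register_bits, element_bits):
    # How many data elements of a given width fit in one vector register.
    return register_bits // element_bits

# 512-bit zmm register on the Intel Xeon processor:
print(elements_per_register(512, 32))  # float or 32-bit int -> 16
print(elements_per_register(512, 64))  # double -> 8
```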
A work group is the finest granularity for thread-level parallelism: different threads pick up different work groups. Thus, the amount of computation per work group, coupled with the right work-group size and the resulting number of work groups available for parallel execution, is a critical factor in achieving good scalability on the Intel Xeon processor.
The vectorization module enables you to benefit from vector units without writing explicit vector code. Nor do you need to write for loops within kernels to benefit from vectorization. For better results, process a single data element per work-item in the kernel and let the vectorization module take care of the rest. To get more performance gains from vectorization, make your OpenCL code as simple as possible.
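A hypothetical OpenCL C kernel written in this recommended style (the kernel name, arguments, and computation are illustrative; compiling and running it requires an OpenCL runtime):

```
// Each work-item processes exactly one element; no explicit vector
// code and no loops. The vectorization module packs adjacent
// work-items from dimension zero into SIMD instructions.
__kernel void scale_add(__global const float *a,
                        __global const float *b,
                        __global float *out,
                        float alpha)
{
    size_t i = get_global_id(0);   // dimension zero is the vectorized one
    out[i] = alpha * a[i] + b[i];  // simple, branch-free body
}
```

Because the body is short, operates on float, and contains no control flow, it matches the cases where the module works best, as described below.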
The vectorization module works best for kernels that operate on elements of the float (or double) or int data types. The performance benefit might be lower for kernels that include complicated control flow.
The vectorization module packs work-items along dimension zero of the ND-range. Consider the following code example:
    __kernel foo(…)
        for (int i = 0; i < get_local_size(2); i++)
            for (int j = 0; j < get_local_size(1); j++)
                for (int k = 0; k < get_local_size(0); k++)
                    Kernel_Body;
After vectorization, the code example of the work group looping over work items appears as follows:
    __kernel foo(…)
        for (int i = 0; i < get_local_size(2); i++)
            for (int j = 0; j < get_local_size(1); j++)
                for (int k = 0; k < get_local_size(0); k += SIMD_WIDTH)
                    VECTORIZED_Kernel_Body;
Note also that dimension zero forms the innermost loop and is the one that is vectorized. For more information, refer to the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and Autovectorization in Intel® SDK for OpenCL™ Applications version 1.5.