Using the OpenCL vector data types is a straightforward way to directly utilize the IntelŽ Architecture vector instruction set. See the Using Vector Data Types section for more information. Consider the following code snippet:
float4 a, b; float4 c = a + b;
After compilation, it resembles the following C snippet in intrinsics:
__m128 a, b; __m128 c = _mm_add_ps(a, b);
Or in assembly:
movaps xmm0, [a] addps xmm0, [b] movaps [c], xmm0
However, in contrast to the code in intrinsics, an OpenCL kernel that
uses float4
data type, transparently benefits from IntelŽ
Advanced Vector Extensions (IntelŽ AVX) if the compiler promotes float4
to float8
. The vectorization module can pack work items automatically,
though it might be sometimes less efficient than manual packing.
If the native size for your kernel requires less than 128 bits and you want to benefit from the explicit vectorization, consider packing work items together manually.
For example, your kernel uses the float2
vector type. It
receives (x
, y
) float coordinates, and shifts
them by (dx
, dy
):
__kernel void shift_by(__global float2* coords, __global float2* deltas) { int tid = get_global_id(0); coords[tid] += deltas[tid]; }
To increase the kernel performance, you can manually pack pairs of work items:
//Assuming the target is IntelŽ AVX enabled CPU __kernel __attribute__((vec_type_hint(float8))) void shift_by(__global float2* coords, __global float2* deltas) { int tid = get_global_id(0); float8 my_coords = (float8)(coords[tid], coords[tid + 1], coords[tid + 2], coords[tid + 3]); float8 my_deltas = (float8)(deltas[tid], deltas[tid + 1], deltas[tid + 2] , deltas[tid + 3]); my_coords += my_deltas; vstore8(my_coords, tid, (__global float*)coords); }
Every work item in this kernel does four times as much job as a work item in the previous kernel. Consequently, they require only one fourth of invocations, reducing the run-time overheads. However, when you use manual packing, you must also change the host code accordingly.
For vectors of 32-bit data types, for example, int4
, int8
,
float4,
and float8
data types use explicit vectorization
to improve the performance. Other data types (for example, char3
)
may cause a behind-the-scene upcast of the input data, which has negative
impact on performance.
For best performance for a given data type, the vector width should
match the underlying SIMD width. This value differs for different architectures.
For example, consider querying the recommended vector width using clGetDeviceInfo
with CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT
parameter. You
get vector width of four for the 2nd Generation IntelŽ Core processors,
but vector width of eight for higher versions of processors. Use int8
so that vector width fits both architectures. Similarly for floating point
data types, you can use float8
data to cover many potential
architectures.
NOTE: Using scalar data types such as
int
or float
is often the most scalable way
to help the compiler do right vectorization job for the specific SIMD
architecture underneath.
You can target to a specific Intel Architecture processor using a conditional
code with an OpenCL C predefined macro __INTEL_OPENCL_CPU_<CPUSIGN>.
The macro tunes the kernel for a specific CPU device microarchitecture.
<CPUSIGN>
is the CPU signature of a device.
You can specify one of the following values for this macro:
__INTEL_OPENCL_CPU_SKL__
- IntelŽ microarchitecture
code name Skylake__INTEL_OPENCL_CPU_SKX__
- IntelŽ microarchitecture
code name Skylake on Intel XeonŽ processor family__INTEL_OPENCL_CPU_BDW__
- IntelŽ microarchitecture
code name Broadwell__INTEL_OPENCL_CPU_BDW_XEON__
- IntelŽ microarchitecture
code name Broadwell on Intel XeonŽ processor family__INTEL_OPENCL_CPU_HSW__
- IntelŽ microarchitecture
code name Haswell__INTEL_OPENCL_CPU_HSW_XEON__
- IntelŽ microarchitecture
code name Haswell on Intel XeonŽ processor family__INTEL_OPENCL_CPU_IVB__
- IntelŽ microarchitecture
code name Ivy Bridge__INTEL_OPENCL_CPU_IVB_XEON__
- IntelŽ microarchitecture
code name Ivy Bridge on Intel XeonŽ processor family__INTEL_OPENCL_CPU_SNB__
- IntelŽ microarchitecture
code name Sandy Bridge__INTEL_OPENCL_CPU_SNB_XEON__
- IntelŽ microarchitecture
code name Sandy Bridge on Intel XeonŽ processor family__INTEL_OPENCL_CPU_WST__
- IntelŽ microarchitecture
code name Westmere__INTEL_OPENCL_CPU_WST_XEON__
- IntelŽ microarchitecture
code name Westmere on Intel XeonŽ processor family__INTEL_OPENCL_CPU_UNKNOWN__
- Unknown microarchitectureTo tune performance for your target CPU, you can use this macro with
intel_vec_len_hint
extension. For example:
// Kernel side. // Force vectorization with to 8 on BDW. // Runtime defines a macro corresponding to the device CPU signature. #ifdef __INTEL_OPENCL_CPU_BDW__ __attribute__((intel_vec_len_hint(8))) #endif //BDW __kernel void memcpy1(__global float* src, __global float* dst) { size_t gid = get_global_id(0); dst[gid] = src[gid]; }
For more information about intel_vec_len_hint
attribute
extension, refer to Vectorizer Knobs.