To make maximum use of the CPU vector units, consider using vector data types in your kernel code as a more involved performance alternative to the automatic (compiler-aided) vectorization described in the Benefitting from Implicit Vectorization section. This technique enables you to map vector data types directly to the hardware vector registers, so the data types you use should match the width of the underlying SIMD instructions.
Consider the following recommendations:
- Use data types that match the actual hardware registers, for example, float8 or double4, so you bind code to the specific register width of the underlying hardware. This method provides maximum performance on a specific platform. However, performance on other platforms and generations of Intel® Core™ processors might be less than optimal.
- Use wider data types, such as float16, to transparently cover many SIMD hardware register widths. However, using types wider than the underlying hardware is similar to loop unrolling. This method might improve performance in some cases, but it also increases register pressure. Consider using the uchar16 data type to process four pixels simultaneously when operating on pixels with eight bits per component (see the sketch following this list).
- If your dataset consists of N elements, make sure the size of a grid, which is the number of work-items required to process the same dataset, does not exceed the N value.

NOTE: The int8 data type improves performance only starting with the 4th Generation Intel® Core™ processors.
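To illustrate the uchar16 recommendation above, the following sketch processes four 8-bit RGBA pixels per work-item; the kernel name, the brightening operation, and the constant offset are illustrative assumptions rather than part of the original example:

__kernel void brighten_pixels(__global const uchar16 *src,
                              __global uchar16 *dst)
{
    int id = get_global_id(0);
    /* Each uchar16 holds four RGBA pixels with eight bits per component.
       add_sat() saturates at 255, so the brightened values do not wrap. */
    dst[id] = add_sat(src[id], (uchar16)(16));
}

Because each work-item handles four pixels, the grid for this kernel contains four times fewer work-items than a scalar per-pixel version would.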
Using vector data types, you plan the vector-level parallelism yourself instead of relying on the implicit vectorization module. See the Benefitting from Implicit Vectorization section for more information.
This approach is useful in the following scenarios:
The following example demonstrates a multiplication kernel that targets the 256-bit vector units of the 2nd Generation Intel Core processors and higher:
__kernel __attribute__((vec_type_hint(float8)))
void edp_mul(__global const float8 *a,
             __global const float8 *b,
             __global float8 *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
In this example, the data passed to the kernel represents buffers of float8, so the calculations are performed on eight elements at a time.
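For comparison, a scalar version of the same multiplication (a sketch; the kernel name is illustrative) leaves the vectorization decision to the implicit vectorization module:

__kernel void edp_mul_scalar(__global const float *a,
                             __global const float *b,
                             __global float *result)
{
    /* One float per work-item; the implicit vectorization module
       decides how to pack work-items into SIMD lanes. */
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}

For the same dataset, the float8 version above needs eight times fewer work-items, which matters when you size the grid.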
The vec_type_hint attribute added before the edp_mul kernel signals to the compiler, or the implementation, that this kernel has an optimized vectorized form, so the implicit vectorization module does not operate on it. Use vec_type_hint to tell the compiler that your kernel already processes data using mostly vector types. For more details on this attribute, see section 6.7.2 of the OpenCL™ 1.2 specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf.
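The hint is also useful when the kernel arguments are scalar buffers but the body still works mostly on vector types through explicit vector loads and stores. The following sketch assumes a simple scaling operation; the kernel name and parameters are illustrative:

__kernel __attribute__((vec_type_hint(float4)))
void scale4(__global const float *in,
            __global float *out,
            float factor)
{
    size_t id = get_global_id(0);
    /* vload4/vstore4 move four consecutive floats per work-item,
       so the arithmetic is done on float4 values. */
    float4 v = vload4(id, in);
    vstore4(v * factor, id, out);
}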