Intel® SDK for OpenCL™ Applications includes an implicit vectorization module as part of the program build process. When it is beneficial for performance, this module packs several work-items together and executes them with SIMD instructions. This lets you benefit from the vector units in Intel® Architecture processors without writing explicit vector code.
The vectorization module transforms operations on scalar data types performed by adjacent work-items into equivalent vector operations. When vector operations already exist in the kernel source code, the module scalarizes them (breaks them down into component operations) and then revectorizes them. This improves performance by transforming the memory access pattern of the kernel into a structure of arrays (SOA), which is often more cache-friendly than an array of structures (AOS).
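To make the idea of packing adjacent work-items concrete, here is a minimal sketch. The kernel name scale_offset, the host-side shim, and the packing width of four are assumptions chosen for illustration. The run_packed4 function is only a conceptual picture of the transformation, not the code the module actually generates.

```c
/* Hypothetical host-side shim (not part of the SDK): lets the kernel below
   compile as plain C so its logic can be tried without an OpenCL device.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t emulated_gid;  /* index of the work-item being emulated */
static size_t get_global_id(unsigned int dim) { (void)dim; return emulated_gid; }
#endif

/* Scalar kernel: every work-item scales and offsets one float. */
__kernel void scale_offset(__global const float *in, __global float *out,
                           float scale, float offset)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * scale + offset;
}

#ifndef __OPENCL_VERSION__
/* Emulates n work-items one at a time, exactly as the kernel is written. */
static void run_scalar(const float *in, float *out,
                       float scale, float offset, size_t n)
{
    for (emulated_gid = 0; emulated_gid < n; ++emulated_gid)
        scale_offset(in, out, scale, offset);
}

/* Conceptual view of packing: four adjacent work-items advance together,
   so each inner loop stands for one 4-wide SIMD operation.
   Assumes n is a multiple of 4. */
static void run_packed4(const float *in, float *out,
                        float scale, float offset, size_t n)
{
    for (size_t base = 0; base < n; base += 4) {
        float lane[4];
        for (int l = 0; l < 4; ++l)            /* stands for one vector load */
            lane[l] = in[base + l];
        for (int l = 0; l < 4; ++l)            /* stands for vector multiply-add */
            lane[l] = lane[l] * scale + offset;
        for (int l = 0; l < 4; ++l)            /* stands for one vector store */
            out[base + l] = lane[l];
    }
}
#endif
```

Both host functions produce the same output. The packed version simply shows that adjacent work-items read and write adjacent memory, which is what makes the SIMD mapping possible.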
You can find more details in the Intel® OpenCL™ Implicit Vectorization Module overview at http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf and OpenCL™ Autovectorization in Intel SDK for OpenCL Applications version 1.5.
The implicit vectorization module works best for kernels that operate on elements of four-byte width, such as float or int data types. You can define the computational width of a kernel using the OpenCL vec_type_hint attribute.
Because the default computation width is four bytes, kernels are vectorized by default. If your kernel uses vector types, you can specify __attribute__((vec_type_hint(<typen>))) with typen set to any vector type (for example, float3 or char4). This attribute tells the vectorization module to apply only the transformations that are useful for this type.
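As an illustration of where the attribute goes, the following sketch keeps a small kernel as a C string, the form in which source is commonly passed to clCreateProgramWithSource. The kernel name scale4, its arguments, and the variable name scale4_source are hypothetical. Only the placement of the attribute after __kernel follows the OpenCL specification.

```c
/* Kernel source held as a C string on the host side.
   The kernel itself (scale4) is a hypothetical example that works
   on explicit float4 data, so it declares float4 as its type hint. */
static const char *const scale4_source =
    "__kernel __attribute__((vec_type_hint(float4)))\n"
    "void scale4(__global float4 *data, float k)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] = data[i] * k;   /* explicit float4 arithmetic */\n"
    "}\n";
```

A kernel written with scalar float or int operations needs no hint, because the default computation width already matches.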
The performance benefit from the vectorization module might be lower for kernels that include complex control flow.
To benefit from vectorization, you do not need to write for loops inside kernels. For best results, let the kernel deal with a single data element and let the vectorization module take care of the rest. The more straightforward your OpenCL™ code is, the more optimization you get from vectorization.
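The contrast between a loop-based kernel and a single-element kernel can be sketched as follows. The kernel names, the chunk parameter, and the host-side shim are assumptions made for illustration. The shim only emulates work-items sequentially so that the two styles can be compared on the host.

```c
/* Hypothetical host-side shim (not part of the SDK): lets the kernels below
   compile as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and
   skips this part. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t emulated_gid;  /* index of the work-item being emulated */
static size_t get_global_id(unsigned int dim) { (void)dim; return emulated_gid; }
#endif

/* Loop-based style: each work-item walks a chunk of `chunk` elements. */
__kernel void square_chunked(__global const int *in, __global int *out, int chunk)
{
    size_t first = get_global_id(0) * (size_t)chunk;
    for (int j = 0; j < chunk; ++j)
        out[first + j] = in[first + j] * in[first + j];
}

/* Single-element style: one data element per work-item, no loop. */
__kernel void square_single(__global const int *in, __global int *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * in[i];
}

#ifndef __OPENCL_VERSION__
/* Launches n / chunk emulated work-items; assumes n is a multiple of chunk. */
static void run_chunked(const int *in, int *out, size_t n, int chunk)
{
    size_t items = n / (size_t)chunk;
    for (emulated_gid = 0; emulated_gid < items; ++emulated_gid)
        square_chunked(in, out, chunk);
}

/* Launches one emulated work-item per element. */
static void run_single(const int *in, int *out, size_t n)
{
    for (emulated_gid = 0; emulated_gid < n; ++emulated_gid)
        square_single(in, out);
}
#endif
```

Both kernels compute the same result. In square_single, adjacent work-items touch adjacent elements, which matches the single-data-element pattern recommended above.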
Writing the kernel in plain scalar code works best for efficient vectorization. This method of coding avoids many of the disadvantages potentially associated with explicit (manual) vectorization, which are described in the Using Vector Data Types section.