When comparing OpenCL™ kernel performance on a CPU device with native code performance, make sure that both versions of the code are as similar as possible. Consider the following guidelines:
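- Exclude program build time from the measured kernel execution time, for example by precompiling the program (see the clCreateProgramWithBinary call).
- Demand the same accuracy. For example, rsqrt(x) is inherently more accurate than the _mm_rsqrt_ps SSE intrinsic. To use the same accuracy in native code and OpenCL™ code, do one of the following: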
  - Extend _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL™ rsqrt.
  - Use native_rsqrt in your OpenCL™ kernel, which maps exactly to the rsqrtps instruction in the final assembly code.

  As with rsqrt, you can use the relaxed versions of rcp, sqrt, and so on. Refer to the Developer Guide for Intel® SDK for OpenCL™ Applications for the full list of supported functions.
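
The first option above can be sketched in native C as follows. This is a minimal illustration, assuming an x86 target with SSE. The helper names rsqrt_refined and max_residual are hypothetical and do not come from the SDK, and the iteration count is left as a parameter because the number of steps needed to match OpenCL™ rsqrt should be verified on your hardware.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ps, _mm_mul_ps, ... */

/* Hypothetical helper: approximate 1/sqrt(x) for four floats, then refine
   the hardware estimate with `iterations` Newton-Raphson steps:
       y_next = y * (1.5 - 0.5 * x * y * y)                              */
static __m128 rsqrt_refined(__m128 x, int iterations)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(x);  /* low-precision initial estimate */
    for (int i = 0; i < iterations; ++i) {
        __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
    }
    return y;
}

/* Hypothetical helper: largest |x * y * y - 1| over four inputs, evaluated
   in double precision. For an exact reciprocal square root this is zero;
   for small errors it is about twice the relative error of y.           */
static double max_residual(const float *in, int iterations)
{
    float out[4];
    _mm_storeu_ps(out, rsqrt_refined(_mm_loadu_ps(in), iterations));
    double worst = 0.0;
    for (int i = 0; i < 4; ++i) {
        double r = (double)in[i] * (double)out[i] * (double)out[i] - 1.0;
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

Calling max_residual with 0, 1, and 2 iterations on the same inputs shows how quickly the estimate converges, which helps when choosing the iteration count for a fair comparison against the OpenCL™ kernel.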