Avoid Needless Synchronization
For better results, avoid explicit command synchronization primitives,
such as clEnqueueMarker and clEnqueueBarrier. Explicit synchronization
commands and event tracking result in cross-module round trips, which
decrease performance. The fewer explicit synchronization commands you
use, the better the performance.
Use the following techniques to reduce explicit synchronization:
- Merge kernels whenever possible. It also improves data locality.
- If you need to wait for a kernel to complete execution before reading
the resulting buffer, do not wait immediately after submitting it.
Continue submitting work and defer the wait until the host actually
needs the first buffer with results.
- If an in-order queue expresses the dependency chain correctly,
use it to define a string of dependent kernels. In the in-order execution
model, the commands in a command queue are executed in the order of
submission, with each command running to completion before the next
one begins. This is a typical case for a straightforward processing
pipeline. Consider the following:
- Using the blocking OpenCL™ API is more effective than explicit
synchronization schemes based on OS synchronization primitives.
- If you are optimizing the kernel pipeline, first measure kernels
separately to find the most time-consuming one. In the final pipeline
version, avoid calling clFinish or clWaitForEvents frequently, for
example, after each kernel invocation. Prefer submitting the whole
sequence (to the in-order queue) and then either calling clFinish once
or waiting on a single OpenCL event object, which reduces host-device
round trips.
See Also
Reuse Compilation Results with clCreateProgramWithBinary
Task-Parallel Programming Model Hints