The application should process data “in-place” and minimize copying memory objects.
For example, data transfers over the PCI Express (PCIe) bus have the highest latency and the lowest bandwidth of the data paths the application uses. As with any other PCIe device, you should reduce this traffic to a minimum.
Also, when possible, use the CL_MEM_WRITE_ONLY and CL_MEM_READ_ONLY semantics with clCreateBuffer(). This also helps reduce data transfers across the PCIe bus.
When mapping a buffer to the host with clEnqueueMapBuffer(), use the appropriate flags:
CL_MAP_WRITE_INVALIDATE_REGION if you are going to overwrite the mapped region without reading its previous contents.
CL_MAP_READ if you are not going to modify the mapped data.
NOTE: After a period of low activity, the CPU might enter deep C-states if aggressive power-saving features are enabled. This can happen while waiting for long DMA transfers to or from the coprocessor, and may significantly degrade data transfer bandwidth.
Refer to the "Shared Context for Multiple Intel® Xeon Phi™ Coprocessors" section for important tips on avoiding implicit data copying by the runtime.
If your tasks are independent, consider using an out-of-order queue.