Shared Context for Multiple Intel® Xeon Phi™ Coprocessors
You can operate multiple Intel® Xeon Phi™ coprocessors in different ways:
- Running several application instances, which is the easiest way if your application is Message Passing Interface (MPI)-enabled. Run one MPI rank per coprocessor so that each instance picks up a different coprocessor. Refer to the multiple coprocessor support section in the Developer Guide for Intel® SDK for OpenCL™ Applications for the description of the OFFLOAD_DEVICES environment variable. However, this approach lacks efficient data sharing.
- Multi-context approach with a separate context per coprocessor, which is often the most straightforward way to enable an application for multi-device support. You need to add a loop over devices around nearly every OpenCL API call.
- Shared context approach, in which a single context spans several coprocessors and each device gets its own queue. This is the preferred way because it enables sharing of kernels, memory objects, and so on. Any context with coprocessors can also include the CPU device; refer to the “Notes on the Shared Context Support for CPU and Intel Xeon Phi Coprocessor” section.
According to the OpenCL specification, a shared context implies the sharing of memory objects. Thus, to avoid redundant data transfers caused by implicit synchronization in the runtime, follow the guidelines below:
- After allocating a memory object in a shared context, use non-overlapping sub-buffers to distribute the data for processing by different coprocessors. For performance reasons, ensure that all commands for the current sub-buffers are completed and the sub-buffers are released before accessing the parent buffer with API calls or before creating another sub-buffer.
- Use clEnqueueMigrateMemObjects() to avoid implicit data copies. Typically, memory objects are implicitly migrated to the device targeted by the enqueued commands. clEnqueueMigrateMemObjects() permits this migration to be performed explicitly, ahead of the dependent commands.
- If your code is going to overwrite the migrated data anyway, use the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag, which helps to avoid redundant data copying.
- An additional optimization is to issue the migration commands to a dedicated queue, so that the transfer of the next data portion overlaps with the processing of the current buffer.
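The sub-buffer guideline above depends on splitting the parent buffer into regions that do not overlap. The following is a minimal sketch of that partitioning step only, not the SDK's own method. The helper partition_buffer and the region_t type are hypothetical names introduced for illustration; region_t mirrors the origin/size fields of the OpenCL cl_buffer_region structure so that the sketch compiles without the OpenCL headers. It assumes the caller has already converted the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value, which OpenCL reports in bits, to bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the same fields as OpenCL's cl_buffer_region. */
typedef struct {
    size_t origin; /* byte offset into the parent buffer */
    size_t size;   /* region length in bytes */
} region_t;

/* Hypothetical helper: split total_bytes into at most num_devices
 * non-overlapping regions whose origins are multiples of align_bytes.
 * align_bytes is assumed to be CL_DEVICE_MEM_BASE_ADDR_ALIGN / 8.
 * Returns the number of regions written to out (can be fewer than
 * num_devices when the buffer is small). */
size_t partition_buffer(size_t total_bytes, size_t num_devices,
                        size_t align_bytes, region_t *out)
{
    assert(num_devices > 0 && align_bytes > 0 && out != NULL);

    /* Even share per device, rounded up to the next aligned size. */
    size_t chunk = (total_bytes + num_devices - 1) / num_devices;
    chunk = (chunk + align_bytes - 1) / align_bytes * align_bytes;

    size_t count = 0;
    for (size_t origin = 0;
         origin < total_bytes && count < num_devices;
         origin += chunk) {
        size_t remaining = total_bytes - origin;
        out[count].origin = origin;
        /* The last region takes whatever is left of the parent buffer. */
        out[count].size = remaining < chunk ? remaining : chunk;
        count++;
    }
    return count;
}
```

In a real application, each origin/size pair would be copied into a cl_buffer_region and passed to clCreateSubBuffer() with CL_BUFFER_CREATE_TYPE_REGION. The resulting sub-buffer could then be handed to clEnqueueMigrateMemObjects() on the queue of the coprocessor that processes it. Rounding the chunk size up to the alignment keeps every origin aligned, which avoids the CL_MISALIGNED_SUB_BUFFER_OFFSET error.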
Refer to the “Minimize Data Transfers with Intel® Xeon Phi™ Coprocessors” section for tips on efficient mapping to the host.
See Also
Developer Guide for Intel® SDK for OpenCL™ Applications