The gfx_factory class implements the Factory concept of streaming_node to simplify the use of Intel® Graphics Technology for general-purpose computing in programs based on Intel® Threading Building Blocks (Intel® TBB).
The current implementation of gfx_factory does not allow memory buffer objects to be used concurrently. As a result, several streaming nodes customized with gfx_factory cannot be connected with each other directly.
class gfx_factory;
    #define TBB_PREVIEW_FLOW_GRAPH_NODES 1
    #define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1
    #include "tbb/gfx_factory.h"
gfx_factory is responsible for low-level aspects of using Intel® processor graphics (further referred to as the target) from an Intel® TBB flow graph: uploading input data to the target, running a kernel there, and passing the results back to the graph.
gfx_factory is implemented on top of the API provided by the Intel® C++ Compiler to organize queued offload of user-defined kernel functions and data sharing between the CPU and the processor graphics. Intel® C++ Compiler 16.0 or newer is required in order to use gfx_factory.
For additional details about the underlying API, refer to the Intel® C++ Compiler User and Reference Guide section Optimization and Programming Guide > Intel® Graphics Technology > Programming for Intel® Graphics Technology > Overview: API-Based Offloading.
A kernel function to use with gfx_factory is a separate user-defined function with data-parallel sections written using Intel® Cilk™ Plus. The function has to be annotated with __declspec(target(gfx_kernel)) to be converted to a kernel entry point for processor graphics execution:
Example
    static __declspec(target(gfx_kernel))
    void vector_square(int *v, size_t n) {
        cilk_for(size_t i = 0; i < n; ++i) {
            v[i] = v[i] * v[i];
        }
    }
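When checking kernel results on a machine without processor graphics support, a plain C++ counterpart of the kernel can be useful. The following is a minimal sketch and not part of Intel® TBB: the names vector_square_reference and squared_copy are hypothetical, and the loop is an ordinary serial loop rather than a cilk_for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only counterpart of vector_square (not part of Intel TBB).
// It squares each element in place with a serial loop, so it compiles with
// any C++11 compiler and needs no offload support.
static void vector_square_reference(int *v, std::size_t n) {
    assert(v != nullptr || n == 0);  // a null pointer is only valid for an empty range
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = v[i] * v[i];
    }
}

// Convenience helper (also hypothetical): returns a squared copy of the input.
static std::vector<int> squared_copy(std::vector<int> values) {
    vector_square_reference(values.data(), values.size());
    return values;
}
```

A result computed on the target could then be compared element by element against the output of squared_copy for the same input.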
gfx_factory requires the use of the gfx_buffer template class, which is an abstraction over a data array. This class is responsible for sharing an array of data between the host and the target while compute offload kernels are being executed on processor graphics.
    template <typename T>
    class gfx_buffer {
    public:
        typedef implementation-defined iterator;
        typedef implementation-defined const_iterator;
        typedef std::size_t size_type;

        gfx_buffer();
        gfx_buffer(size_type size);

        T* data();
        const T* data() const;
        size_type size() const;

        const_iterator cbegin() const;
        const_iterator cend() const;
        iterator begin();
        iterator end();

        T& operator[](size_type pos);
        const T& operator[](size_type pos) const;
    };
The following table provides additional information on the members of this template class.
Member | Description |
---|---|
iterator; const_iterator; | Implementation-defined iterator types. |
gfx_buffer(); | The constructor to create an empty gfx_buffer. |
gfx_buffer(size_type size); | The constructor to create a gfx_buffer of a certain size. The elements are value-initialized by calling T() for each. |
T* data(); const T* data() const; | Return a pointer to the data storage array. |
size_type size() const; | Return the number of elements in the buffer. |
iterator begin(); const_iterator cbegin() const; | Return an iterator to the first element of the container. |
iterator end(); const_iterator cend() const; | Return an iterator to the element following the last element of the container. |
T& operator[](size_type pos); const T& operator[](size_type pos) const; | Return a reference to the element at the specified position. |
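To show how these members combine with standard algorithms, the sketch below defines a host-only stand-in with the same member names. The class host_buffer and the helper fill_and_check are hypothetical and not part of Intel® TBB. The stand-in is backed by std::vector, shares nothing with the target, and adds a debug-only bounds check that the real class is not stated to have.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in that mirrors the gfx_buffer members listed
// above. It is backed by std::vector and does no host/target sharing.
template <typename T>
class host_buffer {
public:
    typedef typename std::vector<T>::iterator iterator;
    typedef typename std::vector<T>::const_iterator const_iterator;
    typedef std::size_t size_type;

    host_buffer() {}
    host_buffer(size_type size) : storage_(size) {}  // elements are value-initialized

    T* data() { return storage_.data(); }
    const T* data() const { return storage_.data(); }
    size_type size() const { return storage_.size(); }

    const_iterator cbegin() const { return storage_.cbegin(); }
    const_iterator cend() const { return storage_.cend(); }
    iterator begin() { return storage_.begin(); }
    iterator end() { return storage_.end(); }

    // The bounds assertion is an addition of this sketch (debug builds only).
    T& operator[](size_type pos) {
        assert(pos < storage_.size());
        return storage_[pos];
    }
    const T& operator[](size_type pos) const {
        assert(pos < storage_.size());
        return storage_[pos];
    }

private:
    std::vector<T> storage_;
};

// Fills a buffer with one value, then verifies it through the const interface.
inline bool fill_and_check(host_buffer<int>& buffer, int value) {
    std::fill(buffer.begin(), buffer.end(), value);
    const host_buffer<int>& view = buffer;
    return std::all_of(view.cbegin(), view.cend(),
                       [value](int x) { return x == value; });
}
```

The same begin()/end() and cbegin()/cend() calls are what the vector squaring example uses with std::fill and std::all_of.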
streaming_node requires a device selector: a functor that selects a device for offloading a particular computation. However, since the underlying API only works with Intel processor graphics, it has no option to select a particular device. Because of this, you have to use the dummy device selector provided by the factory: gfx_factory::dummy_device_selector().
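As a rough illustration of what a device selector functor looks like, the sketch below uses mock types only. The names mock_device, mock_factory, mock_dummy_device_selector, and pick_device are hypothetical, and the call operator taking a factory reference and returning a device is an assumption about the selector shape, not a statement of the gfx_factory implementation.

```cpp
#include <cassert>

// Hypothetical stand-ins, not Intel TBB types.
struct mock_device {
    int index;  // position of the device in the factory's device list
};

struct mock_factory {
    typedef mock_device device_type;
    // Exactly one target exists, mirroring the single-device situation above.
    static int device_count() { return 1; }
};

// A "dummy" selector: there is nothing to choose between, so it always
// returns the only device.
struct mock_dummy_device_selector {
    mock_factory::device_type operator()(mock_factory&) const {
        mock_device only = {0};
        return only;
    }
};

// Mimics what a node might do before offloading: ask the selector for a
// device and sanity-check that the factory knows about it.
template <typename Selector>
mock_factory::device_type pick_device(mock_factory& factory, Selector selector) {
    mock_factory::device_type device = selector(factory);
    assert(device.index >= 0 && device.index < mock_factory::device_count());
    return device;
}
```

With a single device, every call to pick_device yields the same result, which is why a selector of this kind carries no real decision logic.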
See a simple vector squaring example below.
    #define TBB_PREVIEW_FLOW_GRAPH_NODES 1
    #define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1

    #include <algorithm>  // std::all_of, std::fill
    #include <iostream>
    #include <cilk/cilk.h>
    #include "tbb/flow_graph.h"
    #include "tbb/gfx_factory.h"

    // Kernel entry point executed on processor graphics: squares v[0..n) in place.
    static __declspec(target(gfx_kernel))
    void vector_square(int *v, size_t n) {
        cilk_for(size_t i = 0; i < n; ++i) {
            v[i] = v[i] * v[i];
        }
    }

    int main() {
        using namespace tbb::flow;

        typedef tuple< gfx_buffer<int>, size_t > kernel_args;
        typedef streaming_node< kernel_args, queueing, gfx_factory > gfx_node;

        graph g;
        gfx_factory factory(g);

        // Node that offloads vector_square to the target.
        gfx_node squaring(g, vector_square, gfx_factory::dummy_device_selector(), factory);

        // Host-side check: every element must equal 2 * 2.
        function_node< gfx_buffer<int> > validation(g, unlimited,
            [](const gfx_buffer<int>& buffer) {
                bool is_correct = std::all_of(buffer.cbegin(), buffer.cend(),
                                              [](int i) { return i == 4; });
                if (is_correct) {
                    std::cout << "Results are correct." << std::endl;
                }
            });

        make_edge(output_port<0>(squaring), validation);

        const size_t array_size = 1000000;
        gfx_buffer<int> buffer(array_size);
        std::fill(buffer.begin(), buffer.end(), 2);

        // Pass the values received on input ports 0 and 1 as the kernel arguments.
        squaring.set_args(port_ref<0, 1>);
        input_port<0>(squaring).try_put(buffer);
        input_port<1>(squaring).try_put(array_size);

        g.wait_for_all();
        return 0;
    }
The gfx_factory class implements the Factory Concept defined by streaming_node.
For details, see streaming_node reference.
    namespace tbb {
    namespace flow {

    class gfx_factory {
    public:
        typedef implementation-defined device_type;
        typedef implementation-defined kernel_type;

        gfx_factory(tbb::flow::graph& g);

        template <typename ...Args>
        void send_data(device_type device, Args&... args);

        template <typename ...Args>
        void send_kernel(device_type device, const kernel_type& kernel, Args&... args);

        template <typename FinalizeFn, typename ...Args>
        void finalize(device_type device, FinalizeFn fn, Args&... args);

        class dummy_device_selector;
    };

    }
    }
The following table provides additional information on the members of this class.
Member | Description |
---|---|
device_type; kernel_type; | Implementation-defined types. |
gfx_factory(tbb::flow::graph& g); | Main constructor. Store a reference to the graph for synchronization between the graph and the device. |
template <typename ...Args> void send_data(device_type device, Args&... args); | Share data with the device. |
template <typename ...Args> void send_kernel(device_type device, const kernel_type& kernel, Args&... args); | Put kernel into the in-order offload queue. |
template <typename FinalizeFn, typename ...Args> void finalize(device_type device, FinalizeFn fn, Args&... args); | Finalize the kernel run if no node successors exist. |
class dummy_device_selector; | Dummy device selector functor. Has to be passed to the streaming_node constructor. |
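To make the roles of send_data, send_kernel, and finalize more concrete, the sketch below defines a mock factory with the same member shapes that only records which member was called. Everything in it is hypothetical: logging_factory, square_on_host, and run_once are illustration names, the kernel runs on the host instead of being queued for offload, and the call order in run_once is an assumption about how a node might drive a factory rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mock, not the Intel TBB implementation.
class logging_factory {
public:
    struct device_type {};
    typedef void (*kernel_type)(int*, std::size_t);

    template <typename... Args>
    void send_data(device_type, Args&...) { log_.push_back("send_data"); }

    template <typename... Args>
    void send_kernel(device_type, const kernel_type& kernel, Args&... args) {
        assert(kernel != nullptr);
        log_.push_back("send_kernel");
        kernel(args...);  // runs on the host; a real factory would enqueue an offload
    }

    template <typename FinalizeFn, typename... Args>
    void finalize(device_type, FinalizeFn fn, Args&...) {
        log_.push_back("finalize");
        fn();
    }

    const std::vector<std::string>& log() const { return log_; }

private:
    std::vector<std::string> log_;
};

// Host stand-in for a kernel: squares each element in place.
static void square_on_host(int* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i] *= v[i];
}

// Drives the mock in the order assumed here: share data, submit kernel, finalize.
static std::vector<std::string> run_once(std::vector<int>& values) {
    logging_factory factory;
    logging_factory::device_type device;
    int* data = values.data();
    std::size_t n = values.size();
    bool finalized = false;
    factory.send_data(device, data, n);
    factory.send_kernel(device, &square_on_host, data, n);
    factory.finalize(device, [&finalized] { finalized = true; }, data, n);
    assert(finalized);
    return factory.log();
}
```

Calling run_once on a vector squares its elements and returns the recorded sequence of member calls, which makes the division of work among the three members easy to inspect.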