The gfx_factory class implements the Factory concept of streaming_node to simplify the use of Intel® Graphics Technology for general-purpose computing in programs based on Intel® Threading Building Blocks (Intel® TBB).
The current implementation of gfx_factory does not allow memory buffer objects to be used concurrently. As a result, several streaming nodes customized with gfx_factory cannot be connected with each other directly.
class gfx_factory;
#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1
#include "tbb/gfx_factory.h"
gfx_factory is responsible for low-level aspects of using Intel® processor graphics (further referred to as the target) from an Intel® TBB flow graph: uploading input data to the target, running a kernel there, and passing the results back to the graph.
gfx_factory is implemented on top of the API provided by the Intel® C++ Compiler to organize queued offload of user-defined kernel functions and data sharing between the CPU and the processor graphics. Intel® C++ Compiler 16.0 or newer is required in order to use gfx_factory.
For additional details about the underlying API, refer to the Intel® C++ Compiler User and Reference Guide section Optimization and Programming Guide > Intel® Graphics Technology > Programming for Intel® Graphics Technology > Overview: API-Based Offloading.
A kernel function to use with gfx_factory is a separate user-defined function with data-parallel sections written using Intel® Cilk™ Plus. The function has to be annotated with __declspec(target(gfx_kernel)) to be converted to a kernel entry point for processor graphics execution:
Example
static __declspec(target(gfx_kernel))
void vector_square(int *v, size_t n) {
    cilk_for(size_t i = 0; i < n; ++i) {
        v[i] = v[i] * v[i];
    }
}
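When checking kernel results on the host, a plain C++ counterpart of the kernel can be useful. The sketch below is an illustration only and not part of gfx_factory: the names vector_square_reference and squared_copy are hypothetical, and a serial loop replaces cilk_for so that the code builds with any C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side counterpart of vector_square: squares the first
// n elements of v in place using an ordinary serial loop.
inline void vector_square_reference(int* v, std::size_t n) {
    assert(v != nullptr || n == 0);  // a null pointer is only valid for n == 0
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = v[i] * v[i];
    }
}

// Hypothetical convenience helper: returns a squared copy of the input,
// leaving the caller's data untouched.
inline std::vector<int> squared_copy(std::vector<int> values) {
    vector_square_reference(values.data(), values.size());
    return values;
}
```

Comparing the output of the offloaded kernel against squared_copy applied to the same input is one simple way to validate a run.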
gfx_factory requires the use of the gfx_buffer template class, which is an abstraction over a data array. This class is responsible for sharing an array of data between the host and the target while compute offload kernels are being executed on processor graphics.
template <typename T>
class gfx_buffer {
public:
    typedef implementation-defined iterator;
    typedef implementation-defined const_iterator;
    typedef std::size_t size_type;

    gfx_buffer();
    gfx_buffer(size_type size);

    T* data();
    const T* data() const;
    size_type size() const;

    const_iterator cbegin() const;
    const_iterator cend() const;
    iterator begin();
    iterator end();

    T& operator[](size_type pos);
    const T& operator[](size_type pos) const;
};
The following table provides additional information on the members of this template class.
| Member | Description |
|---|---|
| iterator; const_iterator; | Implementation-defined iterator types. |
| gfx_buffer(); | The constructor to create an empty gfx_buffer. |
| gfx_buffer(size_type size); | The constructor to create a gfx_buffer of a certain size. The elements are value-initialized by calling T() for each. |
| T* data(); const T* data() const; | Return a pointer to the data storage array. |
| size_type size() const; | Return the number of elements in the buffer. |
| iterator begin(); const_iterator cbegin() const; | Return an iterator to the first element of the container. |
| iterator end(); const_iterator cend() const; | Return an iterator to the element following the last element of the container. |
| T& operator[](size_type pos); const T& operator[](size_type pos) const; | Return a reference to the element at the specified position. |
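The members listed above follow the usual standard-container conventions. The sketch below is a host-only stand-in, assuming a hypothetical host_buffer class backed by std::vector. It mirrors the documented member names so the access patterns can be tried without Intel processor graphics; it performs no sharing between the host and the target and says nothing about how gfx_buffer is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in that mirrors the documented gfx_buffer members.
// Storage is a plain std::vector; nothing is shared with a device.
template <typename T>
class host_buffer {
public:
    typedef typename std::vector<T>::iterator iterator;
    typedef typename std::vector<T>::const_iterator const_iterator;
    typedef std::size_t size_type;

    host_buffer() {}
    host_buffer(size_type size) : storage_(size) {}  // elements are value-initialized

    T* data() { return storage_.data(); }
    const T* data() const { return storage_.data(); }
    size_type size() const { return storage_.size(); }

    const_iterator cbegin() const { return storage_.cbegin(); }
    const_iterator cend() const { return storage_.cend(); }
    iterator begin() { return storage_.begin(); }
    iterator end() { return storage_.end(); }

    T& operator[](size_type pos) { assert(pos < size()); return storage_[pos]; }
    const T& operator[](size_type pos) const { assert(pos < size()); return storage_[pos]; }

private:
    std::vector<T> storage_;
};

// Hypothetical helper: fills the buffer with 2s through the mutable
// iterators, then sums the elements through the const iterators.
inline long fill_and_sum(host_buffer<int>& buffer) {
    std::fill(buffer.begin(), buffer.end(), 2);
    return std::accumulate(buffer.cbegin(), buffer.cend(), 0L);
}
```

The same begin()/end() and cbegin()/cend() calls appear in the vector squaring example in this section, where they are applied to a real gfx_buffer.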
streaming_node requires a device selector: a functor that selects a device for offloading a particular computation. However, the underlying API works only with Intel processor graphics, so there is no particular device to select. Because of this, you have to pass the dummy device selector provided by the factory: gfx_factory::dummy_device_selector().
See a simple vector squaring example below.
// The preview macros are defined before the TBB headers are included.
#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1
#include <algorithm>  // std::all_of, std::fill
#include <iostream>
#include <cilk/cilk.h>
#include "tbb/flow_graph.h"
#include "tbb/gfx_factory.h"

// Kernel entry point executed on processor graphics.
static __declspec(target(gfx_kernel))
void vector_square(int *v, size_t n) {
    cilk_for(size_t i = 0; i < n; ++i) {
        v[i] = v[i] * v[i];
    }
}

int main() {
    using namespace tbb::flow;
    typedef tuple< gfx_buffer<int>, size_t > kernel_args;
    typedef streaming_node< kernel_args, queueing, gfx_factory > gfx_node;

    graph g;
    gfx_factory factory(g);
    gfx_node squaring(g, vector_square, gfx_factory::dummy_device_selector(), factory);

    // Check that every element was squared (2 * 2 == 4).
    function_node< gfx_buffer<int> >
        validation(g, unlimited,
            [](const gfx_buffer<int>& buffer) {
                bool is_correct = std::all_of(buffer.cbegin(), buffer.cend(),
                                              [](int i) { return i == 4; });
                if (is_correct) {
                    std::cout << "Results are correct." << std::endl;
                }
            });
    make_edge(output_port<0>(squaring), validation);

    const size_t array_size = 1000000;
    gfx_buffer<int> buffer(array_size);
    std::fill(buffer.begin(), buffer.end(), 2);

    // Use input ports 0 and 1 as the kernel arguments, then feed both ports.
    squaring.set_args(port_ref<0, 1>);
    input_port<0>(squaring).try_put(buffer);
    input_port<1>(squaring).try_put(array_size);

    g.wait_for_all();
}
The gfx_factory class implements the Factory Concept defined by streaming_node.
For details, see streaming_node reference.
namespace tbb {
namespace flow {

class gfx_factory {
public:
    typedef implementation-defined device_type;
    typedef implementation-defined kernel_type;

    gfx_factory(tbb::flow::graph& g);

    template <typename ...Args>
    void send_data(device_type device, Args&... args);

    template <typename ...Args>
    void send_kernel(device_type device, const kernel_type& kernel, Args&... args);

    template <typename FinalizeFn, typename ...Args>
    void finalize(device_type device, FinalizeFn fn, Args&... args);

    class dummy_device_selector;
};

}
}
The following table provides additional information on the members of this class.
| Member | Description |
|---|---|
| device_type; kernel_type; | Implementation-defined types. |
| gfx_factory(tbb::flow::graph& g); | Main constructor. Store a reference to the graph for synchronization between the graph and the device. |
| template <typename ...Args> void send_data(device_type device, Args&... args); | Share data with the device. |
| template <typename ...Args> void send_kernel(device_type device, const kernel_type& kernel, Args&... args); | Put kernel into the in-order offload queue. |
| template <typename FinalizeFn, typename ...Args> void finalize(device_type device, FinalizeFn fn, Args&... args); | Finalize the kernel run if no node successors exist. |
| class dummy_device_selector; | Dummy device selector functor. Has to be passed to the streaming_node constructor. |
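To make the division of work among the three member templates concrete, here is a sketch built around a hypothetical tracing_factory. It only mirrors the signatures listed above and records each call in a log. The device_type and kernel_type definitions, the square_kernel function, and the run_once driver are all assumptions made for illustration; they do not reflect how streaming_node or gfx_factory is implemented. The call order (share data, enqueue the kernel, finalize) follows the description of gfx_factory's responsibilities earlier in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical factory that mirrors the documented member signatures
// and logs every call instead of talking to a device.
class tracing_factory {
public:
    typedef int device_type;                         // assumption: the real type is implementation-defined
    typedef void (*kernel_type)(int*, std::size_t);  // assumption: the real type is implementation-defined

    template <typename... Args>
    void send_data(device_type, Args&...) { log.push_back("send_data"); }

    template <typename... Args>
    void send_kernel(device_type, const kernel_type& kernel, Args&... args) {
        assert(kernel != nullptr);
        log.push_back("send_kernel");
        kernel(args...);  // run on the host right away instead of queueing an offload
    }

    template <typename FinalizeFn, typename... Args>
    void finalize(device_type, FinalizeFn fn, Args&...) {
        log.push_back("finalize");
        fn();
    }

    // Plays the role of a dummy selector: there is nothing to choose from.
    class dummy_device_selector {
    public:
        device_type operator()(tracing_factory&) { return device_type(); }
    };

    std::vector<std::string> log;
};

// Hypothetical host kernel used by the driver below.
inline void square_kernel(int* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i] *= v[i];
}

// Hypothetical driver: calls the three members in order, treating the
// node as if it had no successors so that finalize is always invoked.
inline bool run_once(tracing_factory& factory, std::vector<int>& values) {
    tracing_factory::dummy_device_selector selector;
    tracing_factory::device_type device = selector(factory);
    int* data = values.data();
    std::size_t n = values.size();
    bool finalized = false;
    factory.send_data(device, data, n);
    factory.send_kernel(device, &square_kernel, data, n);
    factory.finalize(device, [&finalized] { finalized = true; }, data, n);
    return finalized;
}
```

After run_once returns, the factory's log holds the three call names in the order they were made, which makes the sequence easy to inspect in a debugger or a unit test.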