Intel® C++ Compiler 18.0 Developer Guide and Reference

Device-Querying Functions

Queries various characteristics specific to the processor graphics. This topic only applies when targeting Intel® Graphics Technology.

Syntax

GfxGpuPlatform _GFX_get_device_platform(void);

GfxGpuSKU _GFX_get_device_sku(void);

int _GFX_get_device_hardware_thread_count(void);

int _GFX_get_device_min_frequency(void);

int _GFX_get_device_max_frequency(void);

int _GFX_get_device_current_frequency(void);

int _GFX_get_number_of_devices(void);

int _GFX_get_device_subslice_thread_count(void);

Parameters

None.

Description

These functions get various characteristics of the specific processor graphics you are using.

The following table shows the information each function gets:

Function

Description

_GFX_get_device_platform

Gets the platform (architecture generation, such as HSW or SKL) of the installed processor graphics.

_GFX_get_device_sku

Gets the SKU (such as GT2) of the processor graphics.

_GFX_get_device_hardware_thread_count

Gets the maximum number of threads that the processor graphics can run in parallel.

_GFX_get_device_min_frequency

Gets the minimum frequency of the processor graphics.

_GFX_get_device_max_frequency

Gets the maximum frequency of the processor graphics.

_GFX_get_device_current_frequency

Gets the current frequency of the processor graphics.

_GFX_get_number_of_devices

Gets the number of processor graphics units available in the system for offloading.

Use this function to write code that is dependent upon the availability of the processor graphics for offloading.

_GFX_get_device_subslice_thread_count

Gets the number of hardware threads per subslice for the available Intel processor graphics unit.

This function is useful when calculating a budget for the total size of the __thread_group_local variables used in a kernel (a function marked with __declspec(target(gfx_kernel))). Exceeding that budget might lead to a runtime shared local memory overflow error. The _GFX_get_device_subslice_thread_count example in this topic calculates the minimum thread group size for a given total size of __thread_group_local variables.
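As a rough illustration of such a budget calculation, the sketch below inverts the calculation from the _GFX_get_device_subslice_thread_count example in this topic: given a desired thread group size, it estimates how many bytes of __thread_group_local variables each group can use. The helper name slm_budget_per_group is hypothetical, the two constants are copied from that example, and subtracting the compiler-reserved bytes from the chunk is an assumption, not documented behavior. The thread count is a parameter here so the arithmetic can be checked on any host; in real code it would come from _GFX_get_device_subslice_thread_count().

```cpp
// Constants copied from the subslice thread count example in this topic.
const int c_slm_scratch_frame_size = 16;     // bytes reserved by the compiler per group
const int c_max_slm_per_ss = 1024 * 64;      // SLM bytes available per subslice

// Hypothetical helper (not part of the Intel API).
// Returns an estimate of the largest total size, in bytes, of
// __thread_group_local variables per thread group, or -1 on bad input.
int slm_budget_per_group(int threads_per_ss, int group_size)
{
    if (threads_per_ss <= 0 || group_size <= 0) {
        return -1;
    }
    // Number of groups needed to occupy every hardware thread of a subslice.
    int groups = (threads_per_ss + group_size - 1) / group_size;

    // Even share of the subslice SLM for each of those groups.
    int share = c_max_slm_per_ss / groups;

    // SLM is allocated in chunks of 1024 * 2^n bytes, so round the share down.
    int chunk = 1024;
    if (share < chunk) {
        return 0;
    }
    while (chunk * 2 <= share) {
        chunk *= 2;
    }
    // Assumption: the compiler-reserved scratch frame comes out of the chunk.
    return chunk - c_slm_scratch_frame_size;
}
```

With an assumed 56 hardware threads per subslice and groups of 4 threads, 14 groups share the subslice, each share is 4681 bytes, the largest chunk that fits is 4096 bytes, and the estimated budget is 4080 bytes.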

Return Values

_GFX_get_device_platform

0-Unknown

1-SNB

2-IVB

3-HSW

4-BDW

5-VLV

6-CHV

7-SKL

8-BXT

_GFX_get_device_sku

0-Unknown

1-GT1

2-GT2

3-GT3

4-GT4

5-GTVLV

6-GTVLVPLUS

7-GTCHV

10-GT1_5

_GFX_get_device_hardware_thread_count

An integer. The maximum number of threads that the processor graphics can run in parallel.

_GFX_get_device_min_frequency

An integer. The minimum frequency, in MHz, of the processor graphics.

_GFX_get_device_max_frequency

An integer. The maximum frequency, in MHz, of the processor graphics.

_GFX_get_device_current_frequency

An integer. The current frequency, in MHz, of the processor graphics.

_GFX_get_number_of_devices

An integer. The number of Intel processor graphics units available for offloading. If no devices are available the return value is 0.

_GFX_get_device_subslice_thread_count

An integer. The number of hardware threads per subslice for the Intel processor graphics unit. When multiple units are available, the function assumes they are the same SKU. If this number cannot be determined or another error occurs, the return value is a negative number.

Example: _GFX_get_number_of_devices

Use this function to decide at run time which code path gives the best performance, based on whether processor graphics is available for offloading. The application can select a code path specifically tuned for the processor graphics when offloading is available and otherwise execute on the CPU.

Intel® Cilk™ Plus is a deprecated feature in the Intel® C++ Compiler 18.0. An alternative for offloading to the processor graphics is planned for a future release. For more information, see Migrate Your Application to use OpenMP* or Intel® TBB Instead of Intel® Cilk™ Plus.

// Kernel that is specifically tuned for execution on the processor graphics.
__declspec(target(gfx_kernel))
void gfx_kernel(float *out, const float *in, int len);

// The same kernel tuned for CPU execution
void cpu_kernel(float *out, const float *in, int len);

void run_kernel(float *out, const float *in, int len)
{
    ...

    // Run the kernel tuned for processor graphics when available
    // or a kernel variant specialized for execution on CPU otherwise.
    if (_GFX_get_number_of_devices() > 0) {
        // Enqueue GPU kernel
        _GFX_share(in, sizeof(float) * len);
        _GFX_share(out, sizeof(float) * len);
        _GFX_offload(gfx_kernel, out, in, len);
        if (_GFX_get_last_error() != GFX_SUCCESS) {
            fatal_error();
        }
    }
    else {
        // Spawn CPU version of the kernel
        _Cilk_spawn cpu_kernel(out, in, len);
    }

    ...
}

Example: _GFX_get_device_subslice_thread_count

This example demonstrates how to calculate the minimum number of threads in a thread group for a kernel that uses __thread_group_local variables, so that Shared Local Memory (SLM) does not overflow at run time. The calculated number depends on the size of the __thread_group_local memory space used by the kernel. Because __thread_group_local memory is shared between the threads in a group and is allocated in SLM, each thread group consumes this amount of SLM. SLM is actually allocated to a thread group in chunks whose sizes are powers of 2 multiplied by 1024 bytes: 1024, 2048, and so on. The sample code takes this into account.

#include <algorithm>  // std::max, std::min
#include <cmath>      // std::ceil
#include <cstdio>     // printf
#include <cstdlib>    // exit

// This amount (in bytes) of SLM per thread group is reserved by the compiler.
const int c_slm_scratch_frame_size = 16;

// Hardware limit on the SLM size per subslice.
const int c_max_slm_per_ss = 1024*64;

// Hardware limit on the thread group size.
const int c_max_group_size = 64;


// Aligns SLM frame size to a number supported by hardware:
int align_slm_frame_size(int slm_frame_size)
{
    const int block_size = 1024;
    const int max_slm = c_max_slm_per_ss;

    if (slm_frame_size >= max_slm || slm_frame_size <= 0) {
        return slm_frame_size;
    }
    int aligned_size = block_size;

    for (; slm_frame_size > aligned_size; aligned_size *= 2) {}

    return aligned_size;
}

int get_min_thread_group_size(int slm_per_group)
{
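    slm_per_group = align_slm_frame_size(slm_per_group);

    // A non-positive size is treated as "no SLM used", so SLM does not
    // constrain the group size. Returning early also avoids a division
    // by zero below.
    if (slm_per_group <= 0) {
        return 1;
    }
    int threads_per_ss = _GFX_get_device_subslice_thread_count();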

    if (threads_per_ss <= 0) {
        printf("_GFX_get_device_subslice_thread_count failed\n");
        exit(-1);
    }
    int max_groups = std::max(1, c_max_slm_per_ss/slm_per_group);
    int group_size = std::min(c_max_group_size, (int)std::ceil((float)threads_per_ss/max_groups));

    return group_size;
}
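
To trace the arithmetic above without a device, the following sketch repeats the same calculation with the per-subslice thread count passed in as a parameter. The function names are hypothetical, and the value 56 used in the walk-through is an assumed figure, not a documented one. Integer ceiling division replaces the floating-point std::ceil call; for positive operands the result is the same.

```cpp
#include <algorithm>  // std::max, std::min

const int k_max_slm_per_ss = 1024 * 64;  // SLM bytes per subslice (from the sample above)
const int k_max_group_size = 64;         // maximum thread group size (from the sample above)

// Rounds a positive SLM size up to the next chunk of 1024 * 2^n bytes.
// Sizes at or above the per-subslice limit are returned unchanged.
int round_up_to_slm_chunk(int bytes)
{
    if (bytes >= k_max_slm_per_ss || bytes <= 0) {
        return bytes;
    }
    int chunk = 1024;
    while (bytes > chunk) {
        chunk *= 2;
    }
    return chunk;
}

// Host-only variant of get_min_thread_group_size: threads_per_ss is supplied
// by the caller instead of _GFX_get_device_subslice_thread_count().
// Returns -1 when either argument is not positive.
int min_group_size_for(int slm_per_group, int threads_per_ss)
{
    if (slm_per_group <= 0 || threads_per_ss <= 0) {
        return -1;
    }
    int chunk = round_up_to_slm_chunk(slm_per_group);
    int max_groups = std::max(1, k_max_slm_per_ss / chunk);
    int needed = (threads_per_ss + max_groups - 1) / max_groups;  // ceiling division
    return std::min(k_max_group_size, needed);
}
```

For a kernel that uses 3000 bytes of __thread_group_local data, the size is rounded up to a 4096-byte chunk, so at most 16 groups fit in the 64 KB of a subslice. With an assumed 56 hardware threads per subslice, each group then needs at least 4 threads (56 / 16 = 3.5, rounded up).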