Intel® VTune™ Amplifier
Use the GPU In-kernel Profiling to analyze GPU kernel execution per code line and identify performance issues caused by memory latency or inefficient kernel algorithms.
GPU In-Kernel Profiling is temporarily removed from Intel VTune Amplifier 2019 Update 3 to address some defects.
The GPU In-kernel Profiling instruments your code and, depending on your configuration settings, helps identify performance-critical basic blocks or issues caused by memory accesses in the GPU kernels.
GPU In-kernel Profiling introduces the following key metrics:
Estimated GPU Cycles: The average number of GPU cycles per kernel instance.
GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.
GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per kernel instance.
GPU In-kernel Profiling is available on processors based on Intel® microarchitecture code name Broadwell and later.
Since GPU In-kernel Profiling incurs higher performance overhead than the GPU Compute/Media Hotspots analysis, consider first running the GPU Compute/Media Hotspots analysis to identify the hottest GPU computing task (GPU kernel) and then exploring that kernel with GPU In-kernel Profiling.
To run the GPU In-kernel Profiling analysis:
Prerequisites: Create a project and specify an analysis target and system.
Click the Configure Analysis button on the Intel® VTune™ Amplifier toolbar. The New Amplifier Result tab opens.
From the HOW pane, click the Browse button and select Platform Analysis > GPU In-kernel Profiling.
From the Profiling mode drop-down menu, select the type of issues you want to analyze:
Basic blocks latency option helps you identify issues caused by algorithm inefficiencies. In this mode, VTune Amplifier measures the execution time of all basic blocks. A basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end (see the illustrative sketch after these options). During post-processing, VTune Amplifier calculates the execution time for each instruction in the basic block, so this mode helps you understand which compute instructions are the most expensive. See an example.
Memory latency option helps identify latency issues caused by memory accesses. In this mode, VTune Amplifier profiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option if you ran the GPU Compute/Media Hotspots analysis, identified that the GPU kernel is throughput- or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take the most time. See an example.
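For reference, the following minimal kernel illustrates how a branch divides kernel code into basic blocks that the Basic blocks latency mode times separately. The scale kernel and its threshold argument are hypothetical and are not part of the examples in this topic:

__kernel void scale(__global float* data, float threshold)
{
    int gid = get_global_id(0);    // basic block 1: straight-line code up to the branch
    float v = data[gid];
    if (v > threshold)
    {
        v = v * 0.5f;              // basic block 2: executed on the taken path
    }
    else
    {
        v = v + 1.0f;              // basic block 3: executed on the alternative path
    }
    data[gid] = v;                 // basic block 4: code after the paths merge
}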
Optionally, to narrow the analysis down to specific kernels and minimize the overhead, specify the kernels of interest to profile. If required, modify the Instance step for each kernel, which is a sampling interval expressed in the number of kernel invocations (for example, with an Instance step of 10, only every tenth invocation of the kernel is profiled). This option helps reduce profiling overhead.
Click Start to run the analysis.
By default, the GPU In-kernel Profiling result opens in the GPU Compute/Media Hotspots viewpoint. Start with the Summary window to identify the hottest GPU computing task, then click it to navigate to the Graphics window and explore the metrics collected for this hotspot:
Double-clicking the hot kernel in the Graphics window opens its source code:
The GPU In-kernel Profiling provides a full-scale analysis of the kernel source per code line. The hottest kernel code line is highlighted by default.
To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly view:
You have a kernel that performs compute operations:
__kernel void viete_formula_comp(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = sum + p;
        c = (t - sum) - p;
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}
To compare these operations, run GPU In-kernel Profiling in the Basic blocks latency mode and double-click the kernel in the grid to open the Source view:
The Source view analysis highlights the pown() call as the most expensive operation in this kernel.
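One possible follow-up, shown here only as a sketch, is to replace the pown() call with a running power of three that is updated incrementally on each iteration and then rerun the Basic blocks latency analysis to verify the change. The viete_formula_comp_opt name and this particular rewrite are illustrative assumptions, not part of the original example:

__kernel void viete_formula_comp_opt(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;
    float pow3 = 1.f;              // running value of 3^i, replaces the pown(3.f, i) call
    for (unsigned i = 0; i < 50; ++i)
    {
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pow3;
        pow3 *= 3.f;               // advance to 3^(i+1) for the next iteration
        p -= c;
        float t = sum + p;         // compensated summation, as in the original kernel
        c = (t - sum) - p;
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}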
You have a kernel that performs several memory reads (the data[gid] accesses inside the loop and in the final statement):
__kernel void viete_formula_mem(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = data[gid] + p;
        c = (t - data[gid]) - p;
        data[gid] = t;
    }
    data[gid] *= sqrt(12.f);
}
To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode:
The Source view analysis shows that the compiler recognizes that each thread works only with its own element of the input buffer and generates code that performs the read only once. The value from the input buffer is kept in a register and reused in subsequent operations, so the compiler does not generate additional reads.
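To make this behavior explicit, the kernel could be rewritten by hand so that it performs a single read at the start and a single write at the end, keeping all intermediate values in a private variable that maps to a register. This is a sketch of an equivalent manual rewrite, not a description of the code the compiler actually generates, and the viete_formula_mem_cached name is hypothetical:

__kernel void viete_formula_mem_cached(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;
    float sum = data[gid];         // single read from global memory
    for (unsigned i = 0; i < 50; ++i)
    {
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        float t = sum + p;
        c = (t - sum) - p;
        sum = t;                   // intermediate results stay in the private variable
    }
    data[gid] = sum * sqrt(12.f);  // single write back to global memory
}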