Analyzing OpenCL™ Applications
Once the analysis completes, you should see a view similar to the following. If you click “Bottom-up” and then choose the grouping as selected below, you will be ready to start tuning your application.
Also consider the following:
- For the overall activity (aggregated in the “CPU Time” chart in the figure), zoom in and filter to the area of actual kernel execution, shown as the largest red rectangle in the figure for the example analysis above.
- The time spent in the mic_server process consists of:
  - [Dynamic Code], which constitutes the kernels.
  - Intel® Threading Building Blocks (Intel® TBB) costs (threading).
  - The SVML vector math library, which is responsible for most heavy built-ins, such as math.
  - Other functions, for example, Linux* OS kernel routines inside vmlinux.
Inspect the same trace for the top hotspots over all modules, assuming that you have already filtered by the mic_server process. To do so, switch to the Top-down Tree view:
Here you get the top list of hotspots from all modules. In this example, most hotspots belong to dynamic code (notice that specific kernel names are listed in the call stack). There is also some contribution from the Intel TBB library, and finally some heavy math (__ocl_svml_b2_sqrt) that is attributed to code from the SVML module.
In general, seeing many entries for Intel TBB in the hotspots breakdown might indicate inefficiency in work-group scheduling, for example, too few work-groups, or work-groups that are too lightweight. Refer to the “Threading: Achieving Parallelism Between Work-Groups” section for more information.
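For example, here is a minimal host-side sketch, assuming a command queue and kernel created elsewhere (the helper name enqueue_1d and its arguments are illustrative, not part of the original example). Passing NULL for the local work size lets the runtime choose the work-group size, which helps avoid creating too few or too lightweight work-groups:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Illustrative helper: enqueue a 1D kernel over 'count' work-items.
     * Passing NULL for local_work_size lets the OpenCL runtime pick the
     * work-group size instead of forcing a granularity by hand. */
    static cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel, size_t count)
    {
        const size_t global_size = count;            /* total work-items */
        cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                            1,             /* work_dim */
                                            NULL,          /* global_work_offset */
                                            &global_size,  /* global_work_size */
                                            NULL,          /* local_work_size: runtime chooses */
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
        return err;
    }

Keeping the global size large relative to the work-group size gives the scheduler enough independent work-groups to balance across threads.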
If you click a specific kernel, you can inspect the resulting assembly code. This is useful for locating expensive instructions, for example (illustrative kernel sketches for several of these cases follow the list):
- SVML calls for heavy math built-ins (a candidate for the native or relaxed math experiment). Refer to the “Use Lower Math Precision” section.
- Specific prefetching instructions that are costly according to the trace results, which might indicate inefficiency in the compiler-generated prefetches. See the “Utilizing Software Prefetching” section for more information.
- Gather and scatter instructions in the instruction hotspots, which likely indicate that your data layout and access pattern are not optimal. See the “Efficient Data Layout” section for more information.
- Masked instructions in the instruction hotspot regions, which might indicate that your code suffers from divergent branches and the associated penalties. See the “Use Branching Accurately” section for more information.
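For instance, if __ocl_svml_b2_sqrt dominates the profile, a quick experiment is to try the native_ variant of the built-in in the kernel, or to build the whole program with the -cl-fast-relaxed-math option of clBuildProgram, and then verify that the accuracy is still acceptable. A minimal sketch with an illustrative kernel (the kernel name and arguments are not from the original example):

    /* Illustrative kernel: sqrt() goes through the heavy SVML path,
     * while native_sqrt() maps to a faster, lower-precision sequence. */
    __kernel void root(__global const float *in, __global float *out)
    {
        int i = get_global_id(0);
        /* Precise form: out[i] = sqrt(in[i]); */
        out[i] = native_sqrt(in[i]);   /* relaxed-precision experiment */
    }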
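If the compiler-generated prefetches look inefficient in the trace, you can experiment with the OpenCL C prefetch() built-in to hint the next block of global data before it is consumed. An illustrative sketch (the kernel and the BLOCK tile size are assumptions, and the input buffer is assumed to be padded by one extra tile):

    #define BLOCK 16
    /* Illustrative kernel: each work-item processes one tile and hints
     * the next tile into the cache while the current one is computed. */
    __kernel void scale_tiled(__global const float *in, __global float *out, float a)
    {
        int i = get_global_id(0) * BLOCK;
        prefetch(&in[i + BLOCK], BLOCK);   /* hint the next tile */
        for (int j = 0; j < BLOCK; ++j)
            out[i + j] = a * in[i + j];
    }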
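Gather and scatter instructions often come from an array-of-structures layout, where neighboring work-items touch memory with a large stride. An illustrative contrast with a structure-of-arrays layout (both kernels are assumptions made for the sake of the example):

    /* Array-of-structures (AoS): consecutive work-items read .x with a
     * stride of sizeof(body), which tends to compile to gather/scatter. */
    typedef struct { float x, y, z, w; } body;

    __kernel void shift_aos(__global body *b, float dx)
    {
        int i = get_global_id(0);
        b[i].x += dx;
    }

    /* Structure-of-arrays (SoA): consecutive work-items read consecutive
     * floats, so the compiler can emit unit-stride vector loads/stores. */
    __kernel void shift_soa(__global float *x, float dx)
    {
        int i = get_global_id(0);
        x[i] += dx;
    }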
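Heavy masking frequently comes from divergent branches. When a branch only chooses between two cheap values, rewriting it with the select() built-in (or a ternary expression) keeps all work-items on a single code path, which is one way to reduce the cost of divergence. An illustrative sketch:

    /* Illustrative kernel: the commented if/else diverges across
     * work-items and forces masked execution of both paths; select()
     * expresses the same choice without a branch. */
    __kernel void clamp_low(__global float *data, float lo)
    {
        int i = get_global_id(0);
        float v = data[i];
        /* Divergent form:
         *   if (v < lo) data[i] = lo; else data[i] = v;
         */
        data[i] = select(v, lo, v < lo);   /* branch-free form */
    }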
See Also
Threading: Achieving Parallelism Between Work-Groups
Utilizing Software Prefetching
Efficient Data Layout
Use Lower Math Precision
Use Branching Accurately
Developer Guide for Intel® SDK for OpenCL™ Applications
Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2
Intel® Xeon Phi™ Processor Targets