Intel® VTune™ Amplifier
Use the Threading analysis to identify how efficiently an application uses available processor compute cores and explore inefficiencies in threading runtime usage or contention on synchronization objects that prevent effective processor utilization.
Threading analysis combines and replaces the Concurrency and Locks and Waits analysis types available in previous versions of Intel® VTune™ Amplifier.
One common problem in parallel applications is threads waiting too long on synchronization objects (locks) that are in the critical path of application execution. Performance suffers when waits occur while cores are under-utilized. Threading analysis also shows how much time the application spends in threading runtimes, either in busy waits or in the overhead of arranging parallel work.
Threading analysis uses user-mode sampling and tracing collection. With this analysis you can estimate the impact each synchronization object has on the application and understand how long the application had to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.
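The idea of attributing wait time to individual synchronization objects can be illustrated with a minimal Python sketch. This is not how VTune Amplifier collects its data. The `TimedLock` wrapper, the `run_demo` function, and the lock names are hypothetical and exist only to show what per-object wait time means.

```python
import threading
import time


class TimedLock:
    """Hypothetical lock wrapper that records how long callers wait to acquire it."""

    def __init__(self, name, totals, guard):
        self.name = name
        self._lock = threading.Lock()
        self._totals = totals  # shared dict: lock name -> total wait in seconds
        self._guard = guard    # protects the shared dict

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()  # blocks while another thread holds the lock
        waited = time.perf_counter() - start
        with self._guard:
            self._totals[self.name] = self._totals.get(self.name, 0.0) + waited
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def run_demo(n_threads=4, hold_seconds=0.02):
    """Run several threads against one contended lock and return wait totals."""
    totals = {}
    guard = threading.Lock()
    hot = TimedLock("hot_lock", totals, guard)    # shared by all worker threads
    cold = TimedLock("cold_lock", totals, guard)  # only ever taken uncontended

    def contended_work():
        with hot:
            time.sleep(hold_seconds)  # simulate work done while holding the lock

    threads = [threading.Thread(target=contended_work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with cold:
        pass  # acquired with no competition, so the wait is close to zero

    return totals


if __name__ == "__main__":
    for name, seconds in sorted(run_demo().items(), key=lambda kv: -kv[1]):
        print(f"{name}: waited {seconds:.3f} s in total")
```

In this sketch the contended lock accumulates far more wait time than the uncontended one, which is the kind of difference the analysis is meant to surface per synchronization object.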
There are two groups of synchronization objects supported by the Intel® VTune™ Amplifier:
objects usually used for synchronization between threads, such as Mutex or Semaphore
objects associated with waits on I/O operations, such as Stream
To configure options for the Threading analysis:
Prerequisites: Create a project and specify an analysis target.
Click the Configure Analysis button (standalone GUI or Visual Studio IDE) on the Intel® VTune™ Amplifier toolbar.
The Configure Analysis window opens.
In the HOW pane, click the Browse button and select Threading.
Configure the collection options, including the sampling interval.
You may generate the command line for this configuration using the Command Line button at the bottom.
Click the Start button to run the analysis.
The Threading analysis results appear in the Threading Efficiency viewpoint, which consists of the following windows/panes:
Summary window displays statistics on the overall application execution, identifying CPU time and processor utilization.
Bottom-up window displays hotspot functions in the bottom-up tree, CPU time and CPU utilization per function.
Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only (Self value) and for a function and its children together (Total value).
Caller/Callee window displays parent and child functions of the selected focus function.
Platform window provides details on CPU and GPU utilization, frame rate, memory bandwidth, and user tasks (if corresponding metrics are collected).
Start with the Summary window to explore the CPU utilization of your application and identify reasons for underutilization connected with synchronization or parallel work arrangement overhead. Click the links associated with flagged issues to see more detailed information. For example, clicking a sync object name in the Top Waiting Objects table takes you to that object in the Bottom-up window.
Analyze thread interaction on synchronization objects using wait and signal stacks and transitions on the timeline. Explore CPU time spent in threading runtimes to classify inefficiencies in their use.
Modify your code to remove CPU utilization bottlenecks and improve the parallelism of your application.
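One possible way to reduce contention on a hot lock is to let each thread accumulate results privately and take the shared lock only once to merge them. The following Python sketch compares the two approaches by counting lock acquisitions. The function names and the counting workload are hypothetical and chosen only for illustration; whether this pattern fits depends on your application.

```python
import threading


def _run_threads(n_threads, target):
    """Start n_threads threads running target and wait for all of them."""
    threads = [threading.Thread(target=target) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_with_shared_lock(n_threads, n_items):
    """Every increment takes the shared lock: many acquisitions, high contention."""
    state = {"total": 0, "acquisitions": 0}
    lock = threading.Lock()

    def work():
        for _ in range(n_items):
            with lock:
                state["total"] += 1
                state["acquisitions"] += 1

    _run_threads(n_threads, work)
    return state["total"], state["acquisitions"]


def count_with_local_merge(n_threads, n_items):
    """Each thread counts privately, then takes the shared lock once to merge."""
    state = {"total": 0, "acquisitions": 0}
    lock = threading.Lock()

    def work():
        local_total = 0
        for _ in range(n_items):
            local_total += 1  # no lock needed: the variable is private to this thread
        with lock:
            state["total"] += local_total
            state["acquisitions"] += 1

    _run_threads(n_threads, work)
    return state["total"], state["acquisitions"]


if __name__ == "__main__":
    print(count_with_shared_lock(4, 1000))
    print(count_with_local_merge(4, 1000))
```

Both functions compute the same total, but the second takes the lock once per thread rather than once per item, so threads spend less time waiting on it.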
Concentrate your tuning on objects with long Wait time where the system is poorly utilized (red bars) during the wait. Consider adding parallelism, rebalancing, or reducing contention. Ideal utilization (green bars) occurs when the number of running threads equals the number of available logical cores.
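The prioritization described above can be sketched in Python as a small ranking routine. This is an illustrative model only, not VTune Amplifier's implementation: the function names, the `poor_fraction` threshold, the intermediate "ok" and "over" labels, and the sample format are all assumptions. Only the rule that ideal utilization means running threads equal logical cores comes from the text.

```python
import os


def classify_utilization(running_threads, logical_cores=None, poor_fraction=0.5):
    """Label a moment in time by how many threads run versus available cores.

    poor_fraction is a hypothetical threshold, not a VTune Amplifier default.
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    if running_threads == 0:
        return "idle"
    if running_threads > logical_cores:
        return "over"
    if running_threads == logical_cores:
        return "ideal"
    if running_threads <= poor_fraction * logical_cores:
        return "poor"
    return "ok"


def rank_wait_objects(samples, logical_cores):
    """Rank sync objects by wait time accumulated while utilization was poor or idle.

    samples is a list of (object_name, wait_seconds, running_threads) tuples,
    a made-up format used only for this sketch.
    """
    poorly_utilized_wait = {}
    for name, wait_seconds, running_threads in samples:
        label = classify_utilization(running_threads, logical_cores)
        if label in ("idle", "poor"):
            poorly_utilized_wait[name] = poorly_utilized_wait.get(name, 0.0) + wait_seconds
    return sorted(poorly_utilized_wait.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    samples = [
        ("mutex_a", 2.0, 1),   # long wait while only 1 of 8 cores was busy
        ("mutex_a", 1.0, 8),   # wait during ideal utilization: not counted
        ("stream_b", 0.5, 2),
        ("sem_c", 3.0, 8),     # long wait, but all cores were busy
    ]
    for name, seconds in rank_wait_objects(samples, logical_cores=8):
        print(f"{name}: {seconds:.1f} s of wait under poor utilization")
```

In this model a long wait counts toward an object's ranking only when cores were underused at the time, mirroring the advice to focus on waits that coincide with poor utilization.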
Re-run the analysis and use the comparison mode to verify your optimization and identify further areas for improvement.