Intel® VTune™ Amplifier 201

Custom Analysis Options - Hardware Event-based Sampling

If you create a copy of a predefined analysis type based on the hardware event-based sampling collection, the Intel® VTune™ Amplifier provides options applicable to this collection and by default turns on the options enabled for the original predefined analysis type.

The following hardware event-based sampling collection options are available for a custom configuration:

Use This

To Do This

Analyze I/O waits check box

Analyze the percentage of time each thread and CPU spends in I/O wait state.

Collect I/O API data menu

Choose whether to collect information about I/O calls and related call stacks. This analysis option helps identify where threads are waiting or enables you to compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.

Collect stacks check box

Enable advanced collection of call stacks and thread context switches to analyze performance, parallelism, and power consumption per execution path.

Stack size, in bytes field

Specify the size of a raw stack (in bytes) to process. Zero value means unlimited size. Possible values are numbers between 0 and 2147483647.

Stack type

Choose between software stack and hardware LBR-based stack types. Software stacks have no depth limitations and provide more data, while hardware stacks introduce less overhead. Typically, the software stack type is recommended unless the collection overhead becomes significant. Note that the hardware LBR stack type may not be available on all platforms.

Estimate call counts check box

Obtain statistical estimation of call counts based on the hardware events.

Estimate trip counts check box

Obtain statistical estimation of loop trip counts based on the hardware events.

Events table

  • Specify hardware events to collect using the check boxes in the first column. By default, the table lists all events available for the target platform with events used for the original analysis configuration pre-selected. You may use the Search functionality to find events of interest. To get more details on an event, select it in the table and click the Explain button.

  • Modify the Sample After value for an event to control the number of events after which the VTune Amplifier interrupts the event data collection. The Sample After value depends on the target duration. Based on the duration value, the VTune Amplifier adjusts the Sample After value with a multiplier.
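The effect of a Sample After value can be estimated with simple arithmetic: one sample is taken each time the event has occurred that many times. The sketch below is illustrative only. The helper names and the numbers (a clockticks-style event on a core assumed to run at 2 GHz) are assumptions, and it does not model the duration-based multiplier that the VTune Amplifier applies.

```python
def seconds_between_samples(sample_after: int, events_per_second: float) -> float:
    """Average time between two samples of one event (hypothetical helper)."""
    if sample_after <= 0 or events_per_second <= 0:
        raise ValueError("both arguments must be positive")
    return sample_after / events_per_second


def expected_samples(total_events: int, sample_after: int) -> int:
    """Samples recorded when one sample is taken every `sample_after` events."""
    if sample_after <= 0:
        raise ValueError("sample_after must be positive")
    return total_events // sample_after


# Assumed numbers: the event fires 2e9 times per second, Sample After is 2,000,000.
interval_s = seconds_between_samples(2_000_000, 2e9)
# Assumed workload: 20 billion occurrences of the event in total.
sample_count = expected_samples(20_000_000_000, 2_000_000)
print(interval_s, sample_count)
```

A smaller Sample After value yields more samples and finer resolution at the cost of higher collection overhead. A larger value does the opposite.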

Chipset events field

Specify a comma-separated list of chipset events (up to 5 events) to monitor with the hardware event-based sampling collector.
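If the list is generated by a script, it can be checked against the five-event limit before it is pasted into the field. This is a minimal sketch: `parse_chipset_events` is a hypothetical helper, not part of the VTune Amplifier, and the event names are placeholders rather than real chipset events.

```python
from typing import List

MAX_CHIPSET_EVENTS = 5  # limit stated for the Chipset events field


def parse_chipset_events(field_value: str) -> List[str]:
    """Split a comma-separated list and enforce the five-event limit."""
    names = [name.strip() for name in field_value.split(",")]
    names = [name for name in names if name]  # ignore empty entries
    if len(names) > MAX_CHIPSET_EVENTS:
        raise ValueError(
            f"{len(names)} events given, at most {MAX_CHIPSET_EVENTS} allowed"
        )
    return names


# Placeholder names; substitute events valid for the target chipset.
events = parse_chipset_events("EVENT_A, EVENT_B ,EVENT_C")
print(",".join(events))
```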

Linux Ftrace events / Android framework events field

Use the kernel events library to select Linux Ftrace* and Android* framework events to monitor with the collector. The collected data show up as tasks in the Timeline pane. You can also apply the task grouping level to view performance statistics in the grid.

Analyze memory bandwidth check box

Collect events required to compute memory bandwidth.

Analyze PCIe bandwidth check box

Collect the events required to compute PCIe bandwidth. As a result, you will be able to analyze the distribution of the read/write operations on the timeline and identify where your application could be stalled due to approaching the bandwidth limits of the PCIe bus.

In the Device class drop-down menu, you can choose the device class for which to analyze PCIe bandwidth: processing accelerators, mass storage controller, network controller, or all device classes (default).

Note

This analysis is possible only on Intel microarchitecture code name Sandy Bridge EP and later.

Analyze memory objects check box (for Linux* targets only)

Enable the instrumentation of memory allocation/de-allocation and map hardware events to memory objects.

Analyze memory consumption check box (for Linux targets only)

Collect and analyze information about memory objects with the highest memory consumption.

Minimal memory object size to track, in bytes spin box (for Linux targets only)

Specify a minimal size of memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation.

Analyze user tasks, events, and counters check box

Analyze tasks, events, and counters specified in your code via the ITT API. This option causes a higher overhead and increases the result size.

Analyze system-wide context switches check box

Analyze detailed scheduling layout for all threads on the system and identify the nature of context switches for a thread (preemption or synchronization).

Capture transactional cycles check box

Collect the events required to analyze transactional success on Intel® processors that support Intel Transactional Synchronization Extensions (Intel TSX).

Collect precise clockticks check box

Collect the event that emulates precise clockticks and could be useful, for example, to analyze hotspots in transactions.

Evaluate max DRAM bandwidth check box

Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.
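Scaling a bandwidth metric against the measured maximum amounts to dividing the observed value by that maximum. The sketch below illustrates the idea only. The function names, the 0.7 threshold, and the sample numbers are assumptions, not the values or formulas the VTune Amplifier uses.

```python
def bandwidth_utilization(observed_gb_s: float, max_gb_s: float) -> float:
    """Fraction of the maximum achievable bandwidth that was observed."""
    if max_gb_s <= 0:
        raise ValueError("max_gb_s must be positive")
    return observed_gb_s / max_gb_s


def is_high_utilization(utilization: float, threshold: float = 0.7) -> bool:
    """Flag samples above a threshold (0.7 is a hypothetical choice)."""
    return utilization >= threshold


# Assumed numbers: 30 GB/s observed against a measured maximum of 60 GB/s.
utilization = bandwidth_utilization(30.0, 60.0)
print(utilization, is_high_utilization(utilization))
```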

CPU sampling interval, ms field

Specify an interval between collected CPU samples in milliseconds.

Uncore sampling interval, ms field

Specify an interval (in milliseconds) between uncore event samples.

Analyze GPU usage check box (for Linux targets with Intel® HD Graphics and Intel® Iris® Graphics only)

Analyze GPU usage and frame rate to identify whether your application is GPU or CPU bound.

Note

Select the Collect stacks option to detect context switches and correlate CPU and GPU usage data.

Analyze Processor Graphics hardware events drop-down menu

Analyze performance data from Intel® HD Graphics and Intel® Iris® Graphics based on the predefined groups of GPU metrics.

GPU sampling interval, us field

Specify an interval (in microseconds) between GPU samples.

Trace OpenCL and Intel Media SDK programs (Intel Graphics Driver only) check box

Capture the execution time of OpenCL™ kernels and Intel Media SDK programs on a GPU, identify performance-critical GPU tasks, and analyze the performance per GPU hardware metrics.

Note

Intel Media SDK programs analysis is supported for Linux targets only.

Analyze loops check box

Extend loop analysis to collect advanced loops information such as instruction set usage and display analysis results by loops and functions. If this option is enabled, the VTune Amplifier automatically applies the Loops and functions filtering mode to the data view in the grid and enables the Vector Instruction Set column that shows a vectorization instruction set used for a particular function, loop, and so on.

Managed runtime type to analyze menu

Choose a type of the managed runtime to analyze. Available options are:

  • for Windows targets: combined Java* and .NET* analysis

  • for Linux targets: Java-only analysis

Analyze OpenMP regions check box

Instrument the OpenMP* regions in your application to group performance data by regions/work-sharing constructs and detect inefficiencies such as imbalance, lock contention, or overhead from scheduling, reduction, and atomic operations. Using this option may cause higher overhead and increase the result size.

GPU Profiling mode drop-down menu

Select a profiling mode to identify basic block latency due to algorithm inefficiencies, or memory latency due to memory access issues. This option is typically used for GPU in-kernel profiling.

Use the table to specify the kernels of interest and narrow down the GPU in-kernel analysis to specific kernels, minimizing the collection overhead. If required, modify the instance step for each kernel, which is a sampling interval (in the number of kernels).

Event mode drop-down list

Limit event-based sampling collection to USER (user events) or OS (system events) mode. By default, all event types are collected.

Collect context switches check box

Analyze detailed scheduling layout for all threads in your application, explore time spent on a context switch, and identify the nature of context switches for a thread (preemption or synchronization).

Note

The types of the context switches (preemption or synchronization) cannot be identified if the analysis uses Perf* based driverless collection.

Use precise multiplexing check box

Enable a fine-grain event multiplexing mode that switches event groups on each sample. This mode provides more reliable statistics for applications with a short execution time. You can also consider applying the precise multiplexing algorithm if the MUX Reliability metric value for your results is low.
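A toy model shows why switching groups on each sample helps short runs. It is not the VTune Amplifier's algorithm: the round-robin order, the coarse switching granularity of 10 samples, and the function name are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List


def samples_per_group(total_samples: int, group_count: int, switch_every: int) -> List[int]:
    """Count samples each event group receives under round-robin multiplexing.

    `switch_every` is the number of samples taken before the next group is
    scheduled; 1 models switching on each sample.
    """
    counts = Counter()
    for sample_index in range(total_samples):
        group = (sample_index // switch_every) % group_count
        counts[group] += 1
    return [counts[group] for group in range(group_count)]


# A short run of 12 samples with 3 event groups.
coarse = samples_per_group(12, 3, switch_every=10)  # assumed coarse granularity
precise = samples_per_group(12, 3, switch_every=1)  # switch on each sample
print(coarse, precise)
```

In this model the coarse schedule leaves the last group with no samples at all, while per-sample switching gives every group an equal share.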

Note

You may generate the command line for this configuration using the Command Line... button at the bottom.
