Intel® VTune™ Amplifier 201
Analysis on embedded platforms and accelerators:
New CPU/FPGA Interaction analysis (PREVIEW) to assess the balance between the CPU and FPGA on systems with a discrete Intel® Arria® 10 FPGA running OpenCL™ applications
New Graphics Rendering analysis (PREVIEW) to assess CPU/GPU utilization of your code running on the Xen* virtualization platform installed on a remote embedded target
Support for the sampling command-line analysis on remote QNX* embedded systems over an Ethernet connection
A PREVIEW FEATURE may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please send your feedback to parallel.studio.support@intel.com or to intelsystemstudio@intel.com.
HPC workload profiling improvements:
CPU Utilization metric refined to differentiate utilization of logical vs. physical cores, which is particularly important for HPC applications running on processors in the Intel® Xeon® processor family
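To illustrate why the logical vs. physical distinction matters, here is a minimal sketch. The formula, the function name cpu_utilization, and the core counts are assumptions chosen for illustration; this is not the formula VTune Amplifier uses internally.

```python
def cpu_utilization(cpu_time_s, elapsed_s, core_count):
    """Average fraction of `core_count` cores kept busy during the run.

    Simplified illustration, not the VTune Amplifier formula. The result
    is capped at 1.0, which is itself a simplification.
    """
    if elapsed_s <= 0 or core_count <= 0:
        raise ValueError("elapsed_s and core_count must be positive")
    return min(1.0, cpu_time_s / (elapsed_s * core_count))


# Hypothetical node: 16 physical cores, 2 hardware threads per core.
PHYSICAL_CORES = 16
LOGICAL_CORES = PHYSICAL_CORES * 2

# Hypothetical run: 16 threads, each fully busy for 10 seconds.
cpu_time_s = 16 * 10.0
elapsed_s = 10.0

by_logical = cpu_utilization(cpu_time_s, elapsed_s, LOGICAL_CORES)
by_physical = cpu_utilization(cpu_time_s, elapsed_s, PHYSICAL_CORES)
print(f"by logical cores:  {by_logical:.0%}")
print(f"by physical cores: {by_physical:.0%}")
```

In this hypothetical run, an application that pins one thread to each physical core reports 50% utilization when measured against logical cores and 100% when measured against physical cores, even though no physical core is idle.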
Managed runtime analysis improvements:
Extended JIT profiling for server-side applications running on the LLVM* or HHVM* PHP servers to support the event-based sampling analysis in the attach mode
Extended Java* code analysis with support for OpenJDK* 9 and Oracle* JDK 9
Enabled Advanced Hotspots analysis for .NET* Core applications on Linux and Windows systems in the Launch Application mode
Application Performance Snapshot improvements:
Added the ability to pause and resume collection with MPI_Pcontrol and the ITT API. The new -start-paused option excludes application execution from collection from the start of the run until collection is first resumed.
Enabled selection of which data types are collected to reduce overhead. The choices include MPI tracing, OpenMP tracing, hardware counter based collection, or a combination of the three.
Exposed the CPU Utilization metric by physical cores on processors that support proper hardware events.
Significantly reduced MPI tracing overhead when there are a large number of ranks.
Enriched the MPI statistics generated by the aps-report utility with information about the communicators used in the application and with the ability to group and filter collective operations by communicator.
Improved integration with Intel® Trace Analyzer and Collector by adding the ability to generate profiling configuration files with the aps-report option.
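The pause/resume support listed above can be sketched as follows. This is a minimal illustration that assumes the application uses the mpi4py binding (which exposes MPI_Pcontrol as MPI.Pcontrol) and that the collector follows the MPI standard convention of level 0 meaning profiling disabled and level 1 meaning profiling enabled. The helper names collection_enabled and run are hypothetical, and a no-op stand-in is used when mpi4py is not installed.

```python
from contextlib import contextmanager


def _load_pcontrol():
    """Return MPI.Pcontrol from mpi4py, or a no-op stand-in if it is missing."""
    try:
        from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use
        return MPI.Pcontrol
    except ImportError:
        return lambda level: None


@contextmanager
def collection_enabled(pcontrol):
    """Resume collection on entry and pause it again on exit."""
    pcontrol(1)  # level 1: profiling enabled at the default level of detail
    try:
        yield
    finally:
        pcontrol(0)  # level 0: profiling disabled


def run(pcontrol=None):
    """Measure only the region of interest, not the setup phase."""
    pcontrol = pcontrol or _load_pcontrol()
    pcontrol(0)                    # pause before setup
    data = list(range(1000))       # setup work that should not be measured
    with collection_enabled(pcontrol):
        total = sum(x * x for x in data)  # region of interest
    return total


if __name__ == "__main__":
    print(run())
```

Launching the collection with the -start-paused option should make the initial pause call unnecessary, since collection then stays off until the first resume.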
Quality and usability improvements:
Hardware event-based analysis supported for targets running in the Hyper-V* environment on Windows* 10 Fall Creators Update (RedStone3)
Support for new operating systems and IDEs including:
Fedora*
Ubuntu* 17.10
HPC workload profiling improvements:
Application Performance Snapshot extended to use the VTune Amplifier sampling driver and Perf* system-wide profiling capability for reducing collection overhead and enabling Average DRAM and MCDRAM bandwidth measurement
Application Performance Snapshot's MPI tracing extended to cover applications using MPI_Abort
GPU analysis improvements:
GPU Hotspots analysis extended to analyze FPU-bound OpenCL™ applications and identify the causes of low occupancy
Quality and usability improvements:
Improved accuracy of the Perf*-based driverless sampling collection running on a target system under the Xen hypervisor by enabling use of the integrated Perf sampling interval
Better management of the EBS collection result size via configuration of the CPU sampling interval. Increasing the sampling interval may be useful for profiles with long durations or profiles that create large results. The Duration time estimate option is deprecated.
Optimized support for performance profiling on embedded devices with the Yocto Project* without prerequisite installation of the Intel System Studio or a complete version of the VTune Amplifier.
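To see why a larger CPU sampling interval helps keep EBS results manageable, consider this rough model. It assumes one sample per busy CPU per interval and treats result size as roughly proportional to the sample count; the function name estimated_samples and the numbers are hypothetical, and real result sizes also depend on stacks, events, and other collected data.

```python
def estimated_samples(duration_s, interval_ms, busy_cpus):
    """Rough sample count: one sample per busy CPU per sampling interval."""
    if duration_s < 0 or interval_ms <= 0 or busy_cpus < 0:
        raise ValueError("invalid duration, interval, or CPU count")
    return round(duration_s * 1000 / interval_ms) * busy_cpus


# Hypothetical one-hour profile on 8 busy CPUs.
at_1_ms = estimated_samples(3600, 1, 8)
at_10_ms = estimated_samples(3600, 10, 8)
print(f"1 ms interval:  {at_1_ms:,} samples")
print(f"10 ms interval: {at_10_ms:,} samples")
```

Under this model, raising the interval from 1 ms to 10 ms cuts the sample count by a factor of ten, at the cost of coarser time resolution.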
The new VTune Amplifier 2018 product combines features originally provided by Intel VTune Amplifier XE and Intel VTune Amplifier for Systems, and introduces the following new options for both host-based and embedded remote target analysis:
Application Performance Snapshot, which provides a quick look at your application performance and helps you understand where your application will benefit from tuning.
Performance metrics include MPI and OpenMP* parallelism, memory access, FPU utilization, and I/O efficiency with recommendations on further in-depth analysis.
New MPI metrics that help identify top 5 MPI functions by average consumed time and that show resident and virtual memory footprints per MPI rank and per compute node.
Support for multiple platforms, including Intel® Xeon® processors code named Skylake.
Performance analysis for targets (native and Java* services and daemons) continuously running in LXC*, Docker* and Mesos* containers via the Attach profiling mode and the Advanced Hotspots analysis
HPC workload profiling improvements:
Enhanced MPI metrics for HPC Performance Characterization analysis that expose scalability bottlenecks for hybrid applications
Summary view extended to show top 5 OpenMP* hotspots (functions and loops) executing serially in the master thread outside any parallel regions
Improved insight into parallelism inefficiencies for applications using Intel Threading Building Blocks (Intel TBB) with extended classification of high Overhead and Spin time
Increased detail and structure for vector efficiency metrics based on FLOP counters in the FPU Utilization section
New MPI Imbalance metric based on MPI Busy Wait time and parallel efficiency for the most awaited rank in the CPU Utilization section
New section presenting the data on the hottest loops and functions with arithmetic operations, which enables you to identify which loops/functions with FPU Usage took the most CPU Time
Optimized command line analysis flow for the hpc-performance analysis with a summary report that shows metrics for CPU, Memory and FPU performance aspects, including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use the new report-knob show-issues option.
Microarchitecture analysis improvements:
Full-scale driverless Memory Access analysis that now provides Average Latency statistics
Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots
Detailed presentation of bandwidth bottlenecks with the Memory Access summary command line report that now includes new metrics on bandwidth utilization, such as the platform maximum bandwidth, maximum bandwidth observed during analysis, average bandwidth utilization, and % of Elapsed Time with high bandwidth utilization
More accurate DRAM Bandwidth Bound metric based on uncore events used to display memory usage statistics for the Memory Access and HPC Performance Characterization analyses
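The bandwidth utilization metrics named above for the Memory Access summary report can be pictured with a small sketch. It assumes bandwidth samples taken over equal time slices and a "high utilization" cutoff of 70% of the platform maximum; the cutoff, the function name bandwidth_summary, and the sample values are assumptions for illustration, not the thresholds or data VTune Amplifier uses.

```python
def bandwidth_summary(samples_gbs, platform_max_gbs, high_fraction=0.7):
    """Summarize bandwidth samples taken over equal time slices.

    `high_fraction` is an assumed cutoff for "high" utilization.
    """
    if not samples_gbs:
        raise ValueError("at least one bandwidth sample is required")
    cutoff = high_fraction * platform_max_gbs
    high_slices = sum(1 for s in samples_gbs if s >= cutoff)
    return {
        "platform_max_gbs": platform_max_gbs,
        "observed_max_gbs": max(samples_gbs),
        "average_gbs": sum(samples_gbs) / len(samples_gbs),
        "high_time_pct": 100.0 * high_slices / len(samples_gbs),
    }


# Hypothetical DRAM bandwidth samples in GB/s on a 100 GB/s platform.
summary = bandwidth_summary([10, 20, 80, 90], platform_max_gbs=100)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A high share of elapsed time near the platform maximum, as in the last two slices of this hypothetical run, is the pattern that suggests a bandwidth bottleneck.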
Managed runtime analysis improvements:
New Memory Consumption analysis for native and Python* Linux targets that monitors RAM consumption over time and helps identify memory objects allocated and released within the analysis run
Support for the mixed Python* and native code in the Locks and Waits analysis including call stack collection
GPU analysis improvements:
GPU Hotspots analysis extended to detect the hottest computing tasks bound by GPU L3 bandwidth or DRAM bandwidth
New GPU In-kernel Profiling that helps analyze GPU kernel execution per code line and identify performance issues caused by memory latency or inefficient kernel algorithms
GPU Hotspots Summary view extended to provide the Packet Queue Depth and Packet Duration histograms for the analysis of DMA packet execution
New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris® Graphics. This group combines metrics from the Overview and Compute Basic presets and lets you see all detected GPU stalled/idle issues in the same view.
Support for performance analysis of a guest Linux* operating system via Kernel-based Virtual Machine (KVM) from a Linux host system with the KVM Guest OS option
Profiling Guided Optimization report generated with the amplxe-pgo-report.sh utility for the Intel® C++ compiler (Linux* only), GCC* and Clang* compilers to improve code optimization
Usability improvements:
New user-friendly GUI design for the Timeline pane, analysis and target configuration windows
Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view
Automated installation of the VTune Amplifier collectors on a remote Linux target system. This feature is helpful if you profile a target on a shared resource without VTune Amplifier installed or on an embedded platform where targets may be reset frequently.
Documentation improvements:
VTune Amplifier product help, tutorials, and Release Notes are available online only from the Intel Software Documentation Library in the Intel Developer Zone (IDZ). You can also download an offline version of the product help either from IDZ or from the Intel Software Development Products Registration Center.
New Find Your Analysis guide that helps you pick a starting point for analysis based on your use case. The guide is available both online (from the product help) and offline (from the product Welcome page).
New performance analysis cookbook that contains recipes for identifying and solving the most common performance problems with the help of VTune Amplifier's analysis types
Support for new Intel processors including:
Intel® Xeon Phi™ processors (code name Knights Landing and Knights Mill)
Intel® Xeon® Processor Scalable family
Intel® Atom™ processors codenamed Apollo Lake and Denverton
Intel processors codenamed Kaby Lake
A full list of supported processors is available from the Release Notes.
Support for new operating systems and IDEs including:
Ubuntu* 17.04
Fedora* 26
Debian* 9.0
Microsoft Windows* 10 Creators Update (RS2)
Microsoft Visual Studio* 2017
Support for cross-OS analysis extended to all license types. Download installation packages for additional operating systems from registrationcenter.intel.com.
A full list of supported operating systems is available from the Release Notes.