# TECH : tour.TW

## Xe2 and Lunar Lake GPU Deep Dive

TAP

Intel Fellow











First time scaling the engine

2 years of software effort





Xe2

Higher utilization

Improved work distribution

Less SW overhead

# **Xe2**

Improving IP Performance Efficiency



## Xe2 Architecture Scalability





### 2nd GEN



Compute resources repartitioned in native SIMD16 engines for increased efficiency



8 512-bit Vector Engines
8 2048-bit XMX Engines

**64b atomic ops** support

192KB Shared L1\$ / SLM



### New

# Vector Engine

### SIMD16 native ALUs

Support for SIMD16 and SIMD32 ops

#### X<sup>e</sup> Matrix Extensions

Support for INT2, INT4, INT8, FP16, BF16

### Extended Math and FP64

Transcendentals: SIN, COS, LOG, EXP...

### 3-way co-issue

FP + INT/EM + XMX












New

# X<sup>e</sup> Matrix Extension Engines

FP16 2048 OPS/clock

INT8 4096 OPS/clock
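The datapath widths quoted for the Xe-core (8 512-bit Vector Engines, 8 2048-bit XMX Engines) line up with the SIMD16 and ops/clock figures if one assumes that each lane holds one element and retires one multiply-accumulate (2 ops) per clock. That mapping is an assumption made for illustration, not something the slides state; the helper names below are hypothetical.

```python
# Illustrative arithmetic only. Widths and engine counts come from the slides;
# the "one MAC per lane per clock" model is an assumption.
VECTOR_ENGINE_BITS = 512    # per vector engine
XMX_ENGINE_BITS = 2048      # per XMX engine
XMX_ENGINES_PER_CORE = 8    # XMX engines per Xe-core
OPS_PER_MAC = 2             # one multiply plus one add


def lanes(datapath_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one datapath."""
    return datapath_bits // element_bits


def xmx_ops_per_clock(element_bits: int) -> int:
    """Peak ops/clock per Xe-core under the one-MAC-per-lane assumption."""
    per_engine = lanes(XMX_ENGINE_BITS, element_bits) * OPS_PER_MAC
    return per_engine * XMX_ENGINES_PER_CORE


fp32_lanes = lanes(VECTOR_ENGINE_BITS, 32)   # matches native SIMD16
xmx_fp16 = xmx_ops_per_clock(16)             # matches "FP16 2048 OPS/clock"
xmx_int8 = xmx_ops_per_clock(8)              # matches "INT8 4096 OPS/clock"
print(fp32_lanes, xmx_fp16, xmx_int8)
```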

### Key Peak Metrics for 2<sup>nd</sup> Gen X<sup>e</sup>-core

|               | Number of<br>XVE | SIMD<br>width | MAC/lane | Depth | Ops/MAC | Ops/clock |
|---------------|------------------|---------------|----------|-------|---------|-----------|
| FP32          | 8                | 16            | 1        | 1     | 2       | 256       |
| FP16          | 8                | 16            | 2        | 1     | 2       | 512       |
| DP4a INT8     | 8                | 16            | 4        | 1     | 2       | 1024      |
| XMX FP16/BF16 | 8                | 16            | 2        | 4     | 2       | 2048      |
| XMX INT8      | 8                | 16            | 4        | 4     | 2       | 4096      |
| XMX INT4/INT2 | 8                | 16            | 8        | 4     | 2       | 8192      |
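Each Ops/clock entry in the table is the product of the other columns in its row. A minimal sketch that recomputes the column from the table's own values (the `PeakRow` class is a hypothetical helper, not part of any Intel tool):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakRow:
    """One row of the peak-metrics table for a 2nd Gen Xe-core."""
    name: str
    engines: int        # Number of XVE
    simd_width: int     # SIMD width
    mac_per_lane: int   # MAC/lane
    depth: int          # Depth
    ops_per_mac: int    # Ops/MAC

    def ops_per_clock(self) -> int:
        # Ops/clock is the product of all per-row factors.
        return (self.engines * self.simd_width * self.mac_per_lane
                * self.depth * self.ops_per_mac)


ROWS = [
    PeakRow("FP32", 8, 16, 1, 1, 2),
    PeakRow("FP16", 8, 16, 2, 1, 2),
    PeakRow("DP4a INT8", 8, 16, 4, 1, 2),
    PeakRow("XMX FP16/BF16", 8, 16, 2, 4, 2),
    PeakRow("XMX INT8", 8, 16, 4, 4, 2),
    PeakRow("XMX INT4/INT2", 8, 16, 8, 4, 2),
]

computed = {row.name: row.ops_per_clock() for row in ROWS}
for name, ops in computed.items():
    print(f"{name:14s} {ops:5d} ops/clock")
```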



Deep micro and macro analysis of all graphics acceleration functions

**Optimized** to reduce latency, remove stalls and improve HW/SW handshake





#### **Execute indirect**

Natively supported





3x vertex fetch throughput

3x mesh shading performance with vertex re-use





Out of order sampling with compressed textures

2x throughput for sampling without filtering

**Programmable offsets** 



1.5x HiZ/Z/Stencil cache

Early HiZ culling of small primitives





2x blending throughput for high granularity passes

1.33x increase in pixel color cache

Render target pre-fetch to L2\$





New 8:N compression

Fast clear for sub-resources





# New PRTU

3 Traversal pipelines

Box intersections

2 Triangle intersections







Improving IP Performance Efficiency



Relative IP performance (higher is better). Normalized to configuration and clock frequency.





Performance & efficiency optimized front to end







Native hardware support for execute indirect commands

Command Front End



Implementing

# Xe2

Lunar Lake





Deliver on key mobile experiences

Step function in efficiency

Support for latest industry standards



# Lunar

Graphics





# Lunar Lake

Graphics overview



Lunar Lake

# Xe2 GPU

Optimized for performance efficiency



## Xe2

### Lunar Lake Configuration

8 Xe-cores

64 vector engines

2 geometry pipelines

8 samplers

4 pixel backends

8 ray-tracing units

**8** MB L2\$





# Lunar Lake Xe2 GPU Performance







Lunar Lake

# GPU AI Engine

XMX

X<sup>e</sup> Matrix Extensions

67 peak INT8 TOPS
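The 67 peak INT8 TOPS figure can be related to the per-core XMX rate (4096 INT8 ops/clock) and the 8 Xe-core configuration. The slides do not state a GPU clock, so the sketch below back-calculates the clock that the quoted numbers would imply; treat the result as an approximation derived from a rounded TOPS value, not an Intel specification. Function names are hypothetical.

```python
XE_CORES = 8                       # Lunar Lake configuration, from the slides
INT8_OPS_PER_CLOCK_PER_CORE = 4096 # XMX INT8 rate per Xe-core, from the slides
QUOTED_PEAK_TOPS = 67              # quoted peak INT8 TOPS


def peak_tops(cores: int, ops_per_clock: int, clock_ghz: float) -> float:
    """Peak tera-ops per second for a given clock in GHz."""
    return cores * ops_per_clock * clock_ghz * 1e9 / 1e12


def implied_clock_ghz(tops: float, cores: int, ops_per_clock: int) -> float:
    """Clock (GHz) that would be needed to reach the quoted TOPS."""
    return tops * 1e12 / (cores * ops_per_clock) / 1e9


implied = implied_clock_ghz(QUOTED_PEAK_TOPS, XE_CORES, INT8_OPS_PER_CLOCK_PER_CORE)
print(f"implied clock: {implied:.2f} GHz")
print(f"round trip: {peak_tops(XE_CORES, INT8_OPS_PER_CLOCK_PER_CORE, implied):.1f} TOPS")
```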





Lunar Lake

# Display Engine



Display

### Lunar Lake Display Engine







# Display Engine Front End

**Decode and decrypt** 

Streaming buffer





# Display Engine Pixel Processing Pipeline

### 6 planes per pipeline

Hardware support for color conversion and composition

### Flexible and power efficient

Designed to match any input format to any output format



# Display Engine Low Power Optimized Pipeline

### Panel replay

Power gating during idle frames

### Brightness sensor with LACE

Local Adaptive Contrast Enhancement





# Display Engine Compression & Encoding

## Display stream compression

3:1 visually lossless compression
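To get a feel for what a 3:1 ratio means for link bandwidth, the sketch below computes the raw pixel data rate of a display mode and divides it by the compression ratio. The 3840x2160, 60 Hz, 30 bits-per-pixel mode is a hypothetical example chosen for illustration, and blanking intervals and link-layer overhead are ignored.

```python
def uncompressed_gbps(width: int, height: int, refresh_hz: int,
                      bits_per_pixel: int) -> float:
    """Raw active-pixel data rate in Gbit/s (no blanking, no link overhead)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def compressed_gbps(raw_gbps: float, ratio: float = 3.0) -> float:
    """Data rate after compression at the given ratio (3:1 per the slide)."""
    return raw_gbps / ratio


# Hypothetical mode: 4K UHD, 60 Hz, 10 bits per color channel (30 bpp).
raw = uncompressed_gbps(3840, 2160, 60, 30)
dsc = compressed_gbps(raw)
print(f"raw: {raw:.2f} Gbit/s, with 3:1 DSC: {dsc:.2f} Gbit/s")
```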

### **Transport encoding**

Stream encode for HDMI and DisplayPort protocols



### Display Engine Router & Ports

### **Stream assembly**

Combine dual pixel pipeline streams and drive multi-stream transport

### **Port Routing**

Up to 4 ports are supported for added flexibility, including one eDP port





# eDisplayPort 1.5 Panel Replay

Evolution of Panel Self Refresh

Selective update with early transport

Adaptive sync with panel replay
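The idea behind selective update is that only the part of the frame that changed needs to be sent to the panel, while an unchanged frame needs no transfer at all. The sketch below illustrates that idea on toy frames by finding the span of scanlines that differ between two frames. It is a conceptual model only, not the eDP 1.5 protocol or Intel's implementation; all names are hypothetical.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # a frame as rows of pixel values


def dirty_rows(prev: Frame, curr: Frame) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of rows that differ, or None if identical."""
    changed = [i for i, (a, b) in enumerate(zip(prev, curr)) if a != b]
    if not changed:
        return None
    return changed[0], changed[-1]


def rows_to_send(prev: Frame, curr: Frame) -> int:
    """Rows in the single contiguous update region covering all changes."""
    region = dirty_rows(prev, curr)
    return 0 if region is None else region[1] - region[0] + 1


# Toy 8x4 frame where two pixels change, on rows 2 and 4.
prev_frame = [[0] * 4 for _ in range(8)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][1] = 1
curr_frame[4][0] = 1

print(dirty_rows(prev_frame, curr_frame))    # region covering rows 2..4
print(rows_to_send(prev_frame, curr_frame))  # rows sent instead of all 8
print(rows_to_send(prev_frame, prev_frame))  # static frame: nothing to send
```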



### Display Engine Content Matched Refresh Rate





Legacy





Burst fill



Panel self refresh





Selective update and optimized vertical blanking interrupts



Selective update and hardware queuing



+ early transport



Efficiency gains





Efficiency gains





Lunar Lake

### Media Engine



Media

### New Memory Side Cache

#### Bandwidth savings

Reduction in traffic to system memory across media workloads

#### Power savings

Significant power reduction for encode workloads





### Lunar Lake Media Engine







Significantly Reducing Bitrate at the Same Quality



Reduction in file size

Adaptive resolution streaming

Screen content coding

360-degree & panoramic



### Traditional Resolution Change

Send new reference data

Refresh decode buffer







### Adaptive Resolution Streaming

Less data transfer

Less stream buffering









Video Scaler

**Color Space Converter** 

Video Enhancer

**HDR Tone Mapper** 

**Bayer Processor** 



#### H.265, AV1, VVC

## Screen Content Coding

Screen sharing

Remote desktop

Game streaming

Figure: screen capture of source code, AV1 with SCC vs. AV1 without SCC.

### Evolution of Media Codecs

|                 | MPEG 2                                              | H.264/AVC                                    | VP9                                      | H.265/HEVC                                     | AV1                                                 | H.266/VVC                                                           |
|-----------------|-----------------------------------------------------|----------------------------------------------|------------------------------------------|------------------------------------------------|-----------------------------------------------------|---------------------------------------------------------------------|
| Key motivations | Standard definition<br>DVDs<br>Television broadcast | High definition<br>Blu-ray<br>Internet video | Internet streaming<br>Video conferencing | Ultra HD (4K/8K)<br>4K streaming<br>4K Blu-ray | Streaming at scale<br>Game streaming<br>HDR content | Emerging technology<br>360 / panoramic video<br>Adaptive resolution |
| File size       | >2x                                                 | ~2x                                          | 1.4x                                     | 1.4x                                           | 1x                                                  | ~0.9x                                                               |
| Complexity      | <1x                                                 | 1x                                           | 5-10x                                    | 5-10x                                          | 65-100x                                             | 80-100x                                                             |
| Year            | 1996                                                | 2003                                         | 2013                                     | 2013                                           | 2018                                                | 2020                                                                |
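The file-size row reads most easily as a set of multipliers against the AV1 column (1x). As a rough sketch, the snippet below applies those multipliers to a hypothetical 100 MB AV1 file. The approximate entries (~2x, ~0.9x) are treated as exact for the arithmetic, and MPEG 2 is left out because the table gives only a lower bound (>2x).

```python
# Relative file size at the same quality, normalized to AV1 = 1x (from the table).
RELATIVE_SIZE = {
    "H.264/AVC": 2.0,   # table says ~2x
    "VP9": 1.4,
    "H.265/HEVC": 1.4,
    "AV1": 1.0,
    "H.266/VVC": 0.9,   # table says ~0.9x
}


def estimate_size_mb(av1_size_mb: float, codec: str) -> float:
    """Estimated size in `codec` of content that takes av1_size_mb in AV1."""
    return av1_size_mb * RELATIVE_SIZE[codec]


def savings(codec_from: str, codec_to: str) -> float:
    """Fractional size reduction when moving from one codec to another."""
    return 1.0 - RELATIVE_SIZE[codec_to] / RELATIVE_SIZE[codec_from]


AV1_SIZE_MB = 100.0  # hypothetical file size, for illustration only
for codec in RELATIVE_SIZE:
    print(f"{codec:11s} ~{estimate_size_mb(AV1_SIZE_MB, codec):.0f} MB")

print(f"H.264/AVC -> H.266/VVC saves ~{savings('H.264/AVC', 'H.266/VVC'):.0%}")
```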





Lunar Lake

## GPU SW Stack





### Windows GPU Software Stack

Ready for Xe2












### Lunar Lake

### Graphics

Better, faster and more efficient on all fronts

2nd gen Xe-cores



enhanced XeSS kernels





67 TOPS















efficiency optimized media & display engines







#### Notices & Disclaimers

The preceding presentation contains product features that are currently under development. Information shown through the presentation is based on current expectations and subject to change without notice.

Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications or configurations.

Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.

AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC.

No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.

All product plans and roadmaps are subject to change without notice.

Performance hybrid architecture combines two core microarchitectures, Performance-cores (P-cores) and Efficient-cores (E-cores), on a single processor die first introduced on 12th Gen Intel® Core™ processors. Select 12th Gen and newer Intel® Core™ processors do not have performance hybrid architecture, only P-cores or E-cores, and may have the same cache size. See ark.intel.com for SKU details, including cache size and core frequency.

Built-in Intel® Arc™ GPU only available on select Intel® Core™ Ultra processor-powered systems; OEM enablement required.

Some images may have been altered or simulated and are for illustrative purposes only.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.



#### APPENDIX

| Claim # & Statement                                                                                               | Slide # & Title/Details                                                                                                                                                                                                                                                    |
|-------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                   | SLIDES 4 & 18: Improving IP performance efficiency                                                                                                                                                                                                                         |
| Xe2 IP performance per Xe-core is 1.2x to 12.5x higher than Xe1 IP across a set of various graphics functions.    | Results are based on an internal suite of micro benchmarks and collected on a pre-release Xe2 engineering platform with pre-release GFX software. The comparison is a selected subset of micro benchmarks normalized for equal Xe-cores configuration and clock frequency. |
|                                                                                                                   | SLIDE 25: Lunar Lake Xe2 GPU Performance                                                                                                                                                                                                                                   |
| 1.5x graphics performance over Meteor Lake                                                                        | Testing by Intel as of May 2024. Data based on Lunar Lake reference validation platform measurement vs Meteor Lake reference validation platform as measured by 3DM Time Spy. 3DMark*                                                                                      |
|                                                                                                                   | SLIDES 43-44: Display Engine Power Optimization                                                                                                                                                                                                                            |
| Lunar Lake's Display Engine<br>benefits from a list of power<br>savings optimizations across a set<br>of use cases | Testing conducted by Intel's Display Engine engineering team to validate functionality of various power savings features on pre-release engineering platform with pre-release software.                                                                                    |
|                                                                                                                   | SLIDE 59: Lunar Lake Graphics                                                                                                                                                                                                                                              |
| 1.5x faster graphics performance vs. Meteor Lake GPU                                                              | Testing by Intel as of May 2024. Data based on Lunar Lake reference validation platform measurement vs Meteor Lake reference validation platform as measured by 3DM Time Spy. 3DMark*                                                                                      |


