### Solution Brief



Reliability, Availability, and Serviceability (RAS)
Intel® Xeon® Scalable Processors

# Improving Server Uptime in the Data Center

## Server platforms using Intel® Xeon® Scalable processors can help reduce server crashes.



In today's hyperscale cloud data centers, workloads are distributed across servers, making reliability critical. Failures that are isolated to a single component, such as server memory, can have a widespread effect on workloads. Intel has enhanced its reliability, availability, and serviceability (RAS) technologies to help customers address this challenge. Server platforms using 3rd Gen Intel Xeon Scalable processors and later contain enhancements that can help reduce memory-error-caused server crashes. These improvements increase server uptime in the data center, which is critical to successful enterprise operations.



#### RAS technology improvements on server platforms

Servers built with 2nd Gen Intel Xeon Scalable processors and earlier used Machine Check Architecture (MCA) recovery to reduce memory-error-caused server crashes.

To further reduce server crashes, Intel implemented error correction code (ECC) and processor context corruption (PCC) case enhancements starting with 3rd Gen Intel Xeon Scalable processors. See Table 1 for the evolution of Intel RAS technology enhancements to decrease memory-error-caused server crashes.

This solution brief describes memory-related architectural enhancements to Intel Xeon Scalable processors that benefit server platforms. It also describes how proactively screening memory can lead to improved uptime in the data center.

#### ECC enhancements

Memory data errors are logged as correctable errors (CEs) or uncorrectable errors (UEs). UEs are typically accompanied by prior CEs. Therefore, reducing CEs can help reduce the total UE rate.
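The CE/UE distinction can be illustrated with a textbook code. The sketch below is a minimal extended Hamming (8,4) SECDED code: a single flipped bit is corrected and reported as a CE, while two flipped bits are detected but cannot be corrected and are reported as a UE. This is an illustrative assumption only. It is not the ECC that Intel Xeon Scalable memory controllers implement, and the function names `encode_secded` and `decode_secded` are hypothetical.

```python
def encode_secded(nibble):
    """Encode 4 data bits (0..15) as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 follow the classic
    Hamming(7,4) layout with parity bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode_secded(code):
    """Return (status, nibble) where status is "ok", "CE", or "UE"."""
    # The syndrome is the XOR of the positions of all set bits.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 when the overall parity is still even

    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single-bit error and correct it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "CE"
    else:
        # Nonzero syndrome with even parity: two bits flipped, not correctable.
        return "UE", None

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode_secded(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode_secded(word))
    print(decode_secded(one_flip))
    print(decode_secded(two_flips))
```

In this toy model, a cell that already carries one correctable fault needs only one more flipped bit to become uncorrectable, which is the intuition behind reducing CEs in order to reduce UEs.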

Intel has enhanced ECCs in Intel Xeon Scalable processors to reduce CEs, starting with 3rd Gen Intel Xeon Scalable processors. Intel® x4 Single Device Data Correction (Intel® x4 SDDC) is also implemented to increase DDR4 memory coverage and help reduce multi-bit errors and sub-device errors.

#### PCC case enhancement

Intel discovered that a large percentage of unrecoverable memory UE crashes on server platforms are caused by processor context corruption (PCC). To address this, Intel made enhancements, starting with 3rd Gen Intel Xeon Scalable processors, that can increase the overall UE recovery rate.

Table 1. Evolution of RAS technology on Intel Xeon Scalable processors

| Intel Xeon Scalable processor generation | MCA recovery | ECC enhancements | PCC enhancements |
|------------------------------------------|--------------|------------------|------------------|
| 1st Gen and 2nd Gen                      | Yes          | No               | No               |
| 3rd Gen, 4th Gen, and 5th Gen            | Yes          | Yes              | Yes              |

#### Proactive memory screening

Original equipment manufacturers (OEMs) and cloud service providers (CSPs) can proactively screen memory on Intel Xeon processor–based server platforms to reduce failures after deployment.

For example, Intel customers have deployed DIMM test patterns provided by DIMM suppliers on server manufacturing lines to stress test and screen out low-quality DIMMs. This has proven to be an effective approach to preventing low-quality DIMMs from impacting fleets after servers go online.

To better utilize the Converged Pattern Generator and Checker (CPGC) logic and provide a robust memory test application, the Intel® Advanced Memory Test feature, based on industry-standard test algorithms, has been added to the Intel Memory Reference Code (MRC).

The Advanced Memory Test feature consolidates the test patterns of key memory vendors with test patterns initialized by Intel, providing better coverage to help screen out low-quality DDR DIMMs. This testing feature enables OEMs and CSPs to make repairs before broad server deployment. Some CSP customers have also deployed Advanced Memory Test in their manufacturing processes.
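To give a sense of what a pattern-based memory screen does, the sketch below runs March C-, a widely taught memory test algorithm, against a simulated memory with injected stuck-at faults. The brief does not say which algorithms Advanced Memory Test uses, so the choice of March C- is an assumption made for illustration. The `SimulatedMemory` class and `march_c_minus` function are hypothetical and do not use CPGC or MRC.

```python
class SimulatedMemory:
    """Bit-addressable memory model with optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        # Maps an address to the bit value it is stuck at (hypothetical faults).
        self.stuck_at = dict(stuck_at or {})

    def __len__(self):
        return len(self.cells)

    def write(self, addr, bit):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(mem):
    """Run March C- and return the sorted addresses that failed any read."""
    n = len(mem)
    up = range(n)
    down = range(n - 1, -1, -1)
    failures = set()

    def element(order, expect, write):
        # One march element: optionally read-and-check, then optionally write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)    # any order: write 0
    element(up, 0, 1)       # ascending: read 0, write 1
    element(up, 1, 0)       # ascending: read 1, write 0
    element(down, 0, 1)     # descending: read 0, write 1
    element(down, 1, 0)     # descending: read 1, write 0
    element(up, 0, None)    # any order: read 0
    return sorted(failures)


if __name__ == "__main__":
    healthy = SimulatedMemory(16)
    faulty = SimulatedMemory(16, stuck_at={2: 0, 9: 1})
    print(march_c_minus(healthy))
    print(march_c_minus(faulty))
```

A module that reports any failing address in such a screen would be a candidate for repair or replacement before the server is deployed.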

#### Tencent: customer case study

Tencent, a CSP, used six months of statistical data to compare server clusters with 3rd Gen Intel Xeon Scalable processors to server clusters with 2nd Gen Intel Xeon Scalable processors. Data reported by Tencent indicated that, on average, there were approximately 67 percent fewer memory-caused server crashes on server platforms with 3rd Gen Intel Xeon Scalable processors than on server platforms with 2nd Gen Intel Xeon Scalable processors. According to the Tencent analysis, key factors that contributed to the reduction included:

- ECC enhancements on the 3rd Gen Intel Xeon Scalable processors reduced the memory UE crash rate by as much as 35 percent when the same CEs were injected on select Intel server platform clusters with 2nd Gen Intel Xeon Scalable processors and 3rd Gen Intel Xeon Scalable processors.<sup>2</sup>
- PCC enhancements on the 3rd Gen Intel Xeon Scalable processors reduced the PCC-caused UE crash rate significantly, by 56 percent.<sup>2</sup>
- Tencent's six months of statistical data showed that the number of DIMMs screened out due to poor quality was 4x higher on clusters built with 3rd Gen Intel Xeon Scalable processors than on clusters built with 2nd Gen Intel Xeon Scalable processors.<sup>2</sup> This represents defect data from all three primary suppliers of DDR4, as measured in defective parts per million (DPPM). A higher screen-out rate at the manufacturing stage helps reduce the memory UE rate for a server fleet deployed online in the field.
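The percentages and DPPM figures above are simple ratios. The sketch below shows how such metrics are typically computed. The helper names `percent_reduction` and `dppm` are hypothetical, and the counts in the usage section are invented round numbers chosen only to make the arithmetic easy to follow. They are not Tencent's data.

```python
def percent_reduction(baseline, improved):
    """Relative reduction from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


def dppm(defective, total):
    """Defective parts per million for a screened population."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1_000_000 * defective / total


if __name__ == "__main__":
    # Hypothetical crash counts over the same period and fleet size.
    print(percent_reduction(baseline=300, improved=99))

    # Hypothetical screen-out counts per 20,000 DIMMs on two clusters.
    older = dppm(defective=5, total=20_000)
    newer = dppm(defective=20, total=20_000)
    print(older, newer, newer / older)
```

With these invented inputs, 99 crashes against a baseline of 300 is a 67 percent reduction, and 20 screened-out DIMMs versus 5 per 20,000 is a 4x higher screen-out rate.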



**Figure 1.** Intel server platform memory failure improvements measured by Tencent with 3rd Gen Intel Xeon Scalable processors versus 2nd Gen Intel Xeon Scalable processors

#### Summary

Server platforms that use 3rd Gen Intel Xeon Scalable processors have demonstrated much lower memory UE crash rates than server platforms that use 2nd Gen Intel Xeon Scalable processors. These benefits are expected to continue on platforms using 4th Gen and 5th Gen Intel Xeon Scalable processors. By improving its RAS technologies, Intel enables customers to deliver computing solutions with improved uptime and higher reliability. This benefits customers around the world who depend on high-quality, reliable data center operations to run their businesses effectively.



<sup>1</sup> Intel. "Engineering Practice to Reduce Server Crash Rate from DDR Uncorrectable Errors (UCE) in Hyperscale Cloud Data Center." 2020. intel.com/content/dam/www/public/us/en/documents/white-papers/reduce-server-crash-rate-tencent-paper.pdf.

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for additional details.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. Printed in USA 0823/MG/PRW/PDF Please Recycle 355690-001US.

<sup>2</sup> Based on Tencent internal analysis as of June 2023 provided directly to Intel. Analysis compares 3rd Gen Intel Xeon Scalable processors versus 2nd Gen Intel Xeon Scalable processors. Results may vary.