It’s no secret that Intel’s enterprise processor platform has been stretched in recent generations. Compared to the competition, Intel is chasing its multi-die strategy while relying on a manufacturing node that hasn’t been the best in the market. That being said, Intel is quoting more shipments of its latest Xeon products in December than AMD shipped in all of 2021, and the company is launching the next-generation Sapphire Rapids Xeon Scalable platform later in 2022. What comes after Sapphire Rapids has been kept mostly under wraps, with minor leaks here and there, but today Intel is lifting the lid on that roadmap.

State of Play Today

Currently in the market is Intel’s Ice Lake 3rd Generation Xeon Scalable platform, built on Intel’s 10nm process node with up to 40 Sunny Cove cores. The die is large, around 660 mm2, and in our benchmarks we saw a sizeable generational uplift in performance compared to the 2nd Generation Xeon offering. The response to Ice Lake Xeon has been mixed, given the competition in the market, but Intel has forged ahead by leveraging a more complete platform coupled with FPGAs, memory, storage, networking, and its unique accelerator offerings. Datacenter revenues, depending on the quarter you look at, are either up or down based on how customers are digesting their current processor inventories (as stated by CEO Pat Gelsinger).

That being said, Intel has put a large amount of effort into discussing its 4th Generation Xeon Scalable platform, Sapphire Rapids. For example, we already know that it will be using >1600 mm2 of silicon for the highest core count solutions, with four tiles connected with Intel’s Embedded Multi-die Interconnect Bridge (EMIB) technology. The chip will have eight 64-bit memory channels of DDR5, support for PCIe 5.0, as well as most of the CXL 1.1 specification. New Advanced Matrix Extensions (AMX) also come into play, along with Data Streaming Accelerators (DSA) and QuickAssist Technology (QAT), all built on the latest P-core designs currently present in the Alder Lake desktop platform, albeit optimized for datacenter use (which typically means AVX-512 support and bigger caches). We already know that versions of Sapphire Rapids will be available with HBM memory, and the first customer for those chips will be the Aurora supercomputer at Argonne National Laboratory, coupled with the new Ponte Vecchio high-performance compute accelerator.
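As an aside for anyone wanting to check what a given part actually exposes once hardware is in hand, the new instruction set features show up as CPU flags on Linux. Below is a minimal sketch, assuming a Linux host and a kernel new enough to know the AMX flag names; DSA and QAT are separate devices rather than CPU flags, so they are not covered here.

```python
# Quick sanity check of which Sapphire Rapids era ISA features a Linux host exposes.
# Flag names follow the kernel's /proc/cpuinfo conventions.

def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512f", "amx_tile", "amx_bf16", "amx_int8"):
    print(f"{feature:10} {'present' if feature in flags else 'absent'}")
```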

The launch of Sapphire Rapids is significantly later than originally envisioned several years ago, but we expect to see the hardware widely available during 2022, built on Intel 7 process node technology.

Next Generation Xeon Scalable

Looking beyond Sapphire Rapids, Intel is finally putting materials into the public domain to showcase what is coming up on the roadmap. After Sapphire Rapids we will have a platform-compatible Emerald Rapids Xeon Scalable product, also built on Intel 7, in 2023. Given the naming conventions, Emerald Rapids is likely to be the 5th Generation.

Emerald Rapids (EMR), as with some other platform updates, is expected to capture the low-hanging fruit from the Sapphire Rapids design to improve performance, as well as benefit from updates to the manufacturing process. Platform compatibility means Emerald Rapids will have the same support when it comes to PCIe lanes, CPU-to-CPU connectivity, DRAM, CXL, and other IO features. We’re likely to see updated accelerators too. Exactly what the silicon will look like, however, is still an unknown. As we’re still early in Intel’s tiled product portfolio, there’s a good chance it will be similar to Sapphire Rapids, but it could equally be something new, such as what Intel has planned for the generation after.

After Emerald Rapids is where Intel’s roadmap takes on a new highway. We’re going to see a diversification in Intel’s strategy on a number of levels.

Starting at the top is Granite Rapids (GNR), built entirely of Intel’s performance cores, on an Intel 3 process node for launch in 2024. Previously, Granite Rapids had been on roadmaps as an Intel 4 product; however, Intel has stated to us that the progression of the technology, as well as the timeline of where it will come into play, makes it better to put Granite Rapids on Intel 3. Intel 3 is meant to be Intel’s second-generation EUV node after Intel 4, and we expect the design rules to be very similar between the two, so we suspect it’s not that much of a jump from one to the other.

Granite Rapids will be a tiled architecture, just as before, but it will also feature a bifurcated strategy in its tiles: it will have separate IO tiles and separate core tiles, rather than a unified design like Sapphire Rapids. Intel hasn’t disclosed how they will be connected, but the idea here is that the IO tile(s) can contain all the memory channels, PCIe lanes, and other functionality while the core tiles can be focused purely on performance. Yes, it sounds like what Intel’s competition is doing today, but ultimately it’s the right thing to do.

Granite Rapids will share a platform with Intel’s new product line, which starts with Sierra Forest (SRF), also on Intel 3. This new product line will be built from datacenter-optimized E-cores, which we’re familiar with from Intel’s current Alder Lake consumer portfolio. The E-cores in Sierra Forest will be a later generation than the Gracemont E-cores we have today, but the idea here is to provide a product that focuses on core density rather than outright core performance. This allows the cores to run at lower voltages and scale out across many parallel threads, assuming the memory bandwidth and interconnect can keep up.

Sierra Forest will be using the same IO die as Granite Rapids. The two will share a platform – we assume in this instance this means they will be socket compatible – so we expect to see the same DDR and PCIe configurations for both. If Intel’s numbering scheme continues, GNR and SRF will be Xeon Scalable 6th Generation products. Intel stated to us in our briefing that the product portfolio currently offered by Ice Lake Xeon products will be covered and extended by a mix of GNR and SRF Xeons based on customer requirements. Both GNR and SRF are expected to have full global availability when launched.

The density-focused, E-core-only Sierra Forest will inevitably be compared to AMD’s equivalent, which for Zen 4c will be called Bergamo – and AMD might have a Zen 5 equivalent by the time SRF comes to market.

I asked Intel whether the move to GNR+SRF on one unified platform means the generation after will be a unique platform, or whether it will retain the two-generation platform compatibility that customers like. I was told that it would be ideal to maintain platform compatibility across the generations, although as these are planned out, it depends on timing and where new technologies need to be integrated. The earliest industry estimates (beyond CPU) for PCIe 6.0 are in the 2026 timeframe, and DDR6 is more like 2029, so unless there are more memory channels to add, it’s likely we’re going to see parity between 6th and 7th Gen Xeon.

My other question to Intel was about hybrid CPU designs – if Intel is now going to make P-core tiles and E-core tiles, what’s stopping a combined product with both? Intel stated that its customers prefer homogeneous (single core type) designs in this market, as the needs from customer to customer differ. If one customer prefers an 80/20 split of P-cores to E-cores, there’s another customer that prefers a 20/80 split. Having a wide array of products for each different ratio doesn’t make sense, and customers already investigating this are finding that the software works better with a homogeneous arrangement, with the split made at the system level rather than the socket level. So we’re not likely to see hybrid Xeons any time soon. (Ian: Which is a good thing.)

I did ask about the unified IO die - giving the P-core-only and E-core-only Xeons the same number of memory channels and I/O lanes might not be optimal for either scenario. Intel didn’t really have a good answer here, aside from the fact that building them both onto the same platform helped customers amortize non-recurring engineering (NRE) costs across both CPUs, regardless of which one they used. I didn’t ask at the time, but we could see the door open to more Xeon-D-like scenarios with different IO configurations for smaller deployments, but we’re talking about products that are 2-3+ years away at this point.

Xeon Scalable Generations
Date      Gen   Codename          Abbr.   Max Cores   Node      Socket
Q3 2017   1st   Skylake           SKL     28          14nm      LGA 3647
Q2 2019   2nd   Cascade Lake      CLX     28          14nm      LGA 3647
Q2 2020   3rd   Cooper Lake       CPL     28          14nm      LGA 4189
Q2 2021   3rd   Ice Lake          ICL     40          10nm      LGA 4189
2022      4th   Sapphire Rapids   SPR     *           Intel 7   LGA 4677
2023      5th   Emerald Rapids    EMR     ?           Intel 7   **
2024      6th   Granite Rapids    GNR     ?           Intel 3   ?
2024      6th   Sierra Forest     SRF     ?           Intel 3   ?
>2024     7th   Next-Gen P        ?       ?           ?         ?
>2024     7th   Next-Gen E        ?       ?           ?         ?
* Estimate is 56 cores
** Estimate is LGA 4677

For both Granite Rapids and Sierra Forest, Intel is already working with key ‘definition customers’ for microarchitecture and platform development, testing, and deployment. More details to come, especially as we move through Sapphire and Emerald Rapids during this year and next.

Comments

  • mode_13h - Monday, February 21, 2022 - link

    This seems like a perfect use case for tiered memory (see my post about CXL memory). Because oversubscription is so painful, you need to have RAM for your guests' full memory window. However, that's not to say that all of the RAM needs to be running at full speed. For instance, the "free" RAM in a machine that's serving as disk cache will tend to be fairly light duty-cycle and is an easy target for demoting to a slower memory tier. Watch this space.
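As a concrete, heavily hedged illustration of the demotion idea: recent Linux kernels expose a NUMA demotion knob that lets reclaimed pages migrate to a slower memory node (such as CXL-attached memory presented as a CPU-less NUMA node) rather than being evicted outright. A minimal sketch, assuming a kernel new enough to expose this sysfs path:

```python
# Sketch: inspect (and optionally enable) kernel page demotion to a slower memory tier.
# Assumes a recent Linux kernel that exposes this knob; the path may vary by version,
# and writing it requires root.
from pathlib import Path

knob = Path("/sys/kernel/mm/numa/demotion_enabled")

if knob.exists():
    print("NUMA demotion currently:", knob.read_text().strip())
    # With demotion enabled, pages reclaimed from fast DRAM can be migrated to a
    # slower node (e.g. a CPU-less NUMA node backed by CXL memory) instead of dropped.
    # knob.write_text("1")
else:
    print("This kernel does not expose the NUMA demotion knob")
```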
  • schujj07 - Monday, February 21, 2022 - link

    Use of CXL to extend RAM into a RAM pool is an interesting option. Right now that isn't a thing but could be in the next couple years for sure. I wonder how they will do redundancy for a RAM pool. If a host crashes that can take down quite a few VMs. However, if a RAM pool crashes that could take down 1/2 your data center. In many ways I think it would have to be a setup like physical SANs. For sure this will be interesting to watch how it is done over the next decade. At first I can see this being too expensive for anyone who isn't like AWS or massive companies. My guess is for smaller companies with their own data centers it will be at least 10 years before it is cheap enough for us to implement this solution.
  • mode_13h - Tuesday, February 22, 2022 - link

    > I wonder how they will do redundancy for a RAM pool.

    For one thing, Intel is contributing CXL memory patches that allow hot insertion/removal. Of course, if a CXL memory device fails that your VM is using, then it's toast.

    There are techniques mainframes use to survive this sort of thing, but I'm not sure if that's the route CXL memory is headed down.

    > if a RAM pool crashes that could take down 1/2 your data center.

    I think the idea of CXL is to be more closely-coupled to the host than that. While it does offer coherency across multiple CPUs and accelerators, I doubt you'd use CXL for communication outside of a single chassis.
  • schujj07 - Tuesday, February 22, 2022 - link

    "I think the idea of CXL is to be more closely-coupled to the host than that. While it does offer coherency across multiple CPUs and accelerators, I doubt you'd use CXL for communication outside of a single chassis."

    From the little bit I have read about CXL memory, what I get from it is you would have a pool or two in each rack. In the data center everything has to be redundant otherwise you can have issues. SANs have dual controllers, hosts are never loaded to full capacity to allow for failover, etc... Would a CXL pool have dual controllers and mirror the data in RAM to the second controller? I'm sure they will use some of the knowledge from mainframes to figure out how to do this. I'm just not an engineer so I am doing nothing more than speculating.
  • mode_13h - Wednesday, February 23, 2022 - link

    > Would a CXL pool have dual controllers and mirror the data in RAM to the second controller?

    Interesting question. While the CXL protocol might enable cache-coherence across multiple CPUs and accelerators, I think that won't extend to memory mirroring. That would mean that a CXL memory device should implement any mirroring functionality, internally. Not ideal, of course. And I could be wrong about what CXL 2.0 truly supports. I guess we'll have to wait and see.
  • mode_13h - Wednesday, February 23, 2022 - link

    Just to be clear about what I meant, when I write data to one memory device, CXL ensures that write is properly synchronized with all other CXL devices. However, if a CPU tries to write out the same data to two different CXL memory devices, I doubt there's any way to be sure they're mutually synchronized.

    In other words, if you have two devices issuing writes to the same address, which is backed by mirrored memory, the first device might be first to write that address on the first memory module, but second to write it on the second memory module. So, the values will now be inconsistent.
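To make that failure mode concrete, here is a tiny, purely illustrative sketch - not CXL code, just two Python threads standing in for two devices - where each writer updates both copies of a "mirrored" location, but because there is no single serialization point the two copies can end up disagreeing:

```python
# Illustration of the ordering problem described above: two writers each update two
# mirrored copies of the same location, but with no single serialization point the
# copies can end up holding different "last" values depending on interleaving.
import threading

mirror_a = {"addr0": None}  # stands in for the first memory module
mirror_b = {"addr0": None}  # stands in for the second memory module

def writer(value, first, second):
    # Each writer happens to reach the two mirrors in a different order,
    # mimicking writes arriving at the modules out of order.
    first["addr0"] = value
    second["addr0"] = value

t1 = threading.Thread(target=writer, args=("X", mirror_a, mirror_b))
t2 = threading.Thread(target=writer, args=("Y", mirror_b, mirror_a))
t1.start(); t2.start()
t1.join(); t2.join()

# Depending on how the writes interleave, the mirrors may agree (both X or both Y)
# or diverge (one X, one Y) - the inconsistency described in the comment above.
print(mirror_a["addr0"], mirror_b["addr0"])
```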
  • Rοb - Monday, February 21, 2022 - link

    I think you could put 3 light memory usage VMs with one heavier usage VM, giving more VMs (at greater CPU utilization) and allowing one of the VMs (using 1/4 of a core) to have half as much memory - but if the user needed more memory then get them to pay for more than 1/4 core; have them pay for 2 cores (that they don't fully utilize) to get double the memory. If you won't buy bigger DIMMs you have to recover the allocatable memory somehow.

    The good (painful) news is that there are 512GB DDR5 DIMMs, and that the sweet (in a few years) spot will also be 2x of the DDR4 sizes, so you'll be able to get more memory (after the prices no longer eat the budget away). That means for 1/2M you could get 24TB of memory into the slots, if 2 CPUs can access that much; they don't try to save an address line.

    That 12 channel and CXL (hopefully 2.0) is coming is expected speculation - they should do it, and not wait too long.

    My theory is that the extra pins not accounted for by the above will go into moving the 2P interconnect via Infinity Fabric over PCIe into connect over an extra set of CXL 2.0 (for the encryption) lanes - freeing up more PCIe lanes; leaving 160-192 lanes as standard, instead of just 128.

    More memory and bandwidth, more PCIe lanes, and CXL 2.0 (which is announced), along with more cores (and their 700W boost) will set them ahead across the board; except for single thread performance (and hybrid E-cores, so they'll be a close second for power; with enough difference in price to pay for the electricity).
  • schujj07 - Tuesday, February 22, 2022 - link

    "I think you could put 3 light memory usage VMs with one heavier usage VM, giving more VMs (at greater CPU utilization) and allowing one of the VMs (using 1/4 of a core) to have half as much memory - but it the user needed more memory then get them to pay for more than 1/4 core; have them pay for 2 cores (that they don't fully utilize) to get double the memory. If you won't buy bigger DIMMs you have to recover the allocatable memory somehow."

    That isn't really how virtualization works. You have a bit of the idea right in that a VM will be placed onto a host that has the free resources. However, no one will be doing this by hand unless they have only 2 hosts. In VMware there is a tool called Distributed Resource Scheduler (DRS) that will automatically place VMs on the correct host in a cluster as well as migrate VMs between hosts for load balancing.

    There is no way to give a system only 1/4 core. The smallest amount of CPU that is able to be given is 1 Virtual CPU (that can be a physical core or a hyperthread). I cannot tell you on which physical CPU that vCPU will be run. Until a system needs CPU power, that vCPU sits idle. Once the system needs compute it goes to the hypervisor to ask for compute resources. The hypervisor then looks at what physical resources are available and then gives that system time on the physical hardware.

    As physical core counts have gone up it has gotten much easier for the hypervisor to schedule CPU resources without having the requesting system wait for said resources. When you used to have only dual 4c/8t or 8c/16t CPUs, you could easily have systems waiting for CPU resources if you were over-allocated on vCPU. In this case a dual 8c/16t server will have 32 vCPU but you could have enough VMs on the server that you have allocated a total of 64 vCPUs. A VM with 4 vCPU has to wait until there are 4 threads available (with the Meltdown/Spectre mitigations it would be 2c/4t or 4c) before it can get the physical CPU time.

    It can happen that say one system has 16 vCPU out of the 32 vCPU on the server and will be waiting almost forever for CPU resources. Since the scheduling isn't like getting in line, if only 8 vCPU frees up that 16 vCPU is left to wait while 2x 4 vCPU VMs get time on the CPU.

    The hosts I'm running all have 128 vCPU and that makes the CPU resource contention much less of an issue since at any time you are almost assured of free CPU resources. For example I have allocated over 160 vCPU to one server and never have I had an issue where VMs are waiting for compute. I would probably need to be in the 250+ vCPU allocated before I run into CPU resource contention. With these high core count server CPUs, the biggest limiting factor for number of VMs running on a host has changed from CPU to RAM.

    From the RAM side it will take a long time until 512GB DDR5 LRDIMMs are available. However, I could easily see the 128GB DDR5 RDIMM being the most popular size, like 64GB is right now for DDR4. For a company like where I work which does small cloud hosting, going from dual socket 8 Channel 64GB DIMMs (1TB total at 1DPC) to dual socket 12 Channel 128GB DIMMs (3TB total at 1DPC) is a huge boost.
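For what it's worth, the capacity figures being thrown around in this thread are just sockets × channels × DIMMs-per-channel × DIMM size. A quick back-of-the-envelope check (note that 12 memory channels for future platforms is the commenters' speculation, not something Intel has confirmed in this roadmap):

```python
# Back-of-the-envelope server RAM capacity: sockets x channels x DIMMs-per-channel x DIMM size.
def capacity_tb(sockets, channels, dimms_per_channel, dimm_gb):
    return sockets * channels * dimms_per_channel * dimm_gb / 1024

print(capacity_tb(2, 8, 1, 64))    # 2S, 8-ch,  1DPC, 64 GB DIMMs  -> 1.0 TB (today's example)
print(capacity_tb(2, 12, 1, 128))  # 2S, 12-ch, 1DPC, 128 GB DIMMs -> 3.0 TB (speculated)
print(capacity_tb(2, 12, 2, 512))  # 2S, 12-ch, 2DPC, 512 GB DIMMs -> 24.0 TB (the 24 TB figure above)
```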
  • Rοb - Thursday, February 24, 2022 - link

    "There is no way to give a system only 1/4 core." Semantics - Put 4 VMs on one physical core. One example explanation found with one minute of searching: https://superuser.com/a/698959/
  • schujj07 - Thursday, February 24, 2022 - link

    Not semantics at all. While I can have 4 different VMs each with a single CPU on one physical core, that doesn't mean they get 1/4 core. Here is an example with 4 VMs; let's call the VMs A, B, C, & D. The host machine has a single physical non-SMT CPU. VM A is running something that is using continuous CPU (say y-cruncher but only 25% CPU load). VMs B, C, & D want to run something on the CPU (say OS updates). Those VMs cannot each get 25% of the CPU all at the same time. They have to wait until VM A is done requesting that CPU access before one of those next VMs can request CPU for the updates. I cannot have the CPU running at 100% load with those VMs all running code simultaneously (in parallel for lack of a better term) on the CPU. The requests for CPU time are done in serial not parallel. Therefore you cannot give 1/4 of a CPU to a VM.

    That link you gave is talking about something different. When you set up a VM in VMware, Hyper-V, etc., you are asked to specify the number of CPUs for the system. You have 2 options for giving the number of CPUs (Cores & Sockets). 99.9% of the time it doesn't matter how you do it. You say you are giving a VM 4 CPUs, it gets 4 CPUs. When you look at the settings in VMware, you see that is 4 sockets @ 1 core/socket. However, I can change that to 1 socket @ 4 cores/socket or 2 sockets @ 2 cores/socket, etc. The OS doesn't care how it is set up as it is getting the CPU you told it was going to get. Where that matters is that some software licensing is done per socket, so you might get charged a lot more if your software thinks it is 4 sockets vs 1 socket for licensing. This does not mean I'm giving one system 1/4 (0.25) of a CPU vs 4 CPUs.

    FYI I am an expert in virtualization. I have my VMware Data Center Virtualization Certificate and am a VMware Certified Professional and I run a data center. I have been the lead VMware Admin for the last 3.5 years.
