A Hybrid/Heterogeneous Design

Developing a processor with two different types of core is not a new concept – there are billions of smartphones that have exactly that inside them, running Android or iOS, as well as IoT and embedded systems. We’ve also seen it on Windows, cropping up on Qualcomm’s Windows on Snapdragon mobile notebooks, as well as Intel’s previous Lakefield design. Lakefield was the first x86 hybrid design in that context, and Alder Lake is the more mass-market realization of that plan.

A processor with two different types of core disrupts the typical view of how we might assume a computer works. At a basic level, we are taught that a modern machine is uniform – every CPU has the same performance, processes the same data at the same rate, has the same latency to memory and the same latency to every other core, and everything is equal. This is a straightforward homogeneous design that’s very easy to write software for.

Once we start considering that not every core has the same latency to memory, and move up to a situation where different parts of a chip do different things at different speeds and efficiencies, we enter heterogeneous design territory. In this instance, it becomes more complex to understand what resources are available and how to use them best. Obviously, it makes sense to keep all of that transparent to the user.

With Intel’s Alder Lake, we have two types of cores: high performance/P-cores, built on the Golden Cove microarchitecture, and high efficiency/E-cores, built on the Gracemont microarchitecture. Each of these cores is designed for a different optimization point – P-cores have a super-wide performance window and go for peak performance, while E-cores focus on saving power at half the frequency, or lower, where the P-core might be inefficient.

This means that if there is a background task waiting on data, or something that isn’t latency-sensitive, it can work on the E-cores in the background and save power. When a user needs speed and power, the system can load up the P-cores with work so it can finish the fastest. Alternatively, if a workload is more throughput sensitive than latency-sensitive, it can be split across both P-cores and E-cores for peak throughput.
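This division of labor can be sketched as a toy scheduling policy (purely illustrative – the function and task categories here are assumptions for this example, not Intel's actual Thread Director logic):

```python
# Toy model of the policy described above: latency-sensitive work goes
# to P-cores first, background work goes to E-cores, and throughput
# work spills onto whatever is idle across both clusters.

def assign_core(task_kind, p_free, e_free):
    """Return which core pool a task should run on.

    task_kind: 'latency', 'background', or 'throughput'
    p_free/e_free: number of idle P-cores / E-cores
    """
    if task_kind == "latency" and p_free > 0:
        return "P"
    if task_kind == "background" and e_free > 0:
        return "E"
    # Throughput work, or a preferred pool that is full: take any idle core.
    if p_free > 0:
        return "P"
    if e_free > 0:
        return "E"
    return "wait"

# A latency-sensitive task lands on a P-core, background work on an E-core.
print(assign_core("latency", p_free=8, e_free=8))     # P
print(assign_core("background", p_free=8, e_free=8))  # E
```

A real scheduler also weighs per-thread telemetry and migration cost, which this sketch ignores entirely.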

For performance, Intel lists a single P-core as ~19% better than a core in Rocket Lake 11th Gen, while a single E-core can offer better performance than a Comet Lake 10th Gen core. Efficiency is similarly aimed to be competitive, with Intel saying a Core i9-12900K with all 16C/24T running at a fixed 65 W will equal its previous generation Core i9-11900K 8C/16T flagship at 250 W. A lot of that will come from the fact that many cores at a lower frequency are more efficient than a few cores at peak frequency (as we see in GPUs); even so, an effective 4x performance-per-watt improvement requires deeper investigation in our review.
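That efficiency claim can be sanity-checked with back-of-envelope arithmetic: equal performance at 65 W versus 250 W makes the power ratio the performance-per-watt gain (a rough sketch that ignores platform power and assumes Intel's "equal performance" framing at face value):

```python
# If a 65 W Core i9-12900K matches a 250 W Core i9-11900K on the same
# workload, performance per watt scales by the ratio of the two powers.
perf = 1.0                        # equal performance, per Intel's claim
ppw_gain = (perf / 65) / (perf / 250)
print(round(ppw_gain, 2))         # ~3.85, i.e. the "effective 4x" figure
```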

As a result, the P-cores and E-cores look very different. A deeper explanation can be found in our Alder Lake microarchitecture deep dive, but the E-cores end up being much smaller, such that four of them are roughly in the same area as a single P-core. This creates an interesting dynamic, as Intel highlighted back at its Architecture Day: A single P-core provides the best latency-sensitive performance, but a group of E-cores would beat a P-core in performance per watt, arguably at the same performance level.

However, one big question remains: how do these workloads end up on the right cores in the first place? Enter Thread Director (more on the next page).

A Word on L1, L2, and L3 Cache

Users with an astute eye will notice that Intel’s diagrams of core counts and cache amounts are representations, and on deeper inspection some of the numbers need explaining.

For the cores, the processor design is physically split into 10 segments.

A segment contains either one P-core or a cluster of four E-cores, owing to their relative size and functionality. Each P-core has 1.25 MiB of private L2 cache, while each group of four E-cores shares 2 MiB of L2 cache.

This is backed by a large shared L3 cache, totaling 30 MiB. Intel’s diagram shows 10 LLC segments, which should mean 3.0 MiB each, right? However, moving from Core i9 to Core i7, we only lose one segment (one group of four E-cores), so how come 5.0 MiB is lost from the total L3? Looking at the processor tables makes this even less clear.

 

Please note that the following is conjecture; we're awaiting confirmation from Intel that this is indeed the case.

It’s because there are more than 10 LLC slices – there are actually 12 of them, each 2.5 MiB. It’s likely that either each group of E-cores has two slices, or there are extra ring stops for more cache.

Each of the P-cores has a 2.5 MiB slice of L3 cache, with eight cores making 20 MiB of the total. This leaves 10 MiB between two groups of four E-cores, suggesting that either each group has 5.0 MiB of L3 cache split into two 2.5 MiB slices, or there are two extra LLC slices on Intel’s interconnect.

Alder Lake Cache
AnandTech    Cores     L2 Cache          L3 Cache  IGP  Base  Turbo  Price
             P+E/T     (MiB)             (MiB)          W     W      $1ku
i9-12900K    8+8/24    8x1.25 + 2x2.00   30        770  125   241    $589
i9-12900KF   8+8/24    8x1.25 + 2x2.00   30        -    125   241    $564
i7-12700K    8+4/20    8x1.25 + 1x2.00   25        770  125   190    $409
i7-12700KF   8+4/20    8x1.25 + 1x2.00   25        -    125   190    $384
i5-12600K    6+4/16    6x1.25 + 1x2.00   20        770  125   150    $289
i5-12600KF   6+4/16    6x1.25 + 1x2.00   20        -    125   150    $264

This is important because moving from Core i9 to Core i7, we lose four E-cores but also lose 5.0 MiB of L3 cache, leaving the 25 MiB listed in the table. Then from Core i7 to Core i5, two P-cores are lost, taking another 5.0 MiB of L3 cache, going down to 20 MiB. So while Intel’s diagram shows 10 distinct core/LLC segments, there are actually 12 slices. I suspect that if both sets of E-cores were disabled, leaving a processor with eight P-cores, 20 MiB of L3 cache would be shown.
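Assuming the 12-slice conjecture above is right (again, unconfirmed by Intel), a few lines of arithmetic reproduce every L3 figure in the table:

```python
# Model of the conjectured L3 layout: 12 slices of 2.5 MiB each, where
# every P-core owns one slice and every 4xE-core cluster owns two.
SLICE_MIB = 2.5

def l3_mib(p_cores, e_clusters):
    """Total L3 for a SKU with the given P-cores and E-core clusters."""
    return (p_cores + 2 * e_clusters) * SLICE_MIB

print(l3_mib(8, 2))  # Core i9:  30.0 MiB
print(l3_mib(8, 1))  # Core i7:  25.0 MiB
print(l3_mib(6, 1))  # Core i5:  20.0 MiB
print(l3_mib(8, 0))  # hypothetical 8P-only part: 20.0 MiB
```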

 
Intel Announces 12th Gen Core Alder Lake Thread Director: Windows 11 Does It Best

  • lmcd - Wednesday, October 27, 2021 - link

    You don't remember correctly at all. Apple's little cores are stupidly fast for little cores. Andrei flails in every Apple SoC review how stupid it is that there's no ARM licensed core answer to Apple's little cores.

    Intel probably roadmapped Alder Lake the minute they saw how performant Apple little cores were in even the iPhone 6S.

    Atom has been surprisingly good for a while. No need to make up conspiracies when you can buy a Jasper Lake SKU that confirms Intel Atom is far from slow.
  • name99 - Thursday, October 28, 2021 - link

    Apple's small cores are
    - about 1/3 the performance at
    - about 1/10th the power, net result being
    - same amount of computation takes about 1/3 the energy.

    The Intel cores appear (based on what's claimed) to be substantially faster -- BUT at the cost of substantially more power and thus net energy.
    If they are 70% of a P core but also use 70% of the power, that's net equal energy! No win!
    It won't be that bad, but if it's something like 70% of a P core at 35% of the power, that's still only half the net energy. Adequate, but not as good as Apple. My guess is we won't get as good as that, we'll land up at something like 50% of the power, so net 70% of the energy of a P core.

    (And of course you have to be honest in the accounting. Apple integrates the NoC speed, cache speeds, and DRAM speed, all ramped up or down in tandem with demand, so that if you're running only E cores it's your entire energy footprint that's reduced to a third. Will Intel drop the E-*core* energy quite a bit, only for it to make no real difference because everything from the NoC to the L3 to the DRAM to the PCIe is burning just as much power as before?)

    Essentially Apple is optimizing for energy usage by the small cores, whereas Intel seems to be optimizing for something like "performance per area".
    That's not an utterly insane design point, but it's also not clear that it's a *great* design point. In essence, it keeps Intel on the same track as the past ten years or so -- prioritizing revenue issues over performance (broadly defined, to include things like energy and new functionality). And so it keeps Intel on track with the Intel faithful -- but does nothing to expand into new markets, or to persuade those who are close to giving up on Intel.

    Or to put it more bluntly, it allows Intel to ship a box that scores higher in Cinebench-MT at the same chip area -- but that's unlikely to provide a notably different, "wow", experience from its predecessor, either in energy use or in "normal" (ie not highly-threaded) apps.

    Of course we'll see when the Anandtech review comes out. But this is what it looks like to me, as the salient difference between how Apple (and, just not as well, ARM) think of big vs little, compared to Intel.
  • nandnandnand - Thursday, October 28, 2021 - link

    "It won't be that bad, but if it's something like 70% of a P core at 35% of the power, that's still only half the net energy."

    I don't know how it will compare to Apple, but if it has a performance-per-area *and* a performance-per-watt advantage, it is a major improvement for x86. Especially as Intel iterates and puts 16 or 32 E-cores alongside 8 P-cores.

    Basically, Intel can continue to tinker with the P-cores to get the best possible single-threaded performance, knowing that 8 P-cores is enough for anyone™, but spamming many E-cores is Intel's path to more multi-threaded performance.

    Alder Lake can be considered a beta test. The benefits will really be felt when we see 40 cores, 48 threads (8+32) at the die space equivalent of 16 P-cores. The next node shrink after "Intel 7" will help keep power under control.
  • vogonpoetry - Wednesday, October 27, 2021 - link

    User-rewritable SPDs are a total game-changer for RAM overclockers. Many times I have wished for such a feature. As is on-the-fly power/frequency adjustment (though I wish we could change timings too).

    As for "Dynamic Memory Boost", doesn't Power Down Mode already do something similar currently? My DDR4 laptop memory frequency can be seen changing depending on workload.
  • Oxford Guy - Thursday, October 28, 2021 - link

    All overclocking is dead.
  • Oxford Guy - Thursday, October 28, 2021 - link

    I should have said: 'All user overclocking is dead'.

    Vendor-approved overclocking (i.e. going beyond JEDEC) is another matter.
  • Silver5urfer - Wednesday, October 27, 2021 - link

    On paper it looks okay. Starting with the Z690 chipset, it's a really deserved upgrade – lots of I/O plus RAID mode optimizations. AMD RAID is so bad; Level1Techs also showed how awful it was.

    STIM is interesting, given how the 10900K and 11900K improved vastly with delidding and LM. So that's a plus. Then the whole Win11 BS is annoying garbage. The Win11 OS is horribly anti-user, anti-desktop, anti-computing; it reeks of desperation to imitate Apple. It looks ugly, has Win32 downgrades with integration into UWP, taskbar downgrades, and an awful Explorer UI. It's outright unacceptable.

    Now the big part, CPU and price – looks like Intel is pricing it way lower than AMD, for unknown reasons, as Intel never does that; I find it disrupting. Also the CPU OC features are somewhat okay. I was expecting lower clocks but it looks like 5.1GHz, and looking at the new PL1 system I do not have a problem at all, since I want full performance now – no more BS from GN etc. citing "omg the power limits, 125W must be kept" on a damn K unlocked processor. But there were rumors of power consumption going past 350W like RKL once OCed; that's the reason why Intel is going 8C max, unlike Sapphire Rapids Xeon at 14C. DDR5 is also at the start of its life – not worth the new-adopter tax if DDR4 works, which is what I'm curious about: RKL Gear 1 at 4000MHz is impossible. I wonder how this will fare.

    The leaked performance preview shows mediocre improvements; the ST is definitely a lead on the P cores, the real cores. But the SMT/HT performance is really what I'm interested in vs the 10900K and Ryzen 5900X. RKL is also fast in ST, but SMT was okay, not great, because of the 14nm backport.

    I'll be waiting to see how Intel executes this, not going to invest tbh because new chipset, new CPU design, Win11 and I want to run Windows 7. I'd rather settle for a 10900K on Z590. PCIe SSDs are not much of a value for me, they have no use beyond load times and boot times for my use case, MLC 860 Pro SATA SSD is way better, runs cool, long lasting as well.
  • Gothmoth - Wednesday, October 27, 2021 - link

    People who do RAID without a dedicated PCIe RAID controller have no clue anyway.

    While most focus on performance, I am waiting on performance-per-watt figures.
  • vegemeister - Wednesday, October 27, 2021 - link

    Hardware RAID is a recipe for weird bugs and data loss, and provides no benefit over software RAID on top of the same controller running as a dumb HBA.

    Motherboard fake RAID is similarly pointless.
  • Gigaplex - Thursday, October 28, 2021 - link

    I'd rather do mdadm software RAID or use ZFS vs a PCIe RAID controller.
