The Intel Xe-LP GPU Architecture Deep Dive: Building Up The Next Generationby Ryan Smith on August 13, 2020 9:00 AM EST
- Posted in
- Tiger Lake
- Intel Arch Day 2020
As part of today’s Intel Architecture Day, Intel is devoting a good bit of its time to talking about the company’s GPU architecture plans. Though not a shy spot for Intel, per-se, the company is still best known for its CPU cores, and the amount of marketing attention they’ve put into the graphics side of their business has always been a bit weaker as a result. But, like so many other things at Intel, times are changing – not only is Intel devoting ever more die real estate to GPUs, but over the next two years they are transitioning into a true third player in the PC GPU space, launching their first new discrete GPU in several generations.
As part of Intel’s previously-announced Xe GPU architecture, the company intends to become a top-to-bottom GPU provider. This means offering discrete and integrated GPUs for everything from datacenters and HPC clusters to high-end gaming machines and laptops. This is a massive expansion for a company whom for the last decade has only been offering integrated GPUs, and one that has required a lot of engineering to get here. But, at long last, after a couple of years of talking up Xe and laying out their vision, Xe is about to become a reality for Intel’s customers.
While we’ll focus on different Xe-related announcements in separate articles – with this one focusing on Xe-LP – let’s quickly recap the state of Intel’s Xe plans, what’s new as of today, and where Xe-LP fits into the bigger picture.
When first announced back in 2018, Intel laid out plans for a single GPU architecture, Xe, comprised of three different microarchitectures: Xe-LP, Xe-HP, and Xe-HPC. Spanning the market from the bottom to the top respectively, Xe-LP would go into integrated and entry-level discrete graphics, Xe-HP into enthusiast and datacenter parts, and finally Xe-HPC would be for high performance computing clusters such as the upcoming Aurora supercomputer, the US Department of Energy’s long-awaited exaflop machine.
Since then, Intel has revised that plan a bit, and what was three microarchitectures is now four. Being announced as part of Intel Architecture Day today, Intel is revealing Xe-HPG, an additional microarchitecture for gaming-focused chips. We have more on Xe-HPG in this article, but at a high level it’s meant to be the missing piece of the puzzle in Intel’s product stack, offering a high-performance gaming and graphics-focused chip as opposed to Xe-HP, which is specializing in datacenter features like FP64 and multi-tile scalability. Xe-HPG is set to arrive in 2021, and notably will be built entirely at a third-party fab, unlike the rest of the Xe family.
Which to bring things back to the immediate topic of Xe-LP, makes this year’s launch of Intel’s first Xe microarchitecture all the more important. Intel’s plans for Xe involve building up successive Xe parts – quite literally in Xe-HPC’s case – going for wider designs that incorporate ever increasing numbers of base building blocks, and then scaling out the number of GPUs when even that isn’t enough. So Xe-LP is very much the foundation of the Xe family, not just in diagrams but in architecture as well; and consequently, what Intel has designed for Xe-LP will have repercussions for the entire Xe product stack.
Xe-LP: Integrated with Tiger Lake, But Discrete as Well
While Intel’s product plans for Xe have them eventually reaching far and wide, it’s only fitting that things start at the same place they always have for Intel’s GPUs: integrated graphics. Xe and Xe-LP will be making their first appearance in the market as part of Intel’s new Tiger Lake SoC, which the company will be launching on September 2nd. And despite the fact that Intel isn’t talking much about the product side of Tiger Lake itself – preferring to keep today about architecture and making September about products – Tiger Lake was very clearly the focal point for Xe-LP’s design. So Tiger Lake is the catalyst for it all, as we’ll see when discussing Xe-LP’s features.
Given Intel’s official disclosures today (never mind the many, many leaks), it’s clear that Tiger Lake parts are going to top out with a single Xe-LP slice. Which, new with this generation, is now 96 of Intel’s more fundamental GPU Execution Units (EUs). Overall, the company is targeting a 2x increase in performance over Ice Lake (Gen11) graphics,
But Tiger Lake won’t be the only place where Xe-LP will show up. As previously disclosed by Intel, the company is developing a discrete GPU version of it, which they are calling DG1. Designed to be paired with Tiger Lake in notebooks and other mobile devices, DG1 is Intel’s first discrete GPU in over twenty years, and it’s the spiritual-successor of sorts to Intel’s GT3 and GT4e integrated GPU configurations. Only instead of building low-volume CPU designs with a larger GPU, Intel will instead sell OEMs a discrete GPU based on the same architecture and built on the same 10nm SuperFin process as the integrated GPU.
DG1 will be shipping this year, so expect to see it show up in higher-performing Tiger Lake laptops. However Intel is otherwise disclosing very little about the part, as they are not disclosing much of anything with regards to product configurations today. So while we know that it’s based on Xe-LP and that it’s mobile-focused (Intel has dropped all discussion of desktop usage), we don’t have official details on anything such as its configuration or what type of memory it uses.
And forming the final pillar, the sever space won’t be left out on Xe-LP either. Intel will be delivering a quad-GPU product for servers that they are calling SG1. Based on four DG1 GPUs, this will be a replacement of sorts for Intel’s Xeon Visual Compute Accelerator family of products. Designed to leverage Intel’s earlier integrated GPUs, the Xeon VCA cards were aimed at the video encoding market, using Intel’s QuickSync media blocks to accelerate the process. Now that Intel has discrete GPUs, they no longer need to gang together CPUs for this market, and instead can sell accelerators with just the GPU. It’s a bit of a niche market with regards to the larger GPU ecosystem, but it’s an important one for Intel, so they’re hoping that SG1 will make server operators stand up and take notice – or at least those pesky Goa'ulds.
Post Your CommentPlease log in or sign up to comment.
View All Comments
mode_13h - Thursday, August 13, 2020 - linkAs always, thanks for the deep coverage.
Not finished reading, but I already have one complaint:
> Gen11’s smallest wavefront width is 8 threads wide (SIMD8), so it can take multiple clock cycles to execute a single wavefront, with Intel interleaving multiple threads as a form of latency hiding.
Wow. Mixing 2 different definitions of "thread" in the same sentence? Please don't.
Last I checked Nvidia is the only one talking about SIMD lanes as if they're threads. In Intel's Gen 9 whitepaper, it uses "threads" in a manner equivalent to CPU threads, and they talk about SIMD lanes as SIMD lanes.
And speaking of Gen 9, they claim it has 7-way SMT. Did they ever specify this, for Gen 11? I don't recall seeing it in their Gen 11 whitepaper, which went into significantly less detail on the EUs than previous whitepapers.
mode_13h - Thursday, August 13, 2020 - linkI guess your article could be self-consistent by replacing the second use of "thread" in that quoted sentence with "wavefront"?
Although, "wavefront" is an AMD term (Nvidia calls them "Warps"). However, Intel's slides suggest they still call them "threads".
Ryan Smith - Thursday, August 13, 2020 - link"I guess your article could be self-consistent by replacing the second use of "thread" in that quoted sentence with "wavefront"?"
You are correct sir! That was supposed to be "wavefront".
And Intel tends to use "wave" in its literature, though I prefer to collapse it down to just wavefront to keep things reasonably consistent. We don't need 2 nearly-identical terms for the same thing.
mode_13h - Thursday, August 13, 2020 - linkCool. Thanks for the reply!
BTW, I don't mind the term "wavefront" - I said that more to point it out to those who might not know.
mode_13h - Thursday, August 13, 2020 - linkIMO, the reason Nvidia has long called their Warp elements "threads" is so they can claim that each SIMD lane is a "core", to make their GPUs *sound* more impressive.
Since Volta finally fixed their per-lane IP register (which is basically just a fancy form of branch predication), there's almost a touch of truth in that characterization, and I'd finally agree that their ISA is more than just a straight-forward combination of SIMD + SMT.
xenol - Thursday, August 13, 2020 - linkAMD feels more confusing. Their base unit is a "stream processor" which seems to suggest something larger than it really is. But a group of stream processors is called a Compute Unit, which that seems to suggest something smaller than it really is.
Though looking at some of the programming literature for GPUs, I can see where the "thread" terminology comes from. So this looks more like a problem of someone coming up with their own language instead of the industry coming together to standardize on it. However, given that NVIDIA, AMD, and Intel have their own way of doing things, it may not be possible to do that and for the sake of clarity, having their own terminology is more or less correct.
mode_13h - Thursday, August 13, 2020 - linkSince Nvidia's Fermi and AMD's GCN, their architectures basically amount to SIMD + SMT. I'm not sure exactly when Intel added SMT.
Anyway, I wouldn't characterize their architectures as fundamentally different. Intel is traditionally the most distinct, among the three.
jim bone - Friday, August 14, 2020 - linkrecent editions of Hennessy and Patterson have a nice table mapping the CPU terminology to nvidia’s GPU terminology:
jim bone - Friday, August 14, 2020 - linkand yes for reasons nvidia calls a vertical slice of simd instructions a thread
kpx86 - Thursday, August 13, 2020 - linkI believe the SW libraries like DirectX and OpenGL use threads this way.
From MSFT website: The maximum number of threads is limited to D3D11_CS_4_X_THREAD_GROUP_MAX_THREADS_PER_GROUP (768) per group.