Eden-K121D - Thursday, July 14, 2016 - link
There is something fishy. Are they disguising pre-emption as async compute for Nvidia cards?
Eden-K121D - Thursday, July 14, 2016 - link
and no other GPUs
ddriver - Thursday, July 14, 2016 - link
Like everyone else, they sell out to the highest bidder, and AMD just doesn't have that much to bid.
euskalzabe - Thursday, July 14, 2016 - link
Wait, isn't Nvidia doing async, just via pre-emption? As far as I understand, AMD has proper ACEs so they do async on hardware, whereas Nvidia doesn't have the hardware parts and thus does async via software through pre-emption. In a similar way, AMD doesn't have Pascal's simultaneous multi-projection, so they do it via software.
In the end, they're both doing async in one way or another. Isn't that right?
edzieba - Thursday, July 14, 2016 - link
Kind of. Pre-emption has almost nothing to do with Asynchronous Compute.
Maxwell, Pascal, and GCN all support Async Compute, but implement it in different ways.
GCN uses Asynchronous Shaders (and ACEs) with hardware scheduling. But this only works under DX12 and Vulkan when software actually explicitly targets Async Compute. Otherwise, that silicon is left underutilised.
Maxwell and Pascal perform scheduling at the driver level (GPC partitioned in Maxwell, SM partitioned in Pascal). But because this is done in software, it was already implemented for DX11. This is why Async Compute sees little benefit on Maxwell and Pascal when moving from DX11 to DX12: Async Compute was already being performed.
xenol - Thursday, July 14, 2016 - link
So unless the app specifically asks to use the ACEs, AMD's drivers won't put the instructions through there, whereas NVIDIA does all its sorting ahead of time?
Yojimbo - Thursday, July 14, 2016 - link
The ACEs are schedulers, not execution units.
Scali - Friday, July 15, 2016 - link
Well, AMD's drivers did not implement Driver Command Lists (DCLs) in the DX11 API properly.
The DX11 API allows you to create multiple contexts, which you can use from multiple threads, where each context/thread can create their own DCL.
On nVidia hardware, we saw up to 2x the performance compared to using a single context (see the earlier 3DMark API overhead test).
On AMD, this was not implemented, so even if you made multiple threads and contexts, the driver would just serialize it and run it sequentially from a single thread. As a result, you saw 1x performance, regardless of how many threads you used.
Given this serializing behaviour, it seems that there was no way for AMD to make use of async compute in DX11 either.
nVidia could do this, but I'm not sure to what extent they actually did. All we can see is that nVidia did get reasonable performance increase from using multiple DX11 contexts, where AMD did not get anything at all. Whether some or all of nVidia's performance increase came from async compute, or some other benefits of multithreading, is difficult to say.
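For readers unfamiliar with the DX11 feature being discussed, a minimal sketch of the deferred context / driver command list pattern is shown below (C++; 'device', the recorded draw calls, and all error handling are placeholders, not taken from the discussion):

```cpp
#include <d3d11.h>

// Worker thread: record commands into a deferred context and bake them
// into a command list (DCL). 'device' is assumed to exist already.
ID3D11CommandList* RecordWork(ID3D11Device* device)
{
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // ... set state and issue draw calls on 'deferred' here ...

    ID3D11CommandList* commandList = nullptr;
    deferred->FinishCommandList(FALSE, &commandList);
    deferred->Release();
    return commandList;
}

// Main/render thread: replay the pre-recorded list on the immediate context.
void SubmitWork(ID3D11DeviceContext* immediate, ID3D11CommandList* commandList)
{
    immediate->ExecuteCommandList(commandList, FALSE);
    commandList->Release();
}
```

On a driver that implements DCLs natively, several threads can record such lists in parallel; on a driver that only emulates them, the recording is effectively serialized, which matches the behaviour described above.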
Yojimbo - Friday, July 15, 2016 - link
DirectX 11 has been out for 7 years and has been the mainstay of games development for a long time. How likely is it that AMD missed out, and continues to miss out, on performance using DirectX 11 simply because of poor driver implementation? If that is a feature of the API actually used by games and it makes a significant difference in performance then it's hard to believe AMD would just let it languish. It would be both incompetence on the part of their driver team and strategic mismanagement of resources by their management. Is it not possible that their architecture simply is not amenable to that feature of the API?
D. Lister - Friday, July 15, 2016 - link
Driver-level changes alone cannot mitigate hardware limitations, and despite the added features in the later versions, at its core, GCN has been outdated for quite some time. Consequently, we have been seeing one family after another of GPUs with nearly nonexistent OC headroom, ridiculous power usage and/or temperatures, and a list of open-source "features".
Sadly, right now, AMD is in a vicious cycle of finance; they need more money to fix these issues, but to make more money they need to fix these issues, hence the inevitable downward slope.
Scali - Saturday, July 16, 2016 - link
The 3DMark Technical Guide says as much: http://s3.amazonaws.com/download-aws.futuremark.co...
"DirectX 11 offers multi-threaded (deferred) context support, but not all vendors implement it in hardware, so it is slow. And overall, it is quite limited."
bluesoul - Saturday, July 16, 2016 - link
Wow you must believe in yourself.
Yojimbo - Thursday, July 14, 2016 - link
I don't think the static scheduling is done at a driver level. The way I understand it (and I am not a programmer, this is only from watching presentations) it is done at the request of the application programmer. And Pascal has a dynamic load balancing option which is able to override the requested balance, if allowed. I really don't know for sure, but I wonder if preemption might be used in their dynamic load balancing algorithm, but perhaps the one workload must completely finish before the other is rescheduled. I didn't see any details on how it works.
Scali - Friday, July 15, 2016 - link
"but perhaps the one workload must completely finish before the other is rescheduled"
That is what happens when there is no pre-emption. Pre-emption basically means that a workload can be stopped at some point, and its state saved to be continued at a later time. Another workload can then take over.
This is how multi-threading was implemented in OSes like UNIX and Windows 9x/NT, so multiple threads and programs can be run on a single core.
From what I understood about Maxwell v2's implementation, pre-emption was only possible between draw calls. So while the GPU can run various tasks in parallel, when you run a graphics task at the same time, you have to wait until it is finished before you can re-evaluate the compute tasks. As a result, compute tasks may have completed at some point during the draw call, and the compute units were sitting idle until they could receive new work.
This works fine, as long as your draw calls aren't taking too long, and your compute tasks aren't taking too little time. A common scenario is to have many relatively short draw calls, and relatively long-running compute tasks. In this scenario, Maxwell v2 will not lose that much time. But it implies that you must design your application workload in such a way, because if you don't, it may take a considerable hit, and it may be a better option not to try running many tasks concurrently, but rather running them one-by-one, each having the complete GPU at their disposal, rather than sharing the GPU resources between various tasks.
This is what you can also read here: https://developer.nvidia.com/dx12-dos-and-donts
With Pascal, pre-emption can occur at GPU instruction granularity, which means there is no longer any 'delay' until new compute tasks can be rescheduled. This means that units that would otherwise go idle can be supplied with a new workload immediately, and the efficiency of running concurrent compute tasks is not impacted much by the nature of the graphics work and overall application design.
Yojimbo - Friday, July 15, 2016 - link
I think that preemption (Why does everyone want to put a hyphen in the word? 'Emption' isn't even a word that I've heard of. Preemption, on the other hand, is.) and dynamic parallelism are two different things. The point of what I wrote that you quoted was that I haven't read anything which specifies how the dynamic parallelistic(?) algorithm works in Pascal. So I didn't want to assume that it relies on or even takes advantage of preemption.
You seem to be saying that dynamic parallelism cannot be implemented without an improved preemption, but AMD seems to accomplish it. NVIDIA's hardware cannot schedule new work on unused SMs while other SMs are working without suspending those working SMs first? Is that what you're saying?
Scali - Friday, July 15, 2016 - link
Dynamic parallelism is yet another thing.
Dynamic parallelism means that compute tasks can spawn new compute tasks themselves. This is not part of the DX12 spec afaik, and I don't know if AMD even supports that at all. It is a CUDA feature.
I was not talking about dynamic parallelism, but rather about 'static' parallelism, better known as concurrency.
Concurrency is possible, but the better granularity you have with pre-emption ('emption' is actually a word, as is 'pre', and 'pre-emption' is in the dictionary that way. It was originally a legal term), the more quickly you can switch between compute tasks and replace idle tasks with new ones.
And yes, as far as I understand it, Maxwell v2 cannot pre-empt graphics tasks. Which means that it cannot schedule new work on unused compute SMs until the graphics task goes idle (as far as I know this is exclusively a graphics-specific thing, and when you run only compute tasks, e.g. with CUDA HyperQ, this problem is not present).
In Pascal they removed this limitation.
Ryan Smith - Friday, July 15, 2016 - link
(We had a DB problem and we lost one post. I have however recovered the text, which is below)
Sorry I said dynamic parallelism when I meant dynamic load balancing. Yes dynamic parallelism is as you say. But dynamic load balancing is a feature related to concurrency. What I meant to say: Preemption and dynamic load balancing are two different things. And it is not clear to me that preemption is necessary or if it is used at all in Pascal's dynamic load balancing. I am not sure if you know how it is done or not, but the reason I wrote what I did in my original message ("but perhaps the one workload must completely finish before the other is rescheduled") is because I didn't want to assume it was used.
As an aside, I believe when a graphics task is running concurrently with a compute task (within a single GPC? or within the GPU? Again, the context here is unclear to me), Pascal has pixel-level graphics and thread-level compute granularity for preemption, not instruction-level granularity. When only compute tasks are running (on the GPU or within a GPC??), Pascal has instruction-level preemption. Your previous message seemed to imply that instruction-level granularity existed when running a graphics workload concurrently with compute tasks.
As far as emption, preemption, and pre-emption: https://goo.gl/PXKcmK
'Emption' follows 'pre-emption' so closely that one wonders how much 'emption' was ever used by itself since 1800. I do not believe you have seen 'emption' used very much. As for the dictionary, 'preemption' is definitely in the dictionary as the standard spelling. It seems clear the most common modern usage is 'preemption'. 'Pre-emption' is in no way wrong, I'm just wondering why people insist on the hyphen when 'emption' is hardly, if ever, used.
Scali - Saturday, July 16, 2016 - link
"I am not sure if you know how it is done or not, but the reason I wrote what I did in my original message ("but perhaps the one workload must completely finish before the other is rescheduled") is because I didn't want to assume it was used."But that is what pre-emption does, and we know that Maxwell and Pascal have pre-emption at some level.
"Pascal has pixel-level graphics and thread-level compute granularity for preemption, not instruction-level granularity."
What do you base this on?
nVidia has specifically stated that it has instruction-level granularity. See here:
https://images.nvidia.com/content/pdf/tesla/whitep...
"The new Pascal GP100 Compute Preemption feature allows compute tasks running on the GPU to be interrupted at instruction-level granularity, and their context swapped to GPU DRAM. This permits other applications to be swapped in and run, followed by the original task’s context being swapped back in to continue execution where it left off.
Compute Preemption solves the important problem of long-running or ill-behaved applications that can monopolize a system, causing the system to become unresponsive while it waits for the task to complete, possibly resulting in the task timing out and/or being killed by the OS or CUDA driver. Before Pascal, on systems where compute and display tasks were run on the same GPU, long-running compute kernels could cause the OS and other visual applications to become unresponsive and non-interactive until the kernel timed out."
So really, I don't know why so many people spread false information about nVidia's hardware.
bluesoul - Saturday, July 16, 2016 - link
In other words, or in layman's terms, does that mean that Maxwell v2 lacks the hardware or software required to gain from Async Compute?
Scali - Saturday, July 16, 2016 - link
No, it means that Maxwell v2 is more sensitive to the mix of workloads that you want to run concurrently. So it's more difficult to gain from async compute, but not impossible.
Yojimbo - Friday, July 15, 2016 - link
On further exploration it does seem like the driver is the one who tries to set up the static partitioning, by looking at the application. But I think with Pascal the programmer can request the driver to not allow dynamic load balancing.
Scali - Saturday, July 16, 2016 - link
There shouldn't be 'static partitioning'... In the D3D API you can create a number of queues to submit work to, and you can assign low or high priority to each queue. See here: https://msdn.microsoft.com/en-us/library/windows/d...
So both the number of queues, and the priority of each queue, is configured by the application.
It is up to the driver to dynamically schedule the workloads in these queues, while maintaining the requested priorities.
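As a rough illustration of what the application configures here, a sketch of creating an extra compute queue with raised priority is shown below (C++; 'device' is a placeholder for an existing ID3D12Device, and error handling is omitted):

```cpp
#include <d3d12.h>

// Create a dedicated compute queue with high priority, alongside whatever
// direct (graphics) queue the application already uses.
ID3D12CommandQueue* CreateHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute-only queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // hint for the scheduler
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ID3D12CommandQueue* queue = nullptr;
    device->CreateCommandQueue(&desc, __uuidof(ID3D12CommandQueue),
                               reinterpret_cast<void**>(&queue));
    return queue;
}
```

How the driver and hardware then interleave work from this queue with the graphics queue is exactly the vendor-specific part being debated in this thread.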
looncraz - Friday, July 15, 2016 - link
One of us is not understanding things accurately...
Preemption has a lot to do with async compute's performance, but more to do with how well the GPU can shuffle multiple long-running compute tasks. Both AMD and Intel have been FAR ahead of nVidia in this area, nVidia is desperately attempting to seek parity.
The 5% they are gaining from async compute here can be mostly (if not entirely) attributed to their work on preemption, in fact.
nVidia in no way had async compute in DX11. While their drivers did a great job of optimizing the graphics queue, their hardware was so horribly inefficient with concurrent computation and intensive graphics that they had to create time windows for PhysX. There, though, they had so much control that their hardware's weaknesses weren't really an issue - therefore not a weakness at all, really, when it came to gaming.
The new APIs have simply revealed where AMD was strong - in context switching. AMD's original GCN is an order of magnitude faster than nVidia's Maxwell when it comes to context switching - which occurs in preemption - so nVidia is just playing catch-up. Pascal helps to remove a small part of AMD's advantage here.
Further, AMD uses dedicated schedulers (ACEs) to help with asynchronous compute - the RX 480 has cut their numbers in half, so it is the worst-case scenario for async compute scaling moving forward (well, Polaris 11 should be worse still). Fury is seeing 50% scaling with Vulkan...
Scali - Friday, July 15, 2016 - link
"Both AMD and Intel have been FAR ahead of nVidia in this area, nVidia is desperately attempting to seek parity."Say what?
Afaik Intel does not implement async compute yet in DX12.
nVidia has introduced async compute in the form of HyperQ for CUDA on Kepler.
I suggest you read up on HyperQ and what it's supposed to do.
TL;DR: It solves the problem of running multiple processes (or threads) with GPGPU tasks on a single GPU, by having multiple work queues that the processes/threads can submit work to. In the case of Kepler, these were 32 separate queues, so up to 32 streams of compute workloads could be sent to the GPU in parallel, and the GPU would execute these concurrently.
DX12 async compute is the same principle, except that they seamlessly integrate it with graphics as well, so one of the queues can accept both graphics and compute workloads, where the other queues accept only compute workloads (CUDA was compute-only, and could be used in parallel with OpenGL or Direct3D for graphics).
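To make that two-queue pattern concrete, a minimal C++ sketch of the DX12 submission flow is shown below (all queue, command list, and fence objects are hypothetical placeholders assumed to have been created elsewhere; error handling omitted):

```cpp
#include <d3d12.h>

// One DIRECT queue takes graphics (and compute) work, a second COMPUTE queue
// takes compute-only work, and a fence expresses the cross-queue dependency.
void SubmitFrame(ID3D12CommandQueue* directQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* gfxIndependent, // graphics work not using the compute results
                 ID3D12GraphicsCommandList* gfxDependent,   // graphics work consuming the compute results
                 ID3D12GraphicsCommandList* computeWork,
                 ID3D12Fence* fence,
                 UINT64& fenceValue)
{
    // Start the compute work on its own queue and signal a fence when it completes.
    ID3D12CommandList* c[] = { computeWork };
    computeQueue->ExecuteCommandLists(1, c);
    computeQueue->Signal(fence, ++fenceValue);

    // Independent graphics work is free to run concurrently with the compute queue.
    ID3D12CommandList* g0[] = { gfxIndependent };
    directQueue->ExecuteCommandLists(1, g0);

    // Only the dependent graphics work waits (a GPU-side wait) for the compute fence.
    directQueue->Wait(fence, fenceValue);
    ID3D12CommandList* g1[] = { gfxDependent };
    directQueue->ExecuteCommandLists(1, g1);
}
```

Whether the GPU actually executes the two queues concurrently, and at what granularity it can switch between them, is the hardware/driver question the rest of this thread argues about.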
I understand that 99.9999999% of the people online don't understand more than 1% of this stuff.
I just wish they would not post on the subject and spread all sorts of damaging misinformation.
Yojimbo - Friday, July 15, 2016 - link
But I think he's right that AMD did do a lot of early work in the area. When AMD bought ATI their hope was to create a heterogeneous processor combining both the CPU and the GPU. Therefore these issues were something they immediately came up against. I think maybe they didn't optimize their GCN design for DirectX 11 or the use of their GPUs as coprocessors because they had more ambitious goals. Those goals fell through and it wasn't until Mantle that some of the features of their architecture could be taken advantage of. NVIDIA was more conservative in their approach and I'm sure these issues were on their long-term radar, but they weren't directly concerned with them because they believed that the latency mismatch between GPUs and CPUs and the memory technologies available meant that a marriage of the CPU with the GPU was not advantageous. Plus they didn't have a good means of doing such a marriage themselves (although AMD reportedly wanted to buy out NVIDIA rather than ATI, and NVIDIA declined).
At this point, though, I think NVIDIA with an x86 license would be a pretty interesting competitor to Intel. I'm guessing that a big reason KNL was made a bootable processor was to allow Xeon Phi high speed access to main memory without allowing it to NVIDIA's GPUs. It was so important to them that they sacrificed the ability to have flexible node topologies (multiple processors per node).
Scali - Friday, July 15, 2016 - link
Well, given that GCN and Kepler are about as old, and both implement pretty much the same async compute technology, I don't think you can say one did more 'early work' than the other.
If anyone did any 'early work', then I would say it is nVidia, who pretty much single-handedly invented compute shaders and everything that goes with it, with the first version of CUDA in the 8800.
Async compute was just an evolution of CUDA, as they found that traditional HPC tended to work with MPI solutions with multiple processes. So this led to HyperQ. I would certainly not say that nVidia is the 'conservative one' when it comes to compute.
AMD isn't a big player in the HPC market, so I'm not quite sure what they wanted to achieve with their ACEs. I don't think it has anything to do with heterogeneous processing though.
bluesoul - Saturday, July 16, 2016 - link
Asynchronous compute + graphics does not work due to the lack of proper support under HyperQ for Resource Barriers, which prevent further command execution until the GPU has finished doing any work needed to convert the resources as requested. Without Resource Barrier support, the HyperQ implementation cannot be used by DX12 in order to execute Graphics and Compute commands in parallel.
Scali - Saturday, July 16, 2016 - link
Do you have any sources for these claims?
The nVidia documentation I read states that command lists are split up at any fences. As long as you have good pre-emption/scheduling, it's a perfectly workable solution.
See also: https://developer.nvidia.com/dx12-dos-and-donts
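For context on the terminology, a DX12 resource barrier is simply a command recorded into a command list that transitions a resource from one usage state to another; cross-queue dependencies are expressed with fences instead. A minimal, hypothetical C++ sketch ('list' and 'buffer' are placeholders assumed to exist):

```cpp
#include <d3d12.h>

// Transition a buffer that a compute pass wrote (as a UAV) into a state
// that a subsequent graphics pass can read from in a pixel shader.
void TransitionForGraphicsRead(ID3D12GraphicsCommandList* list, ID3D12Resource* buffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = buffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    list->ResourceBarrier(1, &barrier);
}
```

Whether a particular GPU can honour such barriers cheaply while other queues keep running is an implementation detail of the hardware and driver, which is what the disagreement above is really about.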
donkay - Thursday, July 14, 2016 - link
Yes, both Nvidia (Pascal) and AMD get better performance with async. The possible performance increase is just much bigger for AMD.
Yojimbo - Thursday, July 14, 2016 - link
True, but I don't think it's well-understood in the fan community at large whether AMD's greater performance benefit from asynchronous compute is because of a superior asynchronous compute implementation or rather because AMD's architecture is simply less efficient at filling its pipelines than NVIDIA's and asynchronous compute simply closes that gap somewhat by taking some of the "air" out of the pipelines.
In the end what matters is the comparative performance in real-world situations of course, regardless of how it gets there. But as people like to say that AMD has an advantage in DX12, or that NVIDIA sucks at DX12, the accuracy of those statements hinges partly on the answer to the above question. If the case is the latter, that DX12 helps to compensate for inefficiencies in AMD's GPUs through additional work by the developer, then such statements are not accurate.
Ryan Smith - Thursday, July 14, 2016 - link
"Wait, isn't Nvidia doing async, just via pre-emption? "No. They are doing async - or rather, concurrency - just as AMD does. Work from multiple tasks is being executed on GTX 1070's various SMs at the same time.
Pre-emption, though a function of async compute, is not concurrency, and is best not discussed in the same context. It's not what you use to get concurrency.
bcronce - Friday, July 15, 2016 - link
Nvidia is doing inter-SM concurrency, AMD supports intra-SM concurrency. AMD can do single clock-cycle context switching to fill in pipeline holes in the SMs. This comes at a transistor cost and is only a benefit if there are a lot of "holes".
My limited understanding is that the graphics pipeline is riddled with holes, leaving an average of 10%-30% of untapped compute power even if all of the SMs are in use. AMD's async allows their compute engine to fill in these holes.
Based on how competitive Nvidia is, the nature of graphics processing may better benefit from focusing on making the holes smaller rather than filling them. But it's hard to tell if one architecture or the other is better, or if it's the game engines or the drivers.
Scali - Friday, July 15, 2016 - link
"AMD can do single clock-cycle context switching to fill in pipeline holes in the SMs."Pascal can do this as well.
"My limited understanding is that the graphics pipeline is riddled with holes, leaving an average of 10%-30% of untapped compute power even if all of the SMs are in use."
All signs point to nVidia having better efficiency than AMD does. If you look at hardware with the same TFLOPS rating, you'll find that nVidia's hardware delivers significantly better performance than AMD.
Which would imply that nVidia has fewer 'holes' to begin with... Which implies that async compute may not get the same gains on their hardware as AMD would.
powerarmour - Thursday, July 14, 2016 - link
Does look slightly odd, especially considering the recent Doom/Vulkan numbers.
Eden-K121D - Thursday, July 14, 2016 - link
BTW Doom is a real world scenario instead of being a benchmark.
donkay - Thursday, July 14, 2016 - link
Keep in mind that AMD opengl drivers are notoriously bad. So Vulkan bringing a big boost wasn't that huge of a surprise. Dx 11 and Dx 12 vs Vulkan is an entirely different story.
powerarmour - Thursday, July 14, 2016 - link
AMD's OpenGL performance has nothing to do with their Vulkan results, compare Vulkan to Vulkan, not OpenGL to Vulkan.
I'd be very surprised if, when fully optimised, DX12 was any faster than Vulkan in a real world scenario.
donkay - Thursday, July 14, 2016 - link
You said Doom/Vulkan numbers. Comparing Doom is comparing OpenGL and Vulkan performance, so obviously OpenGL optimization is a huge part of how big those numbers were, which is why I find it silly people even compare it to this benchmark with async on/off. Both Vulkan and DX12 have way more optimizations than just async; Doom's Vulkan async on vs off numbers are very close to what happens in this bench.
powerarmour - Thursday, July 14, 2016 - link
Vulkan has been built from the ground up as a new API, I still don't get your point?
donkay - Thursday, July 14, 2016 - link
I don't get your point saying that these numbers look odd considering the Doom/Vulkan numbers. For the reasons I just explained.
powerarmour - Thursday, July 14, 2016 - link
I don't get your point about OpenGL, just simply look at the Vulkan numbers for AMD and Nvidia, and the order in performance between the cards.
FORTHEWIND - Thursday, July 14, 2016 - link
Um no. Vulkan is based on Mantle. Get your facts checked please.
Scali - Friday, July 15, 2016 - link
How so? The DOOM Vulkan FAQ states that async compute was not enabled on nVidia cards yet. So you can't compare it to the async results of this benchmark.
bobacdigital - Friday, July 15, 2016 - link
Yes you can compare the results... AMD only gets Async Compute when TSAA is the AA of choice (the developers stated that the other AA options will turn off Async Compute until they add support for them). Most of the early benchmarks were using SMAA as the choice of anti-aliasing. So if both AMD and Nvidia use SMAA then you are comparing apples to apples. Even then AMD was still getting a 20 frame boost (Async was adding an additional 10 frames on top of that).
It is true that OpenGL blew for AMD, but the interesting point is that Vulkan runs substantially better on AMD hardware without async, and with async the 480 is within 10% or less of the 1080... The Fury X (last gen card) is beating the 1070 and trading blows with the 1080.
bobacdigital - Friday, July 15, 2016 - link
Sorry, within 10% of the 1070 (not the 1080).
Scali - Friday, July 15, 2016 - link
Another reason why DOOM comparisons are difficult is because DOOM uses AMD's intrinsic shader extension, but nothing equivalent on nVidia.
So the gains on AMD hardware are partly Vulkan, partly AMD-specific shader optimizations, and partly async compute.
On the nVidia side, you purely see OpenGL vs Vulkan. All gains come from better API efficiency.
bluesoul - Saturday, July 16, 2016 - link
So Pascal didn't gain from Async Compute?
Scali - Saturday, July 16, 2016 - link
We don't know yet. The DOOM Vulkan FAQ says async compute is not enabled on nVidia hardware yet, and they're still working on this. So we should expect a future update which enables it, with gains.
But, well, you just want to hear that Pascal can't handle it, right? Even though Time Spy already proves that it does.
Looking at your other comments here, you throw around some half-truths, and some fancy buzzwords with no sources whatsoever, trying to build up some theory that NV/Pascal can't do async compute. It's interesting how many people like you have been active on various forums, with similar propaganda. Makes you wonder if AMD is paying shills/trolls again...
Sadly, I'm an actual dev, with CUDA and DX11/DX12 experience, so I actually know what this stuff is all about. And I'm not fooled by the lies that these people spread.
Sadly, not enough is done to shut these people up.
Yojimbo - Thursday, July 14, 2016 - link
It's a conspiracy, I tell you. Hmm, wait, maybe not. Maybe their facts are just sexist.
tcnasc - Thursday, July 14, 2016 - link
Daniel, could you please share some results for your rig (i7-2600K + GTX 1080)? I'm curious to see how the older CPUs handle the 1080. I have an i5-2500K @ 4.4GHz now and I will get a GTX 1080 soon. Will it be a significant bottleneck?
darkchazz - Thursday, July 14, 2016 - link
Got a score of 6701 (7371 graphics & 4424 physics)
On my rig with an i7 3770K @ 4.4 GHz and Asus Strix GTX 1080 @ 2038/5200
http://www.3dmark.com/spy/7187
And so far my ingame frame rates have been in line with what the 1080 reviews have demonstrated.
I don't have any incentive to upgrade the CPU. Maybe with the next intel die shrink.
tcnasc - Thursday, July 14, 2016 - link
Great, thank you!
amayii - Thursday, July 14, 2016 - link
Here are some Fire Strike results on a 6700K and GTX 1080:
FS: 16617 [http://www.3dmark.com/fs/8667039]
FSE: 9286 [http://www.3dmark.com/fs/8667154]
FSU: 4990 [http://www.3dmark.com/fs/8668003]
ikjadoon - Thursday, July 14, 2016 - link
"To anyone who’s found FireStrike to easy of a benchmark, keep an eye out for Time Spy in the near future."Maybe "too easy"?
Ryan Smith - Thursday, July 14, 2016 - link
That it would be. Thanks!
flazza - Thursday, July 14, 2016 - link
http://www.3dmark.com/3dm/13199624
2x 480's
shabby - Thursday, July 14, 2016 - link
This looks pretty good... if it was 2010.
Geranium - Thursday, July 14, 2016 - link
Does the new 3DMark use real DX12, or RoTR-like DX12, which is not real DX12?
Yojimbo - Thursday, July 14, 2016 - link
Maybe someone would be able to answer your question if it weren't so vague. What is real DX12, exactly?
ddriver - Thursday, July 14, 2016 - link
I suppose it would be "actually making use of dx12 features rather than just running on dx12".
donkay - Thursday, July 14, 2016 - link
I could copy paste parts of the article here, or you could just read the full article. There's more here than just charts, you know.
godrilla - Friday, July 15, 2016 - link
With true low-level API optimization, teraflops finally matter. And yes, nvidia's hardware is already almost fully used; AMD's, on the other hand, has the most to gain.
Yojimbo - Friday, July 15, 2016 - link
I've been thinking about it. I think GCN was AMD's first architecture whose foundations were laid down post-merger. I wonder if AMD built GCN more with Fusion in mind than with DirectX 11 in mind. If that's true then maybe there is hope for AMD when Navi comes out. Perhaps Navi will be AMD's first post-GCN architecture, with foundations laid down after Fusion had already failed.
DirectX 12 seems to expose GCN better than DirectX 11, but GCN is still a lot less efficient in DirectX 12 than Pascal.
JRW - Friday, July 15, 2016 - link
Dang, my old timey i7 920 (overclocked to 3.5GHz) and R9 290X 4GB scored a 3878, didn't expect it that high considering the CPU.
scottjames_12 - Friday, July 15, 2016 - link
Yeah, my old faithful i7-930 (@ 3.99GHz) puts out virtually identical FPS numbers in graphics tests 1 and 2, when compared with a 6700K that has a similarly clocked GTX 970 also. Pretty safe to say those tests are GPU limited, but it is good to know the CPU isn't holding the GPU back.
http://www.3dmark.com/compare/spy/32476/spy/25710#
Oxford Guy - Friday, July 15, 2016 - link
"It's not an apples-to-apples comparison in that they have much different performance levels, but for now it's the best look we can take at async on Pascal."Why is the 480 the best choice for the comparison? Why not add in a 390X, Nano, or something?
Oxford Guy - Friday, July 15, 2016 - link
The best choice for Nvidia, maybe.
Extreme Tech:
"The RX 480 is just one GPU, and we’ve already discussed how different cards can see very different levels of performance improvement depending on the game in question — the R9 Nano picks up 12% additional performance from enabling versus disabling async compute in Ashes of the Singularity, whereas the RX 480 only sees a 3% performance uplift from the same feature."
pencea - Friday, July 15, 2016 - link
I just ran a benchmark test on the new 3DMark Time Spy DX12 with the GTX 1080.
I uploaded the video here for those who are interested to see how the card performs.
https://youtu.be/RF7wfujdT7c
Deders - Friday, July 15, 2016 - link
Wow, and it still dips close to 30fps in the 2nd test. I wonder what AMD's 480 gets at that point.
Deders - Friday, July 15, 2016 - link
Not sure if the stone horse is a reference to one of the sub tests. It's the same pose and colour, with added aliens.
DeadMan3001 - Saturday, July 16, 2016 - link
3DMark developer responds to accusations of cheating in Time Spy:
http://steamcommunity.com/app/223850/discussions/0...
Scali - Saturday, July 16, 2016 - link
Geez, what idiots... 'concurrent' and 'parallel' mean the same thing in this case.
And asynchronous compute doesn't specifically require concurrent execution. 'Asynchronous' just means that tasks can be scheduled to run in the background, and they can set a signal when they have completed. Even if you use a time-slicing approach, running a single task at a time, but switching between tasks at certain intervals, that's still asynchronous.
'Concurrent' means that more than one task is executing at the same time, so multiple tasks run in parallel.
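The same distinction exists outside GPUs. As a minimal C++ illustration (a sketch only, not tied to any GPU API): a task can be asynchronous, completing in the background and signalling via a future, whether or not it actually runs in parallel.

```cpp
#include <future>
#include <iostream>

int HeavyComputation() { return 42; }   // stand-in workload

int main()
{
    // Asynchronous AND concurrent: may run on another thread in parallel.
    std::future<int> parallel = std::async(std::launch::async, HeavyComputation);

    // Asynchronous but NOT concurrent: runs lazily on this thread when the
    // result is requested -- effectively time-slicing, with no parallelism.
    std::future<int> deferred = std::async(std::launch::deferred, HeavyComputation);

    // Both signal completion the same way; only the execution model differs.
    std::cout << parallel.get() + deferred.get() << "\n";
    return 0;
}
```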
Zak - Monday, July 18, 2016 - link
That Steam thread is sad....
Donkegin - Sunday, July 17, 2016 - link
I'm just a little curious why a variety of AMD cards ranging from the 290 all the way to a Fury are performing about the same! I've got a Strix Fury on an FX-9370 and it's getting either the same or outdone by an equivalent CPU or lower with, let's say, an R9 290X! I would've thought it would be higher for the Fury? Then there's the Maxwell cards like the 970 and 980 getting around the same figure, when I would've thought that the Fury and Fury X would've smashed them! With the spate of "GALAX" (an Nvidia-only company) brand advertising all over the tests, I'm wondering whether this is not leaning more in favour of Nvidia cards. Food for thought.
Scali - Sunday, July 17, 2016 - link
You could also look at it from the other side... Pretty much all DX12 stuff so far was developed in cooperation with AMD.
FutureMark is an independent benchmarking organization, unlike game devs that are either in Gaming Evolved or TWIMTBP, and as such, FutureMark works with all vendors on an equal basis: https://www.futuremark.com/business/benchmark-deve...
It could just be that Time Spy is the first DX12 benchmark that is NOT in favour of any vendor.
The Time Spy Technical Guide says they didn't optimize for a specific architecture anyway.