64 Cores of Rendering Madness: The AMD Threadripper Pro 3995WX Review
by Dr. Ian Cutress on February 9, 2021 9:00 AM EST
Knowing your market is a key fundamental of product planning, marketing, and distribution. There’s no point creating a product with no market, or finding you have something amazing but offer it to the wrong sort of customers. When AMD started offering high-core count Threadripper processors, the one market that took as many as they could get was the graphics design business – visual effects companies and those focused on rendering loved the core count, the memory support, all the PCIe lanes, and the price. But if there’s one thing more performance brings, it’s the desire for even more performance. Enter Threadripper Pro.
computational graphics goes brrrrrrr
There are a number of industries where an enthusiast looking in from the outside might assume that using a CPU is old fashioned, and ask why that industry hasn’t moved fully to GPU accelerators. One of the big ones is machine learning: despite the push to dedicated machine learning hardware and lots of big businesses doing ML on GPUs, most machine learning today is still done on CPUs. The same is still true with graphics and visual effects.
The reason behind this typically comes down to the software packages in use, and the programmers in charge.
Developing software for CPUs is easy, because that is what most people are trained on. Optimization packages for CPUs are well established, and even for upcoming specialist instructions, these can be developed in simulated environments. A CPU is designed to handle almost anything thrown at it, even super bad code.
By contrast, GPU compute is harder. It isn’t as difficult as it used to be, as there are wide arrays of libraries that enable GPU compilation without having to know too much about how to program for a GPU. The difficulty lies in architecting the workload to take advantage of what a GPU has to offer. A GPU is a massive engine that performs the same operation on hundreds of parallel threads at the same time. It also has a very small cache, and accesses to GPU memory have long latency, which is hidden by having even more threads in flight at once. If the compute part of the software isn’t amenable to that sort of workload, such as being structurally more linear, then spending 6 months redeveloping for a GPU is wasted effort. And even if the math works out better on a GPU, rebuilding a 20-year-old codebase (or older) for GPUs still requires a substantial undertaking by a group of experts.
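As a rough illustration of why long memory latency demands many threads in flight, Little's Law (concurrency = latency × throughput) gives a back-of-the-envelope estimate. This is only a sketch: the latency, bandwidth, and request-size figures below are hypothetical round numbers chosen for illustration, not measurements of any real GPU or CPU.

```python
# Little's Law sketch: how much work must be in flight to hide memory latency.
# All hardware numbers below are illustrative assumptions, not measurements.


def bytes_in_flight(latency_ns: float, bandwidth_gb_s: float) -> float:
    """Bytes that must be outstanding to keep a memory system busy.

    Little's Law: concurrency = latency * throughput.
    1 GB/s equals 1 byte per nanosecond, so the units cancel directly.
    """
    return latency_ns * bandwidth_gb_s


def requests_in_flight(latency_ns: float, bandwidth_gb_s: float,
                       bytes_per_request: int) -> float:
    """Outstanding requests of a given size needed to hide the latency."""
    return bytes_in_flight(latency_ns, bandwidth_gb_s) / bytes_per_request


# Hypothetical GPU: 500 ns memory latency, 1000 GB/s, 128-byte requests.
gpu_requests = requests_in_flight(500, 1000, 128)
# Hypothetical CPU: 100 ns memory latency, 200 GB/s, 64-byte cache lines.
cpu_requests = requests_in_flight(100, 200, 64)

print(f"GPU: ~{gpu_requests:.0f} requests in flight")
print(f"CPU: ~{cpu_requests:.0f} requests in flight")
```

Under these assumed numbers the GPU needs roughly an order of magnitude more outstanding memory requests than the CPU, which is why a workload that cannot expose that much parallelism tends to leave a GPU idle.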
GPU compute has come on in leaps and bounds since I did it in the late 2000s. But the fact remains that there are still a number of industries that rely on a mix of CPU/GPU throughput. These include machine learning, oil and gas, financial, medical, and the one we’re focusing on today: visual effects.
A visual effects design and rendering workload is a complex mix of dedicated software platforms and plugins. Software like Cinema4D, Blender, Maya, and others rely on the GPU to showcase a partially rendered scene for these artists to work on in real time, also relying on strong single core performance, but the bulk of compute for the final render will depend on what plugins are being used for that particular product. Some plugins are GPU accelerated, such as Blender Cycles, and the move to more GPU-accelerated workloads is taking its time – ray tracing accelerated design is an area that is getting a lot of GPU attention, for example.
There are always questions as to which method produces the best image: there’s no point using a GPU to accelerate the rendering time if it adds additional noise or reduces the quality. A film studio is more likely to prioritize a slow higher-quality render on CPUs over a fast noisy one on GPUs, or alternatively, render a lower resolution image and then upscale with trained AI. Based on our conversations with OEMs that supply the industry, we've been told that a number of studios will outright say that rendering their workflow on a CPU is the only way they do it. The other angle is memory, as the right CPU can have 256 GB to 4 TB of DRAM available, whereas the best GPUs can only supply 80 GB (and those are the super expensive ones).
The point I’m making here is that VFX studios still prefer CPU compute, and the more the better. When AMD launched its new Zen-based processors, particularly the 32 and 64 core count models, these were immediately earmarked as potential replacements for the Xeons being used in these VFX studios. AMD’s parts prioritized FP compute, a key element in VFX design, and having double the cores per socket was also a winner, combined with the large amount of cache per core. This latter part meant that even though the first high-core count parts had a non-uniform memory architecture, it wasn’t as much of an issue as with some other compute processes.
A number of VFX companies, as far as we understand, focused on AMD’s Threadripper platform over the corresponding EPYC. When both of these parts first arrived on the market, it was very easy for VFX studios to invest in under-the-desk workstations built on Threadripper, while EPYC was more for server rack installations and not so much for workstations. Roll around to Threadripper 3000 and EPYC 7002, and now there are 64 cores, 64 PCIe 4.0 lanes, and lots of choice. VFX studios still went for Threadripper, mostly because it offered a higher 280 W power limit in something that could easily be sourced from system integrators like Armari that specialize in high-compute under-desk systems. They also asked AMD for more.
AMD has now rolled out its Threadripper Pro platform, addressing some of these requirements. While VFX is always core compute focused, the TR Pro now gives double the PCIe lanes, double the memory bandwidth, support for up to 2TB of memory, and Pro-level admin support. These PCIe lanes could be extended to local storage (always important in VFX) as well as large RAMDisks, and the admin support through DASH helps keep the company systems managed together appropriately. AMD’s Memory Guard is also in its Pro line of parts, which is designed to enable full memory encryption.
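To put a number on "double the memory bandwidth", here is a small sketch of the theoretical peak figure for a 4-channel versus an 8-channel DDR4-3200 configuration. It assumes the standard 64-bit (8-byte) data bus per DDR4 channel and ignores ECC bits and real-world efficiency, so these are upper bounds rather than measured results. The function name is made up for this example.

```python
# Theoretical peak DRAM bandwidth = transfers per second * bytes per transfer * channels.
# Assumption: each DDR4 channel has a 64-bit (8-byte) data bus; ECC bits are not counted.

BYTES_PER_TRANSFER = 8


def peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a DDR configuration."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


CORES = 64  # both flagship parts are 64-core

quad_channel = peak_bandwidth_gb_s(3200, 4)   # 4 x DDR4-3200 (Threadripper)
octa_channel = peak_bandwidth_gb_s(3200, 8)   # 8 x DDR4-3200 (Threadripper Pro)

print(f"4 channels: {quad_channel:.1f} GB/s total, {quad_channel / CORES:.1f} GB/s per core")
print(f"8 channels: {octa_channel:.1f} GB/s total, {octa_channel / CORES:.1f} GB/s per core")
```

Spread across 64 cores, the per-core share of peak bandwidth is small in either case, which is why the extra channels matter most for workloads that stream large datasets through every core at once.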
Beyond VFX, AMD has cited world leadership compute with TR Pro for product engineering with Creo, 3D visualization with KeyShot, model design in architecture with Autodesk Revit, and data science, such as oil and gas dataset analysis, where the datasets are growing into the hundreds of GB and require substantial compute support.
Threadripper Pro vs Workstation EPYC (WEPYC)
Looking at the benefits that these new processors provide, it’s clear to see that these are more Workstation-style EPYC parts than ‘enhanced’ Threadrippers. Here’s a breakdown:
|AMD Zen 2 High-End Comparison|
|1P Flagship||TR 3990X||TR Pro 3995WX||EPYC 7702P|
|TDP||280 W||280 W||200 W|
|Base Freq||2900 MHz||2700 MHz||2000 MHz|
|Turbo Freq||4300 MHz||4200 MHz||3350 MHz|
|L3 Cache||256 MB||256 MB||256 MB|
|DRAM||4 x DDR4-3200||8 x DDR4-3200||8 x DDR4-3200|
|DRAM Capacity||256 GB||2 TB, ECC||4 TB, ECC|
|PCIe||4.0 x56 + chipset||4.0 x120 + chipset||4.0 x128|
To get these new parts starting from EPYC, all AMD had to do was raise the TDP to 280 W, and cut the DRAM support. If we start from a Threadripper base, there are 3-4 substantial changes. So why is this called Threadripper Pro, and not Workstation EPYC?
We come back to the VFX studios again. Having already bought in to the Threadripper branding and way of thinking, keeping these parts as Threadripper helps smooth that transition – this vertical had kind of already said they preferred Threadripper over EPYC, from what we are told, and so keeping the naming consistent means that there is no real re-education to do.
The other element is that the EPYC processor line is somewhat fractured: there are standard versions, high performance H models, high frequency F models, and then a series of custom designs under B, V, and others for specific customers. By keeping this new line as Threadripper Pro, it keeps it all under one umbrella.
Threadripper Pro Offerings: 12 core to 64 core
AMD announced these processors in the middle of last year, with the Lenovo ThinkStation P620 as the launch platform. From my experience, the ThinkStation line is very well designed, and we’re testing our 3995WX in a P620 today.
|AMD Ryzen Threadripper Pro|
|Model||Cores / Threads||Base Freq (MHz)||Turbo Freq (MHz)||Chiplets||L3 Cache||TDP||Price|
|3995WX||64 / 128||2700||4200||8 + 1||256 MB||280 W||$5490|
|3975WX||32 / 64||3500||4200||4 + 1||128 MB||280 W||$2750|
|3955WX||16 / 32||3900||4300||2 + 1||64 MB||280 W||$1150|
|3945WX||12 / 24||4000||4300||2 + 1||64 MB||280 W||*|
|*Unsure if this is a special OEM model|
When TR Pro was announced with Lenovo, we weren’t sure if any other OEM would have access to Threadripper Pro. When we asked OEMs earlier that year about it, before we even knew if TR Pro was a real thing, they stated that AMD hadn’t even marked the platform on their roadmap, which we reported at the time. We have since learned that Lenovo had a 6-month exclusive, and information was only supplied to other vendors (ASUS, GIGABYTE, Supermicro) after it had been announced.
To that end, AMD has since announced that Threadripper Pro is coming to retail, both for other OEMs to design systems and for end-users to build their own. Despite using the same LGA4094 socket as the other Threadripper and EPYC processors, TR Pro will be locked down to WRX80 motherboards. We currently know of three: models from Supermicro and GIGABYTE, plus the ASUS Pro WS WRX80E-SAGE SE Wi-Fi, which we have had in house for a short hands-on, although we weren’t able to test it.
Of the four processors listed above, the top three are going on sale. It’s worth noting that only the 64-core comes with 256 MB of L3 cache, while the 32-core comes with 128 MB of L3. AMD has ensured that these chiplet designs only use as many chiplets as is absolutely necessary, keeping L3 cache per core consistent as well as the 8 cores per chiplet (the EPYC product line varies this a bit).
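The cache and chiplet arithmetic can be checked directly from the table above. The sketch below reads the "8 + 1" style column as compute chiplets plus one IO die, and assumes 32 MB of L3 per Zen 2 compute chiplet. The variable names are made up for this example.

```python
# Per-chiplet and per-core figures for the Threadripper Pro line-up.
# Assumptions: "N + 1" in the table means N compute chiplets plus one IO die,
# and each Zen 2 compute chiplet carries 32 MB of L3 cache.

L3_PER_CHIPLET_MB = 32

# (model, cores, compute chiplets, total L3 in MB), taken from the table
PARTS = [
    ("3995WX", 64, 8, 256),
    ("3975WX", 32, 4, 128),
    ("3955WX", 16, 2, 64),
    ("3945WX", 12, 2, 64),
]

summary = {}
for model, cores, chiplets, l3_mb in PARTS:
    summary[model] = {
        "cores_per_chiplet": cores / chiplets,
        "l3_per_core_mb": l3_mb / cores,
        # True when the listed L3 equals chiplets * 32 MB
        "l3_matches_chiplets": chiplets * L3_PER_CHIPLET_MB == l3_mb,
    }

for model, row in summary.items():
    print(f"{model}: {row['cores_per_chiplet']:.0f} cores/chiplet, "
          f"{row['l3_per_core_mb']:.2f} MB L3/core")
```

The three retail parts come out at 8 cores per chiplet and 4 MB of L3 per core, while the 12-core 3945WX sits at 6 cores per chiplet and therefore has slightly more L3 per core.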
The fourth processor, the 12-core, would appear to be an OEM-only specific processor for prebuilt systems.
Threadripper Pro versus The World
These Threadripper Pro offerings are designed to compete against two segments. The first is AMD itself, showing anyone who is using a high-end professional system built on first generation Zen hardware that there is a lot of performance to be had. The second is Intel workstation customers, either using single socket Xeon W (which tops out at 28 cores), or a dual socket Xeon system that costs more or uses a lot more power simply because it is dual socket, and that also has a non-uniform memory architecture.
We have almost all these in this test (we don't have the 7702P, but we do have the 7742), and realistically these are the only processors that should be considered if the 3995WX is an option for you:
|3995WX Comparison Offerings|
|Model||Cores||Price||Sockets||TDP||Base Freq (MHz)||Turbo Freq (MHz)||DRAM||PCIe||Max DRAM|
|TR Pro 3995WX||64C||$5490||1P||280W||2700||4200||8x3200||128x 4.0||2 TB|
|TR 3990X||64C||$3990||1P||280W||2900||4300||4x3200||64x 4.0||¼ TB|
|EPYC 7702P||64C||$4425||1P||200W||2000||3350||8x3200||128x 4.0||4 TB|
|EPYC 7742||64C||$6950||2P||225W||2250||3400||8x3200||128x 4.0||4 TB|
|Xeon 6258R||28C||$3950||2P||205W||2700||4000||6x2933||48x 3.0||1 TB|
|Xeon W-3175X||28C||$2999||1P||255W||3100||4300||6x2933||48x 3.0||½ TB|
Intel tops out at 28 cores, and there is no getting around that. Technically Intel has the AP processor line that goes up to 56 cores, however these are for specialist systems and we haven’t had one physically sent to us for testing. Those are also $20k+ per CPU, and are effectively two CPUs bolted together in one package.
The AMD comparison points are the best Threadripper option and the best available EPYC processor, albeit the 2P version. The best comparison here would be the 7702P, the single socket variant and much more price competitive, however we haven’t got this in for testing. Instead we have AMD's EPYC 7742, which is the dual socket version but slightly higher performance.
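For a quick sense of how the list prices in the comparison table relate to core count, the sketch below divides each price by its core count. It uses only the prices and core counts from the table, and says nothing about platform cost, clock speed, or per-core performance. The names are made up for this example.

```python
# List price per core for the comparison offerings.
# Inputs are the core counts and prices from the comparison table;
# this ignores motherboard, memory, and performance differences.

OFFERINGS = [
    ("TR Pro 3995WX", 64, 5490),
    ("TR 3990X", 64, 3990),
    ("EPYC 7702P", 64, 4425),
    ("EPYC 7742", 64, 6950),
    ("Xeon 6258R", 28, 3950),
    ("Xeon W-3175X", 28, 2999),
]

price_per_core = {name: price / cores for name, cores, price in OFFERINGS}

# Print from cheapest to most expensive per core
for name, dollars in sorted(price_per_core.items(), key=lambda item: item[1]):
    print(f"{name:<14} ${dollars:7.2f} per core")
```

By this crude measure the 3995WX sits between the consumer 3990X and the dual-socket-capable EPYC 7742, with both Xeons costing more per core than any of the single-socket AMD parts.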
|AMD TR Pro||TR Pro
|AMD TR||TR 3990X||MSI
|EPYC 7742||Supermicro H11DSI||BIOS
|Xeon Gold 6258R||ASUS ROG
|GPU||Sapphire RX 460 2GB (CPU Tests)|
|PSU||Various (inc. Corsair AX860i)|
|SSD||Crucial MX500 2TB|
|Silverstone SST-FHP141-VF 173 CFM fans also used. Nice and loud.|
We must thank the following companies for kindly providing hardware for our multiple test beds. Some of this hardware is not in this test bed specifically, but is used in other testing.
|Hardware Providers for CPU and Motherboard Reviews|
|RX 460 Nitro||RTX 2080 Ti|
|Crucial SSDs||Corsair PSUs|
Users interested in the details of our current CPU benchmark suite can refer to our #CPUOverload article which covers the topics of benchmark automation as well as what our suite runs and why. We also benchmark much more data than is shown in a typical review, all of which you can see in our benchmark database. We call it ‘Bench’, and there’s also a link on the top of the website in case you need it for processor comparison in the future.
avb122 - Tuesday, February 9, 2021
Those cases do not matter unless you are checking that the result is the same as a golden reference. Otherwise the image it creates is just as if the object it was rendering moved 10 micrometers. To our brain it doesn't matter.
Being off by one bit with FP32 for geometry is about the same magnitude as modeling light as a particle instead of a wave. For color intensity, one bit of FP32 is less than one photon in real world cases.
But, CPUs and GPUs all get the same answer when doing the same FP32 arithmetic. The programmer can choose to do something else like use lossy texture compression or goofy rounding modes.
avb122 - Tuesday, February 9, 2021
It's not because of the hardware. AMD and NVIDIA's GPUs have IEEE compliant FPUs. So, they get the same answer as the CPU when using the same algorithm.
With CUDA, the same C or C++ code doing computations can run on the CPU and GPU and get the same answer.
The REAL reasons to not use a GPU are that the non-compute parts (threading, memory management, synchronization, etc.) are different on the GPU and not all GPUs support CUDA. Those are very good reasons. But it is not about the hardware. It is about the software ecosystem.
Also GPUs do not have a tiny amount of cache. They have more total cache than a CPU. The ratio of "threads" to cache is lower. That requires changing the size of the block that each "thread" operates on. Ultimately, GPUs have so much more internal and external bandwidth than a CPU that only in extreme cases, where everything fits in the CPU's L1 caches but not in the GPU's register file, can a CPU have more bandwidth.
Ian's statement about wanting 36 bits so that it can do 12-bit color is way off. I only know CUDA and NVIDIA's OpenGL. For those, each color channel is represented by a non-SIMD register. Each color channel is then either an FP16 or FP32 value (before neural networks GPUs were not faster at FP16, it was just for memory capacity and bandwidth). Both cover 12-bit color space. Remember, games have had HDR for almost two decades.
Dug - Tuesday, February 9, 2021
It's software.
But sometimes you don't want perfect. It can work to your benefit depending on what end results you view and interpret.
Smell This - Tuesday, February 9, 2021
Paragraph below the first image
**Results for Cinebench R20 are not comparable to R15 or older, because both the scene being used is different, but also the updates in the code bath. **
I do like my code clean ...
alpha754293 - Tuesday, February 9, 2021
It's a pity about the processor and the platform: you can buy a used dual EPYC 7702 server and still reap more multithreaded performance from its 128 cores/256 threads than you would be able to get out of this processor.
I'd wished that this review actually included the results of a dual EPYC 7702/7742 system for the purposes of comparing the two, as I think that the dual EPYC 7702/7742 would still outperform this Threadripper Pro 3995WX.
Duncan Macdonald - Tuesday, February 9, 2021
Given the benchmarks and the prices, the main reason for using the Threadripper Pro rather than the plain Threadripper is likely to be the higher memory capacity (2TB vs 256GB).
Even a small overclock on a standard Threadripper would allow it to be faster than a non-overclocked Threadripper Pro for any application that fits into 256GB.
twtech - Tuesday, February 9, 2021
There are a couple of other pretty significant differences that matter perf-wise in some scenarios - the Pro has 8-channel memory support, and more PCIe lanes.
Significant differences not directly tied to performance include registered ECC support, and management tools for corporate security, which actually matters quite a bit with everyone working remotely.
WaltC - Tuesday, February 9, 2021
On the whole, a nice review...;)
Yes, it's fairly obvious that one CPU core does not equal one GPU core, as comparatively, the latter is wide and shallow and handles fewer instructions, IPC, etc. GPU cores are designed for a specific, narrow use case, whereas CPU cores are much deeper (in several ways) and designed for a much wider use case. It's nice that companies are designing programming languages to utilize GPUs as untapped computing resources, but the bottom line is that GPUs are designed primarily to accelerate 3d graphics and CPUs are designed for heavy, multi-use, multithreaded computation with a much deeper pipeline, etc. While it might make sense to use both GPUs and CPUs together in a more general computing case once the specific-case programming goals for each kind of processing hardware are reached, it makes no sense to use GPUs in place of CPUs or CPUs in place of GPUs. AMD has recently made no secret it is diverging its GPU line into one branch with more 3d-acceleration circuitry and less compute circuitry for gaming, and another branch that will include more CU circuitry and less gaming-use 3d-acceleration circuitry. 'bout time.
The software rendering of Crysis is a great example--an old, relatively slow 3d GPU accelerator with a CPU can bust the chops of even 3995WX CPUs *if* the 3995WX is tasked to rendering Crysis sans a 3d accelerator. When the Crysis-engine talks about how many cores and so on it will support, it's talking about using a 3d accelerator *with* a general-purpose CPU. That's what the engine is designed to do, actually. Take the CPU out and the engine won't run at all--try to use the CPU as the API renderer and it's a crawl that no one wants...;) Most of all, using the CPU to "render" Crysis in software has no comparison to a CPU rendering a ray-traced scene, for instance. Whereas the CPU is rendering to a software D3d API in Crysis, ray-tracing is done by far more complex programming that will not be found in the Crysis engine (of course.)
I was surprised to read that Ian didn't think that 8-channel memory would add much of anything to performance beyond 4-channel support....;) Eh? It's the same principle as expecting 4-channel to outperform 2 channel, everything else being equal. Of course, it makes a difference--if it didn't there would be no sense in having 3995WX support 8 channels. No point at all...;)
Oxford Guy - Tuesday, February 9, 2021
Yes, the same principle of expecting a dual core to outperform a single core, which is why single-core CPUs are still dominant.
(Or, we could recognize that diminishing returns only begin to matter at a certain point.)
tyger11 - Tuesday, February 9, 2021
Definitely waiting for the Zen 3 version of the 3955X. I'm fine with 16 cores.