Assessing Cavium's ThunderX2: The Arm Server Dream Realized At Last
by Johan De Gelas on May 23, 2018 9:00 AM EST
Sizing Things Up: Specifications Compared
Thirty-two high-IPC cores in one package sounds promising. But how does the best ThunderX2 compare to what AMD, Qualcomm and Intel have to offer? In the table below we compare the high-level specifications of several top server SKUs.
|Comparison of Major Server SKUs|
|Specification||Cavium ThunderX2 9980||Qualcomm Centriq||Intel Xeon Platinum 8176||Intel Xeon Gold 6148||AMD EPYC (4 dies x 8 cores)|
|Max. number of sockets||2||1||8||4||2|
|Base Frequency||2.2 GHz||2.2 GHz||2.2 GHz||2.4 GHz||2.2 GHz|
|Turbo Frequency||2.5 GHz||2.6 GHz||3.8 GHz||3.7 GHz||3.2 GHz|
|L3 Cache||32 MB||60 MB||38.5 MB||27.5 MB||8x8 MB|
|PCIe 3.0 lanes||56||32||48||48||128|
Astute readers will quickly remark that Intel's top of the line CPU is the Xeon Platinum 8180. However, that SKU with its 205W TDP and $10k+ price tag is not comparable at all to any CPU in the list. We are already going out on a limb by including the 8176, which we feel belongs in this list of maximum core/thread count SKUs. In fact, as we will see further, Cavium positions the Cavium 9980 as "comparable" to the Xeon Platinum 8164, which is essentially the same part as the 8176 but with slightly lower clockspeeds.
However, in terms of performance per dollar, Cavium typically compares its flagship 9980 to the Intel Xeon Gold 6148, against which the pricing of Cavium's CPU is very aggressive. Many of Cavium's benchmarks claim that the fastest ThunderX2 is 30% to 40% ahead of the Xeon 6148, all the while Cavium's offering comes in at $1300 less. That aggressive pricing might explain the increasingly persistent rumors that Qualcomm is not going to enter the server market after all.
When looking at the table above, you can already see some important differences between the contenders. Intel seems to have the most advanced core topology and the highest turbo clockspeed. Meanwhile Qualcomm has the best chances when it comes to performance per watt, and has already published some benchmarking data that underlines this advantage.
Similar to AMD's EPYC, Cavium's ThunderX2 is likely to shine in the "sparse matrix" HPC market. This is thanks to its 33% greater theoretical memory bandwidth and a high core/thread count. However, as we've seen in the case of AMD's design, EPYC's L3 cache is slow once you need data that is not in the local 8 MB cache slice. The ThunderX2, by comparison, is a lot more sophisticated with a dual ring architecture, which seems to be similar to the ring architecture of the Xeon v4 (Broadwell-EP). According to Cavium, this ring structure is able to offer up to 6 TB/s of bandwidth and is non-blocking.
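As a rough illustration of where a 33% theoretical memory bandwidth figure can come from, the sketch below computes peak DRAM bandwidth from the channel count and transfer rate. The channel counts (eight DDR4-2666 channels for the ThunderX2, six for the Skylake-based Xeons) are assumptions that do not appear in this section, and the function name is purely illustrative.

```python
def peak_dram_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                            bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Each DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is
    channels * transfers per second * bytes per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed configurations (not stated in this section of the article).
thunderx2_gbs = peak_dram_bandwidth_gbs(channels=8, mega_transfers_per_s=2666)
xeon_gbs = peak_dram_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)

# Relative advantage of eight channels over six at the same DIMM speed.
advantage = thunderx2_gbs / xeon_gbs - 1

print(f"ThunderX2: {thunderx2_gbs:.1f} GB/s")
print(f"Xeon:      {xeon_gbs:.1f} GB/s")
print(f"Advantage: {advantage:.0%}")
```

Under these assumptions the advantage depends only on the ratio of channel counts (8/6), so it holds at any common DIMM speed. Sustained bandwidth in real workloads will be lower than these theoretical peaks.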
This ring architecture is connected to Cavium's Coherent Processor Interconnect (CCPI2 - at the top of the picture), which runs at 600 Gb/sec. This interconnect links the two sockets/NUMA nodes. Also connected to the ring are the SoC's 56 PCIe 3.0 lanes, which Cavium allocates among 14 PCIe "controllers." These 14 controllers can, in turn, be bifurcated down to x4 or x1 as you can see below.
SR-IOV, which is important for I/O virtualization (Xen and KVM), is also supported.
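To put the interconnect and PCIe figures quoted above into comparable units, here is a small back-of-the-envelope sketch. It assumes the standard PCIe 3.0 signalling rate of 8 GT/s per lane with 128b/130b encoding. Whether the 600 Gb/sec CCPI2 figure is per direction or aggregate is not stated here, so the conversion is only a change of units. All names are illustrative.

```python
PCIE3_GIGATRANSFERS_PER_S = 8.0   # PCIe 3.0 raw rate per lane, per direction
PCIE3_ENCODING = 128 / 130        # 128b/130b line-encoding efficiency


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Theoretical PCIe 3.0 bandwidth in GB/s, per direction."""
    return lanes * PCIE3_GIGATRANSFERS_PER_S * PCIE3_ENCODING / 8


ccpi2_gbs = gbit_to_gbyte(600)          # socket-to-socket interconnect
pcie_total_gbs = pcie3_bandwidth_gbs(56)  # all 56 lanes combined
avg_lanes_per_controller = 56 / 14        # average only, not the actual layout

print(f"CCPI2:          {ccpi2_gbs:.0f} GB/s")
print(f"56 PCIe lanes:  {pcie_total_gbs:.1f} GB/s per direction")
print(f"Lanes per controller (average): {avg_lanes_per_controller:.0f}")
```

The average of four lanes per controller is simple division. It says nothing about how Cavium actually groups lanes before bifurcation.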
Wilco1 - Wednesday, May 23, 2018
You might want to study RISC and CISC first before making any claims. RISC doesn't use more instructions than CISC. Vector instructions are actually quite similar on most ISAs. In fact I would say the Neon ones are more powerful and more general due to being well designed rather than added ad-hoc.
HStewart - Wednesday, May 23, 2018
The following site explains the difference using a simple multiply operation: where a CISC architecture can do it in a single instruction, RISC would need to use multiple instructions.
Of course, as time moved on, RISC chips added more complex operations, and CISC also found ways of breaking more complex CISC instructions into smaller RISC-like microcode, increasing the chip's ability to multitask in the pipeline.
Wilco1 - Thursday, May 24, 2018
The example was about load/store architecture, not multiply. In reality almost all instructions use registers (even on CISCs) since memory is too slow, so it's not a good example of what happens in actual code. The number of executed instructions on large applications is actually very close. The key reason is that compilers avoid all the complex instructions on x86 and mostly use register operations, not memory.
Kevin G - Tuesday, May 29, 2018
Raw instruction count isn't a good metric to determine the difference between RISC and CISC, especially as both have evolved to include various SIMD and transactional extensions.
The big thing for RISC is that it only supports a handful of instruction formats, generally all of the same length (traditionally 4 bytes)*, and has alignment rules in place. x86 on the other hand leverages a series of prefixes to enhance instructions, which permits lengths up to 15 bytes. On the flip side, there are also x86 instructions that consume a single byte. This also means x86 doesn't have the alignment rules that RISC chips generally adhere to.
*ARM does offer some compressed instruction formats in Thumb/Thumb2, but those are also of a fixed length. 16 bit Thumb instructions are half the size of 32 bit ARM instructions and have alignment rules as well.
Modern x86 is radically different internally than its philosophical lineage. x86 instructions are broken down into micro-ops which are RISC-like in nature. These decoded instructions are now being cached to bypass the complex and power hungry decode stages. Compare this to some ARM cores where some instructions do not have to be decoded. While having a simpler decode doesn't directly help with performance, it does impact power consumption.
However, I would differ and say that ARM's FPU and vector history has been rather troubled. Initially ARM didn't specify an FPU but rather a method to add coprocessors. This led to 3rd parties producing ARM cores with incompatible FPUs. It wasn't until recently that ARM themselves put their foot down and mandated NEON as the one to rule them all, especially in 64 bit mode.
peevee - Wednesday, May 23, 2018
The whole RISC vs CISC distinction has been outdated for at least 20 years. Both now include a shi(p)load of instructions, far outnumbering original CISC processors like 68000 and 8088 (from the epoch of the whole CISC vs RISC discussion), and both have a lot of architectural registers (which on speculative OoO CPUs are not even the same as real register files). ARMv8 for example includes NEON instructions, which is like... "AVX-128" (or SSE3 or smth).
A lot of instructions means that both have to have huge decoders, which limits how small the CPU can be (because any reduction in other hardware decreases performance faster than cost). For 64-bit ARMv8.2 it is very unlikely that an implementation can be made smaller than A55, and it is a huge core (in transistors) compared to even Pentium, let alone 8088.
HStewart - Wednesday, May 23, 2018
I think the big difference is between SIMD technologies: even though ARM has included them, its instructions are not as wide as Intel's or AMD's. The following link appears to have a good comparison of SIMD width across chips. To me it looks like AMD is on AVX level 8/16 instead of 16/32 in current chips while ARM including Neon is 4 Wide which is actually less than Core 2 SSE instructions from 10 years ago.
It is also interesting to note the Ryzen stats: I heard that AMD implements AVX 256 by combining two 128 together.
One thing is that both Intel and AMD CPUs have come a long way since 20 years ago. In fact even today's Atoms can outrun most Core 2 CPUs from 10 years ago - not my Xeon 5160, however.
ZolaIII - Thursday, May 24, 2018
It's 2x128 NEON SIMD per ARM A75 core which goes into your smartphone.
Even with smaller SIMD, utilising TBL, QC Centriq is able to beat a Xeon Gold.
Wilco1 - Thursday, May 24, 2018
Modern Arm cores have 2-3 128-bit SIMD units, so 16-24 SP FLOPS/cycle. About half of Skylake theoretical flops, and yet they can match or beat Skylake on many HPC codes. Size is not everything...
peevee - Thursday, May 24, 2018
"ARM including Neon is 4 Wide which is actually less than Core 2 SSE instructions from 10 years ago"
How is it less? It is the same 128 bits, 2x64 or 4x32 or 2x16...
And AMD combines 2 AVX-256 operations (not 2 128-bit SSEs) to get AVX-512.
patrickjp93 - Friday, May 25, 2018
AMD does NOT have AVX-512. They combine 2 128s into a 256 on Ryzen, ThreadRipper, and Epyc.