Assessing Cavium's ThunderX2: The Arm Server Dream Realized At Last
by Johan De Gelas on May 23, 2018 9:00 AM EST
Sizing Things Up: Specifications Compared
Thirty-two high-IPC cores in one package sounds promising. But how does the best ThunderX2 compare to what AMD, Qualcomm and Intel have to offer? In the table below we compare the high-level specifications of several top server SKUs.
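|Comparison of Major Server SKUs|
|Feature||Cavium ThunderX2 9980||Qualcomm Centriq||Intel Xeon Platinum 8176||Intel Xeon Gold 6148||AMD EPYC|
|Cores|| || || || ||4 dies x 8 cores|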
|Max. number of sockets||2||1||8||4||2|
|Base Frequency||2.2 GHz||2.2 GHz||2.2 GHz||2.4 GHz||2.2 GHz|
|Turbo Frequency||2.5 GHz||2.6 GHz||3.8 GHz||3.7 GHz||3.2 GHz|
|L3 Cache||32 MB||60 MB||38.5 MB||27.5 MB||8x8 MB|
|PCIe 3.0 lanes||56||32||48||48||128|
Astute readers will quickly remark that Intel's top-of-the-line CPU is the Xeon Platinum 8180. However, that SKU, with its 205W TDP and $10k+ price tag, is not comparable at all to any CPU in the list. We are already going out on a limb by including the 8176, which we feel belongs in this list of maximum core/thread count SKUs. In fact, as we will see further on, Cavium positions the Cavium 9980 as "comparable" to the Xeon Platinum 8164, which is essentially the same part as the 8176 but with slightly lower clockspeeds.
However, in terms of performance per dollar, Cavium typically compares its flagship 9980 to the Intel Xeon Gold 6148, against which the pricing of Cavium's CPU is very aggressive. Many of Cavium's benchmarks claim that the fastest ThunderX2 is 30% to 40% ahead of the Xeon 6148, all while Cavium's offering comes in at $1300 less. That aggressive pricing might explain the increasingly persistent rumors that Qualcomm is not going to enter the server market after all.
When looking at the table above, you can already see some important differences between the contenders. Intel seems to have the most advanced core topology and the highest turbo clockspeed. Meanwhile Qualcomm has the best chances when it comes to performance per watt, and has already published some benchmarking data that underlines this advantage.
Similar to AMD's EPYC, Cavium's ThunderX2 is likely to shine in the "sparse matrix" HPC market. This is thanks to its 33% greater theoretical memory bandwidth and its high core/thread count. However, as we've seen in the case of AMD's design, EPYC's L3 cache is slow once you need data that is not in the local 8 MB cache slice. The ThunderX2, by comparison, is a lot more sophisticated, with a dual ring architecture that seems to be similar to the ring architecture of the Xeon v4 (Broadwell-EP). According to Cavium, this ring structure is able to offer up to 6 TB/s of bandwidth and is non-blocking.
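To make the 33% figure concrete, here is a minimal sketch of the usual peak-bandwidth arithmetic. It assumes the comparison is between an eight-channel and a six-channel memory controller, both running DDR4-2666 with a 64-bit (8-byte) data bus per channel; those channel counts and the memory speed are assumptions for illustration and are not stated in the text above.

```python
def peak_memory_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                              bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth of one socket in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed configurations (not taken from the article text).
eight_channel = peak_memory_bandwidth_gbs(channels=8, mega_transfers_per_s=2666)
six_channel = peak_memory_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)

# Relative advantage of the eight-channel design over the six-channel one.
advantage = eight_channel / six_channel - 1.0

print(f"8 channels: {eight_channel:.1f} GB/s")
print(f"6 channels: {six_channel:.1f} GB/s")
print(f"advantage:  {advantage:.0%}")
```

At equal memory speed the ratio depends only on the channel count, so 8/6 gives the one-third advantage regardless of the DDR4 speed chosen. These are theoretical peaks; sustained bandwidth will be lower.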
This ring architecture is connected to Cavium's Coherent Processor Interconnect (CCPI2, at the top of the picture), which runs at 600 Gb/sec. This interconnect links the two sockets/NUMA nodes. Also connected to the ring are the SoC's 56 PCIe 3.0 lanes, which Cavium allocates among 14 PCIe "controllers." These 14 controllers can, in turn, be bifurcated down to x4 or x1 as you can see below.
SR-IOV, which is important for I/O virtualization (Xen and KVM), is also supported.
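The interconnect and I/O figures quoted above can be put into more familiar units with some quick arithmetic. The sketch below converts the 600 Gb/sec CCPI2 rate into GB/s and works out the lanes per PCIe controller, assuming the 56 lanes are split evenly across the 14 controllers. The per-lane PCIe 3.0 rate (8 GT/s with 128b/130b encoding) is the standard figure for that generation and is not taken from the article.

```python
# Figures quoted in the article.
CCPI2_GBIT_PER_S = 600
PCIE_LANES = 56
PCIE_CONTROLLERS = 14

# Standard PCIe 3.0 signalling (assumption, not from the article):
# 8 GT/s per lane with 128b/130b line encoding.
PCIE3_GT_PER_S = 8
PCIE3_ENCODING_EFFICIENCY = 128 / 130

# Socket-to-socket interconnect bandwidth in gigabytes per second.
ccpi2_gbyte_per_s = CCPI2_GBIT_PER_S / 8

# Lanes per controller, assuming an even split.
lanes_per_controller = PCIE_LANES // PCIE_CONTROLLERS

# Usable bandwidth per lane and for all lanes, per direction, in GB/s.
lane_gbyte_per_s = PCIE3_GT_PER_S * PCIE3_ENCODING_EFFICIENCY / 8
total_pcie_gbyte_per_s = PCIE_LANES * lane_gbyte_per_s

print(f"CCPI2: {ccpi2_gbyte_per_s:.0f} GB/s")
print(f"Lanes per controller: x{lanes_per_controller}")
print(f"PCIe 3.0 total: {total_pcie_gbyte_per_s:.1f} GB/s per direction")
```

These are raw link rates before protocol overhead, so real-world throughput will be somewhat lower.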
Gunbuster - Wednesday, May 23, 2018
Because it's hard to explain the critical line of business software or database is having some unknown edge case issue because you thought look at me I'm so smart and saved 1% of the project cost using unproven low penetration hardware.
daanno2 - Wednesday, May 23, 2018
I'm guessing you've never dealt with expensive enterprise software before. They are mostly licensed per-core, so getting the absolute best performance per core, even if the CPU is 2-3x more expensive, is worth it. At the end of the day, the CPUs might be <5% of the total cost.
SirPerro - Wednesday, May 23, 2018
You can swallow a big risk if the benefit is 75% of the cost. Hey, it's definitely worth the try.
If your hardware makes up for 5% of the cost, saving a 3% of the total budget is not worth the risk of migration.
FunBunny2 - Thursday, May 24, 2018
"You can swallow a big risk if the benefit is 75% of the cost. Hey, it's definitely worth the try."
the EOL of today's machines, the amortization schedules must be draconian. only if a 'different' server pays off in dozens of months, not years, will it have chance. to the extent that enterprise software is a C/C++ and *nix codebase, porting won't be onerous. but, I'm willing to guess, even Oracle code isn't all that parallel, so throwing a truckload of teeny cpu at it won't necessarily work.
name99 - Thursday, May 24, 2018
The bigger problem here is the massive uncertainty around the meaning of the word "server" and thus the target for these new ARM CPUs.
Some people seem to think "server" means primarily boxes that run SAP or ORACLE, but I think it's clear that the ARM ecosystem has little interest in that, at least right now.
What's of much more interest is racks on racks of CPUs running commodity (LAMP) or homegrown software, ie data warehouses and HPC. I'm not even sure the Java benchmarks being run are of much interest to this market. The things that matter are the sorts of things Cloudflare was measuring when they tested Centriq -- memcached, nginx, transforming one type of data into another (compression/decompression, encrypt/decrypt, transcode,...) at massive throughput.
That's where I'd expect to see the big sales of the ARM "server" cores -- to Cloudflare, Baidu, Google, and so on.
Also now that Marvell is in the game, will be interesting to see the extent to which they pull this downward, into their traditional sorts of markets like infrastructure network and storage control (eg to go into network appliances and NAS boxes).
Ed469546 - Wednesday, June 13, 2018
Some of the commercial software you pay per core. Intel had the best single threaded performance, meaning lower license costs.
Interesting question is how the ThunderX2 cores are counted in this case: one core can run 4 threads.
andrewaggb - Wednesday, May 23, 2018
I wonder what workloads they are targeting? High throughput with poor single threaded results is somewhat limiting.
peevee - Wednesday, May 23, 2018
Web app servers. VM servers. Hadoop/Spark nodes. All benefit more from having more threads running in parallel instead of each request waiting or switching contexts.
If you are concerned about single-thread performance on 256-thread server (as 2-CPU server with this CPU will provide) AT ALL, you choose outrageously wrong hardware for the task to begin with. Go buy a 2-core i3. Practically the only test in this article which matters is Critical jOPS (assuming the used quality of service metric was configured realistically).
GeekyMcGeekface - Friday, May 25, 2018
I’m building a cluster now with a few hundred Raspberry Pi’s because scale up is expensive and stupid. By distributing across a pool of clusters, I can handle far more memory bandwidth and compute. Consider 100 Raspberry PIs have 400 64-bit cores and 100GB of RAM. Total cost $3500 + power, mounting and switches.
Running three clusters of those with Kubernetes, Couchbase and Azure Functions provides 1200 64-bit cores, about 100GB of extremely high performance storage, incredible failover and a map-reduce environment to die for.
Add some 64GB MicroSD cards and an object storage system to the cluster and there’s 12TB of cold storage (4TB when made redundant).
Pay a service fee to some sweatshop in the Eastern Block to do the labor intensive bits and you can build a massively parallel, almost impossible to crash, CI/CD friendly, multi-tenant, infinitely scalable PaaS... for less than the cost of the RAM for a single one of the servers here.
The only expensive bits in the design are the Netscalers.
Oh... and the power foot print is about the same as one of these servers.
I honestly have no idea what I would use a server like these in a new design for.
jospoortvliet - Wednesday, May 30, 2018
single-core performance with your pi's is considerably lower, as is inter-core bandwidth. If your tasks require little inter-process communication you're good but with highly interdependent compute it won't perform well. But for specific tasks, yes, it might be very cost effective.