Assessing IBM's POWER8, Part 2: Server Applications on OpenPOWER
by Johan De Gelas on September 15, 2016 8:01 AM EST
Future Visions, Cont: POWERed by NVIDIA
We have to check for ourselves of course, but IBM claims that compared to a dual K80 setup, a dual P100 gets a 2.07x speedup on the S822LC HPC. The same dual P100 on a fast Xeon with PCIe 3.0 only saw a 1.5x speedup. The benchmark used was a rather exotic Lattice QCD, or an approach to "solve quantum chromodynamics".
However, IBM reports that NVLink removes performance bottlenecks in:
- FFT (signal processing)
- STAC-A2 (risk analysis)
- CPMD (computational chemistry)
- Hash tables (used in many algorithms, security and big data)
- Spark
Those got our attention, as they are not some exotic niche HPC applications but widespread software components and frameworks used in both the HPC and data analytics worlds.
NVIDIA also claims that thanks to NVLink and the improved page migration engine capabilities, a new breed of GPU accelerated applications will be possible. The unified memory space (CUDA 6) introduced in Kepler was a huge step forward for the CUDA programmers: they no longer had to explicitly copy data from the CPU to the GPU. The Page Migration Engine would do that for them.
But the current system (Kepler and Maxwell) also had quite a few limitations. For example, the memory space where the CPU and GPU share data was limited to the size of the GPU memory (typically 8-16 GB). The P100 now gets 49-bit virtual addressing, which means CUDA programs can treat every available byte of RAM as one big virtual space. In the case of the newly launched S822LC, this means up to 1 TB of DRAM, and consequently 1 TB of memory space. Secondly, the whole virtual address space is coherent thanks to the new page fault mechanism: both the CPU and GPU can access the DRAM together. This requires OS support, and NVIDIA cooperated with the Linux community to make this happen.
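To put those sizes in perspective, here is a minimal sketch of the arithmetic. It assumes binary units (GiB and TiB) for simplicity and takes the upper end of the 8-16 GB range for the GPU memory; the variable names are ours and purely illustrative.

```python
# Size of a 49-bit virtual address space, compared with the memory sizes
# mentioned in the text. Using binary units (GiB/TiB) is an assumption.
GIB = 2**30
TIB = 2**40

virtual_space_bytes = 2**49      # 49-bit virtual addressing on the P100
gpu_memory_bytes = 16 * GIB      # upper end of the "8-16 GB" range
system_dram_bytes = 1 * TIB      # maximum DRAM quoted for the S822LC

virtual_space_tib = virtual_space_bytes // TIB
dram_vs_gpu = system_dram_bytes // gpu_memory_bytes

print(f"49-bit virtual address space: {virtual_space_tib} TiB")
print(f"1 TiB of DRAM is {dram_vs_gpu}x a 16 GiB GPU memory")
```

In other words, the address space itself leaves plenty of headroom; the practical limit is the amount of DRAM installed in the server.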
Of course, as the unified memory space gets larger, the amount of data to transfer back and forth gets larger too, and that is where NVLink and the extra memory bandwidth of the POWER8 have a large advantage. Remember that even the POWER8 with only 4 buffer chips delivered twice as much memory bandwidth as the best Xeons. The higher-end POWER8s have 8 buffer chips and, as a result, offer almost twice as much memory bandwidth as the 4-buffer-chip version.
NVLink, together with the beefy memory subsystem of the POWER8, ensures that CUDA applications using such a unified 1TB memory space can actually work well.
The POWER8 - all heatsinks - looks less hot-headed now that it has the company of 4 Tesla P100 GPUs...
The S822LC will cost less than $50000, and it offers a lot of FLOPS per dollar if you ask us. First consider that a single Tesla P100 SXM2 costs around $9500. The S822LC integrates four of them, two 10-core POWER8s and 256 GB of RAM. More than 21 TFLOPS (FP64) connected by the latest and greatest interconnects in a 2U box: the S822LC HPC is going to turn some heads.
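The FLOPS-per-dollar claim can be checked with a quick back-of-the-envelope calculation. The sketch below uses the prices quoted above; the per-GPU peak of 5.3 TFLOPS (FP64) is our assumption, taken from NVIDIA's published P100 SXM2 specification, and the CPUs' own contribution is ignored.

```python
# Back-of-the-envelope FLOPS-per-dollar estimate for the S822LC HPC.
# Prices and GPU count come from the article; the per-GPU FP64 peak is an
# assumption based on NVIDIA's published P100 SXM2 specification.
GPU_COUNT = 4
GPU_PRICE_USD = 9_500            # "around $9500" per Tesla P100 SXM2
SYSTEM_PRICE_USD = 50_000        # upper bound: "less than $50000"
GPU_FP64_TFLOPS = 5.3            # assumed theoretical peak, not measured

gpu_total_usd = GPU_COUNT * GPU_PRICE_USD
gpu_share = gpu_total_usd / SYSTEM_PRICE_USD
total_tflops = GPU_COUNT * GPU_FP64_TFLOPS
usd_per_tflops = SYSTEM_PRICE_USD / total_tflops

print(f"GPUs alone: ${gpu_total_usd:,} ({gpu_share:.0%} of the price ceiling)")
print(f"Peak FP64:  {total_tflops:.1f} TFLOPS")
print(f"At most about ${usd_per_tflops:,.0f} per peak FP64 TFLOPS")
```

Under these assumptions, the four GPUs bought separately would already account for roughly three quarters of the system's price ceiling, which is why the complete box looks like good value.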
Last but not least, note that once you add two or more GPUs which consume 300W each, the biggest disadvantage of the POWER8 almost literally melts away. The fact that each POWER8 CPU may consume 45-100W more than the high performance Xeons seems all of a sudden relative and not such a deal breaker anymore. Especially in the HPC world, where performance is more important than Watts.
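A rough sketch of that comparison, using only the wattages quoted above. Treating all of them as simultaneous peak draws in a two-socket server is a simplifying assumption, and the helper name is ours.

```python
# How much the POWER8's extra power draw matters once GPUs are installed.
# All wattages are the article's figures; treating them as simultaneous
# peak draws is a simplifying assumption.
CPU_COUNT = 2
EXTRA_W_PER_CPU = (45, 100)      # POWER8 vs. high-performance Xeon, low/high
GPU_W = 300

def extra_cpu_share(gpu_count):
    """Return (low, high) extra CPU watts as a fraction of total GPU watts."""
    gpu_total_w = gpu_count * GPU_W
    return tuple(CPU_COUNT * w / gpu_total_w for w in EXTRA_W_PER_CPU)

for gpus in (2, 4):
    low, high = extra_cpu_share(gpus)
    print(f"{gpus} GPUs: extra CPU power is {low:.1%} to {high:.1%} of GPU power")
```

With four P100s installed, the extra CPU power works out to a modest fraction of what the GPUs alone consume.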
49 Comments
loa - Monday, September 19, 2016 - link
This article neglects one important aspect of costs: per-core licensed software.
Those licenses can easily be north of 10 000 $. PER CORE. For some special purpose software the license cost can be over 100 000 $ / core. Yes, per core. It sounds ridiculous, but it's true.
So if your 10-core IBM system has the same performance as a 14-core Intel system, and your license cost is 10 000$ / core, well, then you just saved yourself 40 000 $ by using the IBM processor.
Even with lower license fee / core, the cost advantage can be significant, easily outweighing the additional electricity bill over the lifetime of the server.
aryonoco - Tuesday, September 20, 2016 - link
Thanks Johan for another very interesting article.
As I have said before, there is literally nothing on the web that compares with your work. You are one of a kind!
Looking forward to POWER 9. Should be very interesting.
HellStew - Tuesday, September 20, 2016 - link
Good article as usual. Thanks Johan.
I'd still love to see some VM benchmarks!
cdimauro - Wednesday, September 21, 2016 - link
I don't know how much value could have the performed tests, because they don't reflect what happens in the real world. In the real world you don't use an old o.s. version and an old compiler for an x86/x64 platform, only because the POWER platform has problems with the newer ones. And a company which spends so much money in setting up its systems, can also spend just a fraction and buy an Intel compiler to squeeze out the maximum performance.
IMO you should perform the tests with the best environment(s) which is available for a specific platform.
JohanAnandtech - Sunday, September 25, 2016 - link
I missed your reaction, but we discussed this in the first part. Using Intel's compiler is good practice in HPC, but it is not common at all in the rest of the server market. And I do not see what an Intel compiler can do when you install mysql or run java based applications. Nobody is running recompiled databases or most other server software.
cdimauro - Sunday, October 2, 2016 - link
Then why you haven't used the latest available distro (and compiler) for x86? It's the one which people usually use when installing a brand new system.
nils_ - Monday, September 26, 2016 - link
This seems rather disappointing, and with regards to optimized Postgres and MariaDB, I think in that case one should also build these software packages optimized for Xeon Broadwell.
jesperfrimann - Thursday, September 29, 2016 - link
@nils_
Optimized for.. simply means that the software has been officially ported to POWER, and yes that would normally include that the specific accelerators that are inside the POWER architecture now are actually used by the software, and this usually means changing the code a bit.
So .. to put it in other words .. just like it is with Intel x86 Xeons.
// Jesper
alpha754293 - Monday, October 3, 2016 - link
I look forward to your HPC benchmarks if/when they become available.