Micro CPU Benchmarks: Isolating the FPU

Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. Indeed, the --noaltivec flag and the comment in the gcc 3.3 compiler docs that AltiVec was enabled by default made me believe that some AltiVec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops works in double precision, which AltiVec does not handle (it is limited to single-precision floating point), and gcc 3.3 did not support auto-vectorisation anyway.

As I wrote in the article, we used -O2 and then tried a bucket load of other options like -ffast-math -mtune=G5, but it didn't make any significant difference.

Again, note that benchmarking with flops is not real-world, but it isolates FPU power. Flops shows the maximum double-precision power that the core has by making sure that the program fits in the L1 cache. Flops consists of 8 tests, and each test has a different but well-known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron-based PCs:
gcc -O2 -march=k8 flops.c -o flops
And, on the G5 machines, we used:
gcc -O2 -mcpu=G5 flops.c -o flops
The command "gcc - version" gave this output "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."

Let us check out the results:

Flops results, in MFLOPS (MOD = flops module number; FADD/FSUB/FMUL/FDIV = instruction mix):

MOD  FADD  FSUB  FMUL  FDIV   PowerMac G5  PowerMac G5  PowerMac G5  Opteron 850  Opteron 850
                              2.7 GHz      2.7 GHz      2.5 GHz      2.4 GHz      2.4 GHz
                              gcc 4.0      gcc 3.3      gcc 3.3      gcc 3.3.3    gcc 4.0
 1    50%    0%   43%    7%       1158         1104         1026         1404         1319
 2    43%   29%   14%   14%        607          665          618          844          695
 3    35%   12%   53%    0%       3047         2890         2677         1955         1866
 4    47%    0%   53%    0%       1583          522          486         1856         1850
 5    45%    0%   52%    3%       1418          675          628         1831         1362
 6    45%    0%   55%    0%       2163          915          851         1922         1698
 7    25%   25%   25%   25%        546          284          265          562          502
 8    43%    0%   57%    0%       2020          925          860         1989         1703
Average:                          1568          998          926         1545         1374

As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 is that it outputs very poorly scheduled code for PowerPC CPUs. As a result, gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (fused multiply-add) instructions. This kind of instruction performs a 64-bit, double-precision floating-point multiply of the operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision floating-point operand in FPR "FRC", and then adds the result of that operation to the 64-bit, double-precision floating-point operand in FPR "FRB". Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. As you can see, gcc 4.0 is a lot better at using these capabilities.
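
To illustrate what FMADD buys you, here is a small sketch in C. The daxpy routine below is the textbook multiply-accumulate example, not code taken from flops: each iteration does one multiply and one dependent add, which a PowerPC compiler can emit as a single fmadd instruction. Whether the surrounding code is arranged so that such opportunities appear, and the G5's two FP units are kept busy, is exactly where the scheduling difference between gcc 3.3 and gcc 4.0 shows up.

/* Textbook multiply-accumulate loop (a hypothetical example, not part of
   flops.c): y[i] = alpha * x[i] + y[i]. On the PowerPC 970 the multiply
   and add can be fused into a single fmadd (FRT <- FRA * FRC + FRB). */
void daxpy(int n, double alpha, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];   /* one multiply + one add per element */
    }
}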

It is a bit disappointing that gcc 4.0 lowers the Opteron's performance compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. Using IBM's compiler for the G5 and Intel's compiler on the Opteron would yield higher results for both platforms, but we wanted a comparison with exactly the same compiler technology.

47 Comments

  • stmok - Thursday, September 1, 2005 - link

    LOL... As every day passes, it seems more "interesting things" are revealed about Apple solutions.
  • ViRGE - Thursday, September 1, 2005 - link

    Granted, some of this was over my head (more than I'd like to admit to), but your results are nonetheless very interesting, Johan. Now that we have the Linux/G5 numbers, there's no arguing that there's a weakness in MacOS X somewhere, which is a bit depressing as a Mac user, but still a very useful insight as to how there's obviously something very broken in some design aspect of the OS (it simply shouldn't be getting crushed like it is). My only question now is how Apple and its devs will respond to this - it is pretty damning after all.

    Thanks for finally getting some Linux/G5 numbers out to settle this.
  • sdf - Friday, September 2, 2005 - link

    By changing hardware platforms.

    No, seriously.

    A transition from PowerPC to Intel would be the perfect time to correct ABI flaws like this. It isn't that the G5 causes the slowdown; it's that the slowdown (maybe) can't really be fixed without breaking binary compatibility. A CPU transition is clearly going to do that anyway, so maybe they'll just wait...
  • toelovell - Thursday, September 1, 2005 - link

    I am kind of curious to see how Darwin would work on an x86 based system for these same tests. There are x86 binaries for Darwin 8. So it should be possible to run these tests and compare Darwin with Linux on an x86 platform. This would help to see if the OS really is the limitation. Just a thought.
  • JohanAnandtech - Thursday, September 1, 2005 - link

    If Linux is capable of pushing the G5 8 times higher than with Mac OS X, there is little doubt in my mind that the OS is the problem. Or did I understand you wrong?

    Anyway, I have no experience whatsoever with Darwin. My first impression is that installing Darwin on x86 is probably a very masochistic experience, due to the lack of proper drivers. We might get it working, but can it really run MySQL and other apps? There are probably libraries missing... Will the results be representative of anything, as it is probably tuned for just getting it running instead of performance? Anyone with Darwin x86 experience?
  • wjcott - Thursday, September 1, 2005 - link

    The only interest I have in a mac OS is if they are going to sell it without a computer. I would love to have OS X, but I must build the machine.
  • Quanticles - Thursday, September 1, 2005 - link

    Every component must be fine-tuned to the utmost degree... Every BIOS Setting... Every Hidden Register... *crazy eyes* =)
