No More Apple Mysteries, Part Two


by Johan De Gelas on September 1, 2005 12:05 AM EST

Posted in
Mac


Micro CPU Benchmarks: Isolating the FPU

Although it was hardly the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. The --noaltivec flag, together with the comment in the gcc 3.3 compiler docs that Altivec was enabled by default, made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision, which Altivec does not handle, and gcc 3.3 did not support vectorisation.

As I wrote in the article, we used -O2 and then tried a bucketload of other options such as -ffast-math and -mtune=G5, but none of them made any significant difference.

Again, note that flops is not a real-world benchmark, but it does isolate the performance of the FPU. Flops shows the maximum double-precision throughput of the core by making sure that the program fits in the L1 cache. Flops consists of 8 tests, and each test has a different but well-known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron-based PCs:

gcc -O2 -march=k8 flops.c -o flops

And, on the G5 machines, we used:

gcc -O2 -mcpu=G5 flops.c -o flops

The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."

Let us check out the results:

Test	FADD	FSUB	FMUL	FDIV	Powermac G5 2.7 GHz gcc 4.0	Powermac G5 2.7 GHz gcc 3.3	Powermac G5 2.5 GHz gcc 3.3	Opteron 850 2.4 GHz gcc 3.3.3	Opteron 850 2.4 GHz gcc 4.0
1	50%	0%	43%	7%	1158	1104	1026	1404	1319
2	43%	29%	14%	14%	607	665	618	844	695
3	35%	12%	53%	0%	3047	2890	2677	1955	1866
4	47%	0%	53%	0%	1583	522	486	1856	1850
5	45%	0%	52%	3%	1418	675	628	1831	1362
6	45%	0%	55%	0%	2163	915	851	1922	1698
7	25%	25%	25%	25%	546	284	265	562	502
8	43%	0%	57%	0%	2020	925	860	1989	1703
Average:					1568	998	926	1545	1374

As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 on PowerPC is that it generates very poorly scheduled code for these CPUs. As a result, gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (fused multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision floating-point operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision operand in FPR "FRC", and then adds the 64-bit, double-precision operand in FPR "FRB" to the product. Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. As the results show, gcc 4.0 is a lot better at using these capabilities.

It is a bit disappointing that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. IBM's compiler for the G5 and Intel's compiler on the Opteron would produce higher results on both platforms, but we wanted a comparison with exactly the same compiler technology.

47 Comments

View All Comments

JohanAnandtech - Friday, September 2, 2005 - link
Sorry, couldn't resist :-). (For the rest of the world, pannekoek is Dutch for pancake.)

Desktop performance is ok, as desktop apps are similar to the workstation apps we tested in the first article. Those apps spend from 5-20% in the OS, while server apps spend up to 80% of their time in the OS!

However, I should point out that we tested Mac OS X SERVER, so it is a problem for the Xserves.
Pannenkoek - Friday, September 2, 2005 - link
I stand corrected then. However, my reasoning still applies, it's just that Apple relies even more on its brand than on technology to sell server systems apparently. Who runs Mac OS servers anyway, it's an oxymoron. ;-)

P.S. Do not mock my nick, it served well in beating godlike UT bots, and should be honoured as much as Loque.
Tanclearas - Thursday, September 1, 2005 - link
"Apple told us that the problem lies in the Apachebench (the client side), which stalls from time to time and thus, generates too low of a load on the (Apache) server."

How does this explanation make any sense? Linux obviously doesn't have a problem with these "stalls".
JohanAnandtech - Friday, September 2, 2005 - link
What follows is not what Apple said, but my interpretation...

They are probably pointing out that the version for Mac OS X has a Mac OS X specific bug. Of course, who is to blame? I am sceptical like you.
mariush - Thursday, September 1, 2005 - link
Page 4 :

We used the following on the Opteron based PCs:

Gcc -O2 -mcpu=G5 flops.c -o flops

And, on the G5 machines, we used:

Gcc -O2 -march=k8 flops.c -o flops

I think it's the other way around.
Houdani - Thursday, September 1, 2005 - link
Aye, was gonna point that out also.

In addition, on page 3 should you list the Yellow Dog Linux along with OSX in the Software section for the Apple PowerMac G5?
Shinei - Thursday, September 1, 2005 - link
My question is, would the memory latencies be so high for the 970FX if high-end RAM was used for the Linux tests (like, say, some TCCD or BH-5 at 2-2-2-5), instead of the standard 3-3-3-8 SPD that ships with the G5 system? Or is there some limitation to the G5 motherboard that prevents posting with performance RAM as a way for Apple to ensure that only certain, accepted DIMMs are used with their computers?
Anyway, these results are very telling about what the OSX86 Macs are going to perform like--that is to say, ~25% slower than the equivalent Windows/Linux boxes running the same hardware...
IntelUser2000 - Sunday, September 4, 2005 - link

quote:
My question is, would the memory latencies be so high for the 970FX if high-end RAM was used for the Linux tests (like, say, some TCCD or BH-5 at 2-2-2-5), instead of the standard 3-3-3-8 SPD that ships with the G5 system? Or is there some limitation to the G5 motherboard that prevents posting with performance RAM as a way for Apple to ensure that only certain, accepted DIMMs are used with their computers?

That doesn't matter since they are testing workstations; Irwindale and Opteron are also using CAS3 RAM. No workstations/servers use 2-2-2-5 RAM.

The poor scores of OS X compared to Linux make sense. G5 was rumored to be fast in SPEC CPU benchmarks but came out to be slower. Must be that rumor systems were benched with Linux and the production was benched with OSX.

I am impressed with OS X's features though.
Jedi2155 - Thursday, September 1, 2005 - link
The G5 motherboard has that limitation; it is Apple's way to ensure you only buy certified RAM. The SPD settings must be perfect.
ceefka - Thursday, September 1, 2005 - link
I am humbled by the sheer expertise of Johan. Amazing work, Johan!

This makes me even more curious about Intel's contribution to the next generation of Macs. How will they compare to the best G5s?
