No More Apple Mysteries, Part Two

Author: Johan De Gelas

by Johan De Gelas on September 1, 2005 12:05 AM EST

Micro CPU Benchmarks: Isolating the FPU

Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. Indeed, the --noaltivec flag and the comment that Altivec was enabled by default in the gcc 3.3 compiler docs made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision and gcc 3.3 did not support vectorisation.

As I wrote in the article, we used -O2 and then tried a bucket load of other options like --fast-math --mtune=G5, but it didn't make any significant difference.

Again, note that benchmarking with flops is not a real-world test, but it isolates the FPU power. Flops shows the maximum double precision power that the core has by making sure that the program fits in the L1-cache. Flops consists of 8 tests, and each test has a different but well known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron-based PCs:

gcc -O2 -march=k8 flops.c -o flops

And, on the G5 machines, we used:

gcc -O2 -mcpu=G5 flops.c -o flops

The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."
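
To make the idea of a cache-resident FPU test concrete, here is a minimal sketch in C of the kind of kernel that such a benchmark times. It is not taken from flops.c: the function names (midpoint_pi, mflops, time_kernel) and the choice of integrand are our own assumptions, picked only because the loop works on a handful of scalars and so should never leave the registers and the L1 cache.

```c
#include <time.h>

/* Hypothetical kernel (not from flops.c): midpoint-rule integration of
   4/(1+x^2) over [0,1], which converges to pi. Each iteration performs
   3 additions, 1 multiplication and 1 division on scalars only. */
double midpoint_pi(long n)
{
    double dx = 1.0 / (double)n;
    double x = 0.5 * dx;
    double sum = 0.0;
    long i;

    for (i = 0; i < n; i++) {
        sum += 4.0 / (1.0 + x * x);  /* 1 multiply, 2 adds, 1 divide */
        x += dx;                     /* 1 add */
    }
    return sum * dx;
}

/* Convert a floating-point operation count and a run time to MFLOPS. */
double mflops(double flop_count, double seconds)
{
    return flop_count / seconds / 1.0e6;
}

/* Time one run with clock() and return the CPU seconds used.
   The computed integral is stored in *result so that the compiler
   cannot discard the loop. */
double time_kernel(long n, double *result)
{
    clock_t start = clock();
    *result = midpoint_pi(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_kernel(n, &r) and passing 5.0 * n together with the returned seconds to mflops gives a score in the same spirit as the figures in this article. Note that clock() is coarse, so n has to be large enough for the run to last a measurable amount of time.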

Let us check out the results:

MOD	FADD	FSUB	FMUL	FDIV	Powermac G5 2.7 GHz gcc 4.0	Powermac G5 2.7 GHz gcc 3.3	Powermac G5 2.5 GHz gcc 3.3	Opteron 850 2.4 GHz gcc 3.3.3	Opteron 850 2.4 GHz gcc 4.0
1	50%	0%	43%	7%	1158	1104	1026	1404	1319
2	43%	29%	14%	14%	607	665	618	844	695
3	35%	12%	53%	0%	3047	2890	2677	1955	1866
4	47%	0%	53%	0%	1583	522	486	1856	1850
5	45%	0%	52%	3%	1418	675	628	1831	1362
6	45%	0%	55%	0%	2163	915	851	1922	1698
7	25%	25%	25%	25%	546	284	265	562	502
8	43%	0%	57%	0%	2020	925	860	1989	1703
Average:					1568	998	926	1545	1374
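
The "Average" row is a plain arithmetic mean of the eight test scores in each column. As a quick check, the sketch below recomputes it for the two Powermac G5 2.7 GHz columns and derives the overall gcc 4.0 to gcc 3.3 ratio. The array and function names are our own; only the numbers come from the table.

```c
/* Scores copied from the table: Powermac G5 2.7 GHz, tests 1 to 8. */
static const double g5_gcc40[8] = {1158, 607, 3047, 1583, 1418, 2163, 546, 2020};
static const double g5_gcc33[8] = {1104, 665, 2890, 522, 675, 915, 284, 925};

/* Arithmetic mean of n scores. */
double column_mean(const double *scores, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += scores[i];
    return sum / n;
}

/* Ratio of the two column averages (hypothetical helper). */
double gcc40_over_gcc33(void)
{
    return column_mean(g5_gcc40, 8) / column_mean(g5_gcc33, 8);
}
```

column_mean(g5_gcc40, 8) gives 1567.75 and column_mean(g5_gcc33, 8) gives 997.5, which round to the 1568 and 998 shown in the table. The ratio of the averages is roughly 1.57, although tests 4 to 8 individually improve by a factor of about two to three.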

As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 is that it generates very poorly scheduled code for PowerPC CPUs. The result is that gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (floating multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision floating-point operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision floating-point operand in FPR "FRC", and then adds the 64-bit, double-precision floating-point operand in FPR "FRB" to the product. Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. As you can see, gcc 4.0 is a lot better at using these capabilities.
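
As an illustration of code that "allows it", consider polynomial evaluation by Horner's rule: every step is a multiplication immediately followed by an addition, which is exactly the FRA * FRC + FRB shape. The sketch below is our own example, not code from flops. Whether a given compiler and set of flags actually emits FMADD for it is an assumption that should be checked in the generated assembly.

```c
/* Evaluate c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1] by Horner's rule.
   Each loop step has the form acc * x + c[i]: one multiply feeding one
   add, which a compiler targeting the G5 may combine into a single FMADD. */
double horner(const double *c, int n, double x)
{
    double acc = 0.0;
    int i;

    for (i = 0; i < n; i++)
        acc = acc * x + c[i];
    return acc;
}
```

For example, with the coefficients {1, 2, 3} and x = 2, the function returns 11 (x^2 + 2x + 3) in three multiply-add steps, each of which could map to one FMADD instead of a separate FMUL and FADD.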

It is a bit disappointing that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. IBM's compiler for the G5 and Intel's compiler on the Opteron would produce higher results on both platforms, but we wanted a comparison with exactly the same compiler technology.

47 Comments

Gandalf90125 - Friday, September 2, 2005 - link
From the article:

"...so it seems that IBM, although slightly late, could have provided everything that Apple needs."

I'd say not everything Apple needs as I suspect the switch to Intel was driven more by marketing than any technical aspect of the IBM vs. the Intel chips.
Illissius - Friday, September 2, 2005 - link
A few notes:

- you mention trying a --fast-math option, which I've never heard of... presumably this was a typo for -ffast-math?

- when I tried using -mcpu (which you say you used for YDL) on GCC 3.4, it told me the option had been deprecated, and -mtune has to be used instead (dunno whether it told me this latter part itself or I read it somewhere else, but it's true). I'm not sure whether GCC4 has the same behaviour (I'd think so), whether it still has the intended effect despite the warning, or whether it matters at all.

- was there a reason for using -march on one, and -mcpu/-mtune on the other? (the difference is that -mcpu/-mtune optimize the code for that processor as much as possible while still keeping the code compatible with everything else in the architecture, while -march does the same without care for compatibility -- on x86 at least, not sure whether it's the same on PPC)

- you mention using the same compiler because, err, you wanted to use the same compiler... if this was done in the hopes of it generating code of similar speed for each architecture, though, then your own results show there isn't much point -- seems GCC, 3.3 at least, is much better at generating x86 code than PPC (which isn't surprising, given much more work probably went into it due to the larger userbase). Not saying it was a bad idea to use GCC on both platforms (it's a good one, if for no other reason than most code, on the Linux side at least and OSX itself (not sure about the apps) are compiled with it), just that if the above was the reason, it wasn't a very good one ;).

- Continuing the above, I was a bit surprised at the, *ahem*, noticeable difference in speed between not even two different compilers, but two versions of the same. (I was expecting something like 1-5, maybe 10% difference, not 100). Maybe this could warrant yet another followup article, this time on compilers? :)
Pannenkoek - Friday, September 2, 2005 - link
The reason is that GCC 4.0 incorporated infrastructure for vector optimization (tree-ssa), which can give, especially in synthetic benchmarks, a huge increase in FP performance. GCC can now finally optimize for SSE, Altivec, etc., a reason why in the past optimizing specifically for newer Pentiums did not yield much improvement.

Although compiler benchmarks would be interesting, I doubt it is a task for Anandtech. Normal desktop users do not have to worry about whether or not their applications are optimized optimally, and any differences between, say GCC and ICC, are small or negligible for ordinary desktop programs. (Multimedia programs often have inline assembly for performance critical parts anyway).

GCC is free, supports about any platform and improves continually while it's already a first class compiler.
javaxman - Friday, September 2, 2005 - link
While I generally love this article, I have to wonder...
why not write a simple benchmark for pthread(), if you think that's the bottleneck? Surely it'd be a simple thing to write a page of code which creates a bunch of threads in a loop, then issues a thread count and/or timestamp. It seems like a blindingly obvious test to run. Please run it.

I have to say that I *do* think pthread() is the likely bottleneck, possibly due to BSD4.9-derivative code, but why not test that if we think that's the problem? I understand wanting to see real-world MySQL performance, but if you're trying to find a system-level bottleneck, that's not the right type of testing to do...

Now that I mention it, Darwinx86 vs. BSD 4.9 ( on the same system ) vs. BSD 5.x ( on the same system ) vs. Linux ( on the same system ) would really be a more interesting test at this point... I'm really not caring about PPC these days unless it's an IBM blade system, to be honest... testing an Apple PPC almost seems silly, they'll be gone before too long... Apple's decision to move away from PPC has more to do with *future* chip development than *current* offerings, anyway... Intel and AMD are just putting more R&D into their x86 chips, IBM's not matching it, and Apple knows it...

but even if you are going to look at PPC systems, if you're trying to find a system-level bottleneck, write and run system-level tests... a pthread() test is what is needed here.
rhavenn - Friday, September 2, 2005 - link
If I remember correctly, OS X is forked off of the FreeBSD 4.9 codebase. The 4.x series of BSD always had a crappy threading system and didn't handle threaded apps well at all. I doubt Apple really touched those internals all that much.

FreeBSD 5.x has a much better time of it. I'm wondering if the switch back to an Intel platform will make it easier for Apple to integrate the BSD 5.x codebase into their OS? Or even if they plan on using the BSD 6.x codebase for a future release? The threading models have vastly improved.

Just a thought :)
JohanAnandtech - Friday, September 2, 2005 - link
http://www.apple.com/education/hed/compsci/tiger.h... :

"FreeBSD 5.0
The upgraded kernel in Tiger, based on mach and FreeBSD, provides optimized resource locking for better scalability across multiple processors, support for 64-bit memory pointers through the System library and standards-based access control lists"

Where did you see FreeBSD 4.9?
mbe - Friday, September 2, 2005 - link
Readers also pointed out that LMBench uses "fork", which is the way to create a process and not threads in all Unix variants, including Mac OS X and Linux. I fully agree, but does this mean that the benchmark tells us nothing about the way that the OS handles threading? The relation between a low number in this particular Lmbench benchmark and a slow creating of threads may or may not be the answer, but it does give us some indication of a performance issue. Allow me to explain...

This misses the point: your claim in the last article was that MacOS X used userspace threads. Mentioning that LMbench uses processes still rules out userspace threads having any part to play. This is because processes can't in any meaningful way (short of violating some pretty basic principles) be implemented around userspace threads. The point is that a process is a virtual memory space attached to a main system thread, not to a userspace thread; userspace threads are not normally even considered threads at this level.

This is necessary since the virtual memory attached to the thread has to be managed when doing context switches, and by its very definition userspace code cannot directly touch the memory mappings.
JohanAnandtech - Friday, September 2, 2005 - link
Yes, it could be. The interesting questions are:
- Is it the only culprit for the 8 times lower performance? Microkernels are reported to be 66 to 5% slower depending on who benchmarked it. But not 8 times slower.
- What makes it still interesting for the apple devs to use it?

I hope Apple will be a bit more keen to defend their product, because there might be interesting technical reasons to keep the Mach kernel.
sdf - Friday, September 2, 2005 - link
Is Mac OS X really a microkernel? I understood it to be designed to function as a microkernel, but compiled and shipped as a macrokernel for performance reasons.
JohanAnandtech - Sunday, September 4, 2005 - link
I am sorry if I wasn't clear. As I state in the article clearly: Mac OS X is ** NOT ** a microkernel, but based on a microkernel as the Mach kernel is buried inside the FreeBSD monolithic kernel.

Most of the tasks are done by a FreeBSD-like kernel, but threading is done by the Mach kernel.
