No More Apple Mysteries, Part Two

Author: Johan De Gelas

by Johan De Gelas on September 1, 2005 12:05 AM EST

Posted in
Mac

47 Comments

Micro CPU Benchmarks: Isolating the FPU

Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. The --noaltivec flag and the comment in the gcc 3.3 compiler docs that Altivec was enabled by default made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision, which Altivec does not handle, and gcc 3.3 did not support vectorisation.

As I wrote in the article, we used -O2 and then tried a bucket load of other options like -ffast-math and -mtune=G5, but it didn't make any significant difference.

Again, note that benchmarking with flops is not real world, but it isolates the FPU power. Flops shows the maximum double-precision throughput of the core by making sure that the program fits in the L1 cache. Flops consists of 8 tests, and each test has a different but well-known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron-based PCs:

gcc -O2 -march=k8 flops.c -o flops

And, on the G5 machines, we used:

gcc -O2 -mcpu=G5 flops.c -o flops

The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."

Let us check out the results (flops reports its scores in MFLOPS, so higher is better):

MOD	FADD	FSUB	FMUL	FDIV	Powermac G5 2.7 GHz gcc 4.0	Powermac G5 2.7 GHz gcc 3.3	Powermac G5 2.5 GHz gcc 3.3	Opteron 850 2.4 GHz gcc 3.3.3	Opteron 850 2.4 GHz gcc 4.0
1	50%	0%	43%	7%	1158	1104	1026	1404	1319
2	43%	29%	14%	14%	607	665	618	844	695
3	35%	12%	53%	0%	3047	2890	2677	1955	1866
4	47%	0%	53%	0%	1583	522	486	1856	1850
5	45%	0%	52%	3%	1418	675	628	1831	1362
6	45%	0%	55%	0%	2163	915	851	1922	1698
7	25%	25%	25%	25%	546	284	265	562	502
8	43%	0%	57%	0%	2020	925	860	1989	1703
Average:					1568	998	926	1545	1374

As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 is that it generates very poorly scheduled code for PowerPC CPUs. The result is that gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (fused multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision floating-point operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision operand in FPR "FRC", and then adds the 64-bit, double-precision operand in FPR "FRB" to the product. Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. gcc 4.0 is a lot better at using these capabilities, as you can see in tests 4 through 8, where the G5's scores roughly double or triple.

A bit disappointing is the fact that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. With IBM's compiler on the G5 and Intel's compiler on the Opteron, results would be higher on both platforms, but we wanted a comparison using exactly the same compiler technology.


47 Comments

Lori - Friday, September 2, 2005 - link
http://en.wikipedia.org/wiki/Microkernel

MacOS X uses a modified microkernel (a monolithic / microkernel hybrid). The idea was to cut down IPC costs by putting servers that would be IPC heavy directly into the kernel. However, there has recently been a lot of work in the microkernel world to reduce this IPC cost and bring its speed near that of a monolithic kernel.

L4Ka::Pistachio is an example of this:
http://www.l4ka.org/
leviat - Thursday, September 1, 2005 - link
If the problem is indeed in the thread creation portion of the OS, it would be interesting to see how a single threaded webserver fares. I would love to see a benchmark test of Lighttpd (www.lighttpd.org) to see a comparison of how that runs on Darwin vs linux-ppc.

Another interesting test would be to see if MySQL can be configured to precreate the handler threads. This might allow us to see how it handles the context-switching between the multiple threads and allow for it to compete.

Anyways, great article!
JohanAnandtech - Friday, September 2, 2005 - link
What exactly do you mean by single threaded? Because Apache 1.3 works with processes, and is thus single-threaded per user.

MySQL can make use of a thread cache; we played with it, but it didn't give any substantial boost. I don't see how the software would be able to precreate all threads, as it has to close down and open connections. If you have some insight, please share :-).

Context switching is quite fast on the G5 under OS X, give or take a few percent compared to Linux x86 or G5 Linux, as we tested with lmbench.
Lori - Friday, September 2, 2005 - link
Actually there is more than one way to handle multiple connections in a server application.

To give you some examples...

1. Multi process
2. Multi thread
3. Some hybrid of the two

You can see combinations of these types all provided by Apache 2's MPMs. (perchild, prefork, threadpool, worker, leader.. etc)

4. Asynchronous multiplexing.

Your program becomes its own scheduler. You can do all your processing within a single thread. Also read up on non-blocking I/O. I am actually surprised Apache does not have an MPM to handle this type of connection multiplexing, but I also read it's harder to get OS support.

Letsee... links... umm... ahh...:

http://www.kegel.com/c10k.html
Avalon - Thursday, September 1, 2005 - link
Seems like once you remove the G5 from OSX, it's a very capable chip.
jamawass - Thursday, September 1, 2005 - link
Great article. In response to the previous post: Anand has posted tons of server articles on x86 systems, so Apple is fair game here. Secondly, Apple servers are based on OS X in the market; corporations want to know the real world performance, not the desktop feel. Also, Johan's speculation on Apple's move to Intel raises some troubling questions for Apple execs.
karlreading - Thursday, September 1, 2005 - link
a lot of people commenting on how apple have made a wrong decision turning to intel.
possibly, but IMHO, and, if im not mistaken, didnt the opteron dominate all the tests?
so in my mind whilst its fair for people to doubt apple for going intel, x86 on the whole is still a very viable option if you go the AMD route.
yes i know people will say AMD dont have the capacity, but amd powered macs should be how x86 macs are done.
karlos
karlreading - Thursday, September 1, 2005 - link
also worth noting is that they say the FP performance is as good as the fastest x86 chip. well scuse me, but isnt that a 2.7ghz g5 part theyre testing there? thats the fastest g5 currently available isnt it? well then why not test the opteron 254? thats the fastest x86 chip, running 2.8ghz, rather than the 850/250 2.4ghz part tested. that would put some lead against the g5 and also, 2.8ghz is a lot closer than 2.4ghz is to the 2.7ghz g5's core speed. if were trying to be fair.
if we were being really picky we would be stating dual core opteron as the fastest x86, but i digress....
karlos
JohanAnandtech - Friday, September 2, 2005 - link
You are right about the recently introduced 2.8 GHz Opteron. Well, to be really accurate, at the time of the introduction of the 2.7 GHz G5, a 2.6 GHz Opteron was available.

Anyway, it was not my intention to be "accurate"; it was more a general impression. Give or take a few percent, the G5 can compete FP wise :-).
Pannenkoek - Thursday, September 1, 2005 - link
I would think the reason for the bad performance is a matter of scalability and SMP support, and not so much of how fast some system calls are executed. Linux is the most used OS for superclusters these days; Mac OS remains a desktop OS. It's no wonder that it performs poorly as a serious server on a multiprocessor/core system. It would have been interesting to see how Windows would have fared (on the x86 of course), if we are testing OSes in this way.

However, MySQL benchmarks say little about desktop performance. Anandtech's audience consists of desktop users, and the reason people love or hate Mac OS is its desktop. Nevertheless, almost a great article. It would have been one if the author could have resisted the temptation of too much speculation and stuck to honest benchmark numbers.
