Threads & Performance

"Threads" is a popular discussed subject. Therefore, we like to give a small introduction to those of you who are not familiar with threads. To understand threads, you first must understand processes. Any decent OS controls the memory allocation to the different programs or processes. A process gets its own private, virtual address space in memory from the OS. Thus, a process cannot communicate/exchange data with other processes without the help of the kernel, the heart of the OS that controls everything. Processes can split up in threads, parallel tasks that share the virtual address space, which can exchange data very quickly without intervention of the OS (global, static, and instance fields, etc.).

The thread is the entity to which the modern operating system (Windows NT based, Solaris, Linux) assigns CPU time. While you could split a CPU intensive program in processes (modern OS sees it as 1 process consisting of one thread), threads of the same process have much less overhead and synchronize data much quicker. The operating system assigns CPU time to running threads based on their priority. Performance gains of multi-CPU or multi-core CPU configurations are only high if: You have more than one CPU intensive thread; The threads are balanced - there is not one very intensive and a few others that are hardly CPU intensive; Synchronization between threads (shared data) either happens quickly, thanks to fast interconnects, or little synchronization is necessary; The OS provides well-tuned load-balanced scheduling; The threads are cache friendly (memory latency!) and do not push the memory bandwidth to the limits. In that case, you may typically expect a 70% to 99% performance speed-up, thanks to the second core. Be warned that Intel was already showing performance increases, which are not realistic "up to 124%". [1]

The benchmarks compare a Pentium 4 EE 840, a Dual Core Pentium 4 3.2 GHz (1 MB L2), to a 3.73 GHz Pentium 4 EE with 2 MB L2. Especially in the last benchmark, a game running in foreground with two PVR (Personal Video Recorders), and tuners running in the background gives a very weird result. How can a slower Dual core be more than 100% faster than a single core with a higher clock speed, bigger caches and a faster FSB? When we first asked Intel, they pointed to the platform (newer chipset, etc.), but no new chipset can make up for a 33 % slower FSB.

We suspected that different thread priorities (giving the game thread a higher priority) might have been the explanation, but Intel's engineers had another interesting explanation. They pointed out that the Windows scheduler can sometimes be inefficient when running many heavy tasks on a single CPU and might have given the game less CPU time than normal. The Windows scheduler didn't have that problem when two CPUs were present: less context switching between threads, and no reason to give the game not enough CPU time. Prepare for a load of hard-to-interprete benchmarks on the Internet...

Threads & Programming

Programming in Threads brings many advantages, especially on dual-cores. Threads with long running CPU intensive processing are not able to the give the system a sluggish unresponsive feeling when you want you do something else at the same time. The OS scheduler should take care of that as long as the CPU is fast enough, but the Intel benchmarks above show you that that is only true in theory. Dual and multi-core can definitely help here. Threads make a system more responsive and offer a very nice performance boost on multi-CPU systems. But the other side of the medal is complexity. Running separate tasks in separate threads that do not need to share data is the easiest part of making a program more suitable to multi-core CPUs. But that has been done a long time ago, and the real challenge is to handle threads that have to share data. The programmer also has to watch over the fact that high amounts of threads introduce overhead in the form of (unnecessary) context switches even on dual core CPUs.

A nasty problem that might pop-up is a "deadlock", when two threads are each waiting for the other to complete, resulting in neither thread ever completing. A race between two threads might sound speedier, but it means that the result of a program's operation depends on which of two or more threads completes first. The problem becomes exponentionally worse if more and more threads are able to run into these problems. Both the Java and .Net ("Threadpool") platform provide classes and tools to deal with thread management - programmers are not left on their own. The problem is not creating threads, but debugging the multithreaded programs. The result is that multithreading has been used sparingly and with as few threads as possible to keep complexity down. But the right tools are coming, right?

Multi-threading toolbox

Intel does provide a few interesting tools for multithreading.

OpenMP is the industry standard for "portable" multi-threaded application development, and can do fine grain (loop level) and large grain (function level) threading.

The newest Intel compilers are even capable of Auto-Parallelization. That sounds fantastic - would multithreading be as easy as using the right compiler? After all, Intel's compiler is able to vectorize existing FP code too. Just recompile your FP intensive code with the right compiler flags and you get speed-ups of 100% and more as the Intel compiler is able to replace x87 instructions by faster SSE-2 alternatives.

Let us see what Intel says about auto-parallelization:
"Improve application performance on multiprocessor systems using auto-parallelization for automatic threading of loops. This option detects parallel loops capable of being executed safely in parallel and automatically generates multi-threaded code. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronizations. It also provides the benefit of the performance available from multiprocessor systems and systems that support Hyper-Threading Technology."
So, it is just a matter of using the right tools? A chicken and egg problem? When the hardware is there, the software will follow? Is it just a matter of having the right tools and enough market penetration of multi-core CPUs? We asked Tim Sweeney, founder of Epic and a multi-threaded game engine programming guru.

Index Unreal 3
Comments Locked


View All Comments

  • ChronoReverse - Tuesday, March 15, 2005 - link

    Eh? 20% speed reduction? The dual-core sample in the new post was running at 2.4GHz (FX-53). Sure it's not FX-55 speeds but it's still faster than most everything.
  • kmmatney - Monday, March 14, 2005 - link

    edit - I just read some of the above posts. Yes, I agree that dual core can be more efficent than dual cpu. However you have about a 20% reduction in core speed which the dual core optimizations will have to overcome, when compared to a single core cpu.
  • kmmatney - Monday, March 14, 2005 - link

    For starters, why would dual core be any different than dual cpu? One of the Quake games (quake 3?) was able to make use of a second cpu, and the gain was very minimal. I'm not even sure Id bothered with dual cpu use for Doom3. If everybody has dual core cpu's, then obviously more work would be done to make use of it, but we've had dual cpu motherboards for a long time already.
  • Verdant - Monday, March 14, 2005 - link

    there is no one who (has a clue) doubts that you will see an ever increasing level of cores provide an ever increasing level of performance, in fact i would not be surprised if the Mhz races of the 90s become the "number of core" races of this decade.

    but i think the one line that really hit the nail on the head is the one about a lack of developer tools.

    writting a lower level multi-threaded application is extremely difficult, game developers aren't using tools like java or c# where it is a matter of enclosing a section of code in a synchronized/lock block, throwing a few wait() calls in and launching their new thread. - the performance of these platforms just isn't there.

    for consideration - a basic 2 thread bounded buffer program in C is easily 200 lines of code, while it can easily be done in a language like C# in about 20.

    developers are going to need to either: move to one of these new languages/platforms and take the performance hit, develop a new specialized platform/language, or they will most likely go bankrupt with the old tools.

    the other thing that may have some merit - is a compiler that can generate multi-threaded code from single thread code, however to have any sort of real effect it will need to have an enormous amount of research poured into it, as automatically deciding un-serializable tasks is a huge AI task. Intel's current compiler obviously is many years away from the sort of thing i am talking about.
  • Doormat - Monday, March 14, 2005 - link


    The AMD architecture is different than Intels dual core architecture.

    AMD will have a seperate HTT link between chips (phy layer only) for intercore communication, and a seperate link to the memory arbitor/access unit.

    Whereas intel (when they opt for two seperate cores, two seperate pieces of silicon) will have a link between the two processors, but its is a bus, and not point-to-point, and also will share that bus with all traffic out to the northbridge/mch. Memory traffic, non-DMA I/O traffic, etc.

    In other words, AMD has a dedicated intercore comm channel via HTT while Intel does not. This will affect heavily interconnected threads.
  • saratoga - Monday, March 14, 2005 - link

    "Unless you hit a power and/or heat output wall.

    Tell nVidia that parallell GPUs are bad, they alreay sell their SLI solution for dual-GPU computers."

    Multicore doesn't make much sense for GPUs because its not cost effective, and because GPUs do not have the same problems as CPUs. With a GPU you can just double the number of pipelines and your throughput more or less doubles (though bandwidth can be an issue here), and for a fraction the cost of two discreet boards or two seperate GPUs. That approach doesn't work well with CPUs, hence the interest in dual core CPUs.

    "Isn't a high IPC-count also a form of parallelism? If so, then beyond a certain count won't it be just as hard to take advantage of a high IPC-count."

    Yup. High IPC means you have a high degree of instruction level parallelism. Easily multithreaded code means you have a high degree of thread level parallelism. They each represent part of the parallelism in a piece of code/algorythm, etc.
  • Fricardo - Monday, March 14, 2005 - link

    "While Dual core CPUs are more expensive to manufacture, they are far more easier to design than turning a single core CPU into an even more wider complex CPU issue."

    Nice grammer ;)

    Informative article though. Good work.
  • suryad - Monday, March 14, 2005 - link

    Dang...good thing I have not bought a new machine yet. I am going to stick with my Inspiron XPS Gen1 for a good 3-4 years when my warranty runs out before I go run out and by another top of the line laptop and a desktop.

    It will be extremely interesting how these things turn out. Things had been slowing down quiet a lot in the technology envelope front last year but AMD with its FX line of processors were giving me dual cores...I want an 8 cored AMD FX setup. I think beyond 8 the performance increases will be zip.

    I am sure by the end of 2006 we will have experienced quiet a massive paradigm shift with multi cored systems and software taking advantage of it. I am sure the MS DirectX developers for WinFX or DirectX Next or WGF 1.0 or whatever the heck it is called are not going to be sitting on their thumbs and not fixing up the overheads associated as mentioned in the article with the current Direct3D drivers. So IMHO we are going to see a paradigm shift.

    Good stuff. And as far as threads over processes, I would take threads, lightweight...thats the main thing. Threading issues are a pain in the rear though but I am quiet confident that problem will be taken care of sooner or later. Interesting stuff.

    Great article by the way. Tim Sweeney seems quiet humble for a guy with such knowhow. I wonder if Doom's next engine will be multithreaded. John Carmack i am sure is not going to let the UE 3.0 steal all the limelight. What I would love to see is the next Splinter Cell game based on the UE 3.0 engine. I think that would be the bomb!!
  • stephenbrooks - Monday, March 14, 2005 - link

    In the conclusion - some possibly bad wording:

    --[The easiest part of multithreading is using threads that are running completely independent, that don't share any data. But this source of threading is probably already being used almost to the fullest.]--

    It'll still provide large performance increases when you go to multi-cores, though. You can't "already use" the concept of little-interacting threads when you don't have multiple cores to run them on! This is probably actually one of the more exciting increases we'll see from multi-core.

    The stuff that needs a lot of synchronising will necessarily be a bit of a compromise.
  • Matthew Daws - Monday, March 14, 2005 - link

    #26: I don't think that's true:

    This suggests (and I'm certain I've read this for a fact elsewhere) that each *core* has it's own cache: this means that cache contention will still be an issue, as it is in dual-CPU systems. I'm not sure about the increased interconnection speed: it would certainly seem that this *should* increase, but I've also read that, in particular, Intel's first dual-core chips will be a real hack in regards to this.

    In the future, sure, dual-core should be much better than dual-cpu.


Log in

Don't have an account? Sign up now