Threads & Performance

"Threads" are a much-discussed subject, so we would like to give a short introduction for those of you who are not familiar with them. To understand threads, you must first understand processes. Any decent OS controls how memory is allocated to the different programs, or processes. Each process gets its own private, virtual address space from the OS. Thus, a process cannot communicate or exchange data with other processes without the help of the kernel, the heart of the OS that controls everything. A process can be split up into threads: parallel tasks that share the process's virtual address space and can therefore exchange data (global, static, and instance fields, etc.) very quickly, without intervention of the OS.
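To make the shared address space a bit more concrete, here is a minimal Python sketch. The names `results`, `worker` and `run_workers` are our own illustration, not part of any benchmark or product discussed here. Two threads write into the same dictionary without any help from the kernel to pass the data around. Note that in standard CPython the threads take turns executing Python code, so this shows data sharing, not a speed-up.

```python
import threading

# Shared state: one dictionary in the single address space
# that every thread of this process can see.
results = {}

def worker(name, numbers):
    # Each thread stores its partial sum directly in the shared dictionary.
    results[name] = sum(numbers)

def run_workers():
    threads = [
        threading.Thread(target=worker, args=("low", range(0, 500))),
        threading.Thread(target=worker, args=("high", range(500, 1000))),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until both threads have finished
    # The main thread reads what the worker threads wrote.
    return results["low"] + results["high"]

total = run_workers()
```

Two separate processes could not share `results` this way; they would need a pipe, a socket or a shared memory segment set up through the OS.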

The thread is the entity to which a modern operating system (Windows NT based, Solaris, Linux) assigns CPU time. While you could split a CPU intensive program into several processes (a modern OS sees each of them as one process consisting of one thread), threads of the same process have much less overhead and synchronize data much more quickly. The operating system assigns CPU time to running threads based on their priority. Performance gains of multi-CPU or multi-core CPU configurations are only high if:

  • You have more than one CPU intensive thread;
  • The threads are balanced - there is not one very intensive thread and a few others that are hardly CPU intensive;
  • Synchronization between threads (shared data) either happens quickly, thanks to fast interconnects, or little synchronization is necessary;
  • The OS provides well-tuned, load-balanced scheduling;
  • The threads are cache friendly (memory latency!) and do not push the memory bandwidth to its limits.

In that case, you may typically expect a 70% to 99% performance speed-up, thanks to the second core. Be warned that Intel was already showing performance increases of "up to 124%", which are not realistic. [1]


The benchmarks compare a Pentium 4 EE 840, a dual-core Pentium 4 at 3.2 GHz (1 MB L2), to a 3.73 GHz Pentium 4 EE with 2 MB L2. The last benchmark in particular, a game running in the foreground with two PVRs (Personal Video Recorders) and tuners running in the background, gives a very weird result. How can a slower dual core be more than 100% faster than a single core with a higher clock speed, bigger caches and a faster FSB? When we first asked Intel, they pointed to the platform (newer chipset, etc.), but no new chipset can make up for a 33% slower FSB.

We suspected that different thread priorities (giving the game thread a higher priority) might have been the explanation, but Intel's engineers had another interesting explanation. They pointed out that the Windows scheduler can sometimes be inefficient when running many heavy tasks on a single CPU, and might have given the game less CPU time than normal. The Windows scheduler didn't have that problem when two CPUs were present: there was less context switching between threads, and no reason to starve the game of CPU time. Prepare for a load of hard-to-interpret benchmarks on the Internet...
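When reading such benchmarks, it can help to check a claimed gain against a simple best-case model. The sketch below is our own simplification, not Intel's methodology. It assumes fully independent threads with no synchronization cost and no memory bottleneck, and the function names and the example loads are hypothetical.

```python
def best_case_speedup(thread_loads, cores=2):
    """Upper bound on the speed-up when independent threads are spread over `cores`."""
    total = sum(thread_loads)
    # No schedule can finish before the longest thread does,
    # nor faster than a perfectly even split of the total work.
    best_time = max(max(thread_loads), total / cores)
    return total / best_time

def gain_percent(speedup):
    # A speed-up of 2.0x is a 100% performance increase.
    return (speedup - 1.0) * 100.0

def exceeds_linear_ceiling(claimed_gain_percent, cores=2):
    # Under this model, extra cores can add at most 100% per additional core.
    return claimed_gain_percent > (cores - 1) * 100.0

balanced = gain_percent(best_case_speedup([10, 10]))    # two equally heavy threads
unbalanced = gain_percent(best_case_speedup([18, 2]))   # one heavy, one light thread
suspicious = exceeds_linear_ceiling(124)                # the "up to 124%" claim
```

Under these assumptions, two balanced threads can gain at most 100% from a second core, while an 18/2 split gains only about 11%. A claimed gain above the ceiling, such as 124%, signals that something other than the second core (scheduling, caches, the platform) is influencing the result.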

Threads & Programming

Programming with threads brings many advantages, especially on dual-core CPUs. Threads with long-running, CPU intensive processing should not be able to give the system a sluggish, unresponsive feeling when you want to do something else at the same time. The OS scheduler should take care of that as long as the CPU is fast enough, but the Intel benchmarks above show that this is only true in theory. Dual and multi-core CPUs can definitely help here. Threads make a system more responsive and offer a very nice performance boost on multi-CPU systems. But the other side of the coin is complexity. Running separate tasks that do not need to share data in separate threads is the easiest part of making a program more suitable for multi-core CPUs. But that was done a long time ago, and the real challenge is to handle threads that have to share data. The programmer also has to keep in mind that a large number of threads introduces overhead in the form of (unnecessary) context switches, even on dual-core CPUs.

A nasty problem that might pop up is a "deadlock": two threads are each waiting for the other to complete, so neither thread ever completes. A "race" between two threads might sound speedier, but it means that the result of a program's operation depends on which of two or more threads completes first. The problem becomes exponentially worse as more and more threads are able to run into these problems. Both the Java and .NET ("ThreadPool") platforms provide classes and tools to deal with thread management - programmers are not left on their own. The problem is not creating threads, but debugging multithreaded programs. The result is that multithreading has been used sparingly and with as few threads as possible to keep complexity down. But the right tools are coming, right?
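To illustrate both problems, here is a minimal Python sketch of one common defensive pattern. The `Account` class, `transfer` and `run_transfers` are hypothetical names of our own. A lock around each update prevents the race, and always taking the two locks in the same order prevents the deadlock that would otherwise be possible when two threads transfer in opposite directions.

```python
import threading

class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(source, target, amount):
    # Deadlock avoidance: always take the locks in order of account_id, so two
    # opposite transfers can never each hold one lock while waiting for the other.
    first, second = sorted((source, target), key=lambda acc: acc.account_id)
    with first.lock:
        with second.lock:
            # Race avoidance: both read-modify-write updates happen under the locks.
            source.balance -= amount
            target.balance += amount

def run_transfers(rounds=1000):
    a = Account(1, 1000)
    b = Account(2, 1000)

    def repeat(source, target):
        for _ in range(rounds):
            transfer(source, target, 1)

    # Two threads move money in opposite directions at the same time.
    threads = [
        threading.Thread(target=repeat, args=(a, b)),
        threading.Thread(target=repeat, args=(b, a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance

balances = run_transfers()
```

If each thread instead locked its own source account first, the two threads could each hold one lock and wait forever for the other. That is exactly the kind of bug that is easy to write and hard to reproduce in a debugger.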

Multi-threading toolbox

Intel does provide a few interesting tools for multithreading.

OpenMP is the industry standard for "portable" multi-threaded application development, and can do fine grain (loop level) and large grain (function level) threading.
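The difference between the two granularities can be sketched as follows. This is not OpenMP, which targets C, C++ and Fortran. It is a rough Python analogy using the standard library's thread pool, and the task names (`square`, `load_level`, `mix_audio`) are made up for illustration. In standard CPython such threads will not speed up CPU-bound loops, so read it as a picture of how the work is divided, not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def load_level():
    # Hypothetical coarse task, standing in for a large unit of work.
    return "level loaded"

def mix_audio():
    # Second hypothetical coarse task.
    return "audio mixed"

def loop_level(values, workers=2):
    # Fine grain: the iterations of one loop are handed out to worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))  # results keep the input order

def function_level(workers=2):
    # Large grain: whole functions run as separate tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(load_level), pool.submit(mix_audio)]
        return [f.result() for f in futures]

squares = loop_level(range(5))
tasks = function_level()
```

In both cases the pool, not the programmer, decides which thread runs which piece of work.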

The newest Intel compilers are even capable of Auto-Parallelization. That sounds fantastic - could multithreading be as easy as using the right compiler? After all, Intel's compiler is able to vectorize existing FP code too. Just recompile your FP intensive code with the right compiler flags and you can get speed-ups of 100% and more, as the Intel compiler is able to replace x87 instructions with faster SSE2 alternatives.

Let us see what Intel says about auto-parallelization:
"Improve application performance on multiprocessor systems using auto-parallelization for automatic threading of loops. This option detects parallel loops capable of being executed safely in parallel and automatically generates multi-threaded code. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronizations. It also provides the benefit of the performance available from multiprocessor systems and systems that support Hyper-Threading Technology."
So, it is just a matter of using the right tools? A chicken and egg problem? When the hardware is there, the software will follow? Is it just a matter of having the right tools and enough market penetration of multi-core CPUs? We asked Tim Sweeney, founder of Epic and a multi-threaded game engine programming guru.

49 Comments

  • NullSubroutine - Monday, March 14, 2005

    I think there are a few things that most people overlook when looking at multi-cpu/multi-core. Almost all benchmarks that I have seen are written and tested on systems with clean installs that have no other programs running (anti-virus, aim, msn, teamspeak, IRC, p2p software, firewall, decode human genome :b, etc). I would think that most people do leave many programs open, such as those above, when playing games.

    With this in mind, people will find an increase in system performance when leaving multiple programs running. It won't be an increase in performance for benchmark testbeds so much as an increase in real world performance.

    So basically it won't increase speed in these circumstances, but limit the decrease of fps while running many different programs.
  • fitten - Monday, March 14, 2005

    Article: "Be warned that Intel was already showing performance increases, which are not realistic "up to 124%"."

    #5, there's another explanation as well, but it's a more rare condition. Suppose you had two processors (doesn't even have to be dual core), each with 1M L2 cache. Suppose you also had a problem that has data that is 1.5M in size and is very coarse grained (very parallelizable). One processor cannot fit all the data into L2 cache so it will have to run at main memory speeds most/all of the time. With two processors, each gets 768K, which can easily fit into its L2 cache, which enables each processor to run at L2 cache speeds. This would show up as a superlinear speedup (two cores = more than 2X as fast). This is an extreme example, but one I expect to find in published marketing propaganda.


    #13 " A though! I still think threads are rubbish, that processes and better schedulers are the way forward. "

    Well, with threads you get shared memory for "free", if you've ever written processes that use shared memory, well, there you are. However, since a threaded kernel and a process based kernel are pretty much the same when a process has only one thread, there's little difference between the two for single-threaded executables and you can continue to use your multi-process model without any problems.

    As with #17... like it or not, multi-core/multi-processor boxes are what's coming. You can choose to use what resources are available to you or you can stick to one process programming. Some groups will choose to use what resources are available and some won't. The marketplace will sort out the winners/losers based on which solution is better.

    #18 The PPU is just another form of multiprocessing (just like GPUs are). It's just Asymmetric Multiprocessing (AMP) instead of Symmetric Multiprocessing (SMP). It's not new or anything. I do agree, though, that the PPU has a lot of potential and, just out of my own preferences, goes by the idea that adding specialized hardware (cheaply) usually is a bigger win than adding more generalized hardware. Just think of graphics cards today. Adding a relatively cheap graphics card will make your game run much better/prettier than adding another P4 or Opteron.

    Basically, my thoughts are this: The gaming industry has already gone "multi-threaded" in an asymmetric way simply because of 3D video cards. They already have solved some problems by abstracting parts of their problem. This is simply adding more resources that they can take advantage of, or not, as they see fit. Having dual-core or dual processor systems doesn't prevent them from writing as they've done today. The main issue, for the short term, is that they will need to know whether or not they are on a dual core machine and write accordingly. The main reason that multithreaded games haven't really caught on as of yet is because 99% (or more) of the target audience has only one core. Spending the amount of time/effort to optimize for dual processors for less than 1% of your target market doesn't make sense. If 90% of the market had dual processors, then it would probably be worth the effort to plan to use the resources available. Since both major CPU houses are going dual core and it looks like that's the "way it's gonna be", there will be a rocky period for a while while dual core machines are rare, but they will get more common until the point where they are in the majority. At that time, it will make sense to consider single core machines as the degenerate case and, basically, make single cores the exception instead of the rule.
  • Calin - Monday, March 14, 2005

    #20, a multicore implementation could have shared cache, and also have very fast inter processor communications. You could write a program with small interdependent threads that wait for each other to finish and then update parts of some common data. The data used stays in the common cache, and every update is made extremely fast.
    Compare this to a dual processor, that must maintain its caches in synchronization. After a fraction of a millisecond (or less) of work, the processors update different portions of the common data. And there it goes: invalidation of cache lines, writing of modified cache lines to memory, the processors fighting for a single FSB (the case with Intel Pentium processors), and so on. You can see that there are some cases (even if somewhat artificial) when a proper implementation of dual core can be much faster than multiprocessor.
    The best advantage the multicore will have over multiprocessor would be in numerical tasks like weather prediction, and other highly interdependent computation tasks.
  • hzmonte - Tuesday, October 4, 2005

    "a multicore implementation could have shared cache and also have very fast inter processor communications... Compare this to a dual processor, that must maintain its caches in synchronization." Is this the real reason that multi-core multiprocessing is better than multi-chip multiprocessing (the traditional SMP)? A multi-core chip can have dedicated caches (per core) too, and that requires synchronization. And multi-chip SMP could also have shared cache and fast inter-chip communication. Well, you may argue that it is easier to make inter-core communication faster than inter-chip communication. But is this really the fundamental reason why multicore is better than multichip? Could someone explain why a processor manufacturer and a consumer would prefer making/buying a multicore processor rather than multichip processors? As far as power consumption and leakage are concerned, isn't it true that multichip is more manageable? In a paper "Planning Considerations for Multicore Processor Technology" by John Fruehe (May 2005) in dell.com/powersolutions, the author compares the effective performance level of multicore and multichip processors. (But he does not address my question.) Without giving a reason, he assumes that the core-to-core scalability is 70% (that is, the second core delivers 70% of its processing power due to overhead) whereas the estimated socket-to-socket (i.e. chip-to-chip) scalability is 80% (that is, the dual processors together achieve 180% of the processing power of a single processor). That is kind of interesting. I really want to see a comparison between multi-core multiprocessing vs. multi-chip multiprocessing.
  • ksherman - Monday, March 14, 2005

    at80eighty, by sexy, I mean SCARY AS ALL FREAKIN REASON!!! ;-)
  • Calin - Monday, March 14, 2005

    High IPC is not the form of parallelism from the article - the focus of the article was on running a process on two (or more) different cores. The idea is that high IPC profits all programs, no matter how they are written. Multi-threading is different - the idea is to have parts of a program that execute simultaneously but with very few interrelations (you can have a thread to paint the interface in a game, while having another thread to paint the rest of the screen; the threads would have almost no correlations, except for sending commands).
    High IPC is not a solution in the x86 world because the code tends to have dependencies close to each other, so you can start executing 100 instructions at a time, but 99 of them need to wait for the execution of one. You simply have those moments when all execution must wait for an instruction to end.
    EPIC (Itanium) will help with that, as the high IPC could be guaranteed by the instructions - at every clock you can execute one instruction that is equivalent to several x86 instructions. So, the performance would be the clock speed multiplied by an IPC of 3 or 4, unlike the Athlon (let's say), which has a performance generated by its larger clock speed multiplied by 1 IPC or something.
  • Kensei - Monday, March 14, 2005

    Wonderful article! I loved the "hardware meets software" focus of this piece. I've had many questions about the practicality of multi-threaded applications and this article answered many of them. Also, loved the interview with Sweeney.

  • bob661 - Monday, March 14, 2005

    #9
    I am offended by the word "steam".
  • Matthew Daws - Monday, March 14, 2005

    #20:MarriedMan - Yes, I think so. This is actually an interesting question. As I understand it, I think both AMD and Intel are using pretty much the same technology in both, so that communication channels on the motherboard (in the dual CPU case) will be replaced by communication channels on the CPU die. I think AMD's approach is better only because HT etc. lends itself to dual-core much better than Intel's older technology. I guess the next generation of dual-core chips might be somewhat different though. Anyone else know anything?
  • MarriedMan - Monday, March 14, 2005

    I assume that when a program is multi-threaded to take advantage of dual core CPUs, it will automatically take advantage of dual CPU systems as well.

    Is that a correct assumption? Will the Unreal 3 engine use multiple single core CPUs on an MP system?
