Tuesday, March 24, 2009

Concurrency on modern hardware

Multi-threading is a challenge for most programmers. Even the basics of threading are a little confusing. Then there is the challenge of thinking about shared resources, memory, thread contention, cache coherency, memory bus stalls, and starvation. I'd like to briefly discuss each of these topics to create a sense of familiarity for the reader.

Overview:
What is the point of concurrency? Why use multi-threading? And how does it all fit together?

All of these are valid questions, and I'll start with the simplest question first: what is the point of concurrency?
Concurrency exists to take advantage of fast hardware. It often does so by putting different applications on different pieces of hardware. When multiple pieces of hardware are not available, it staggers the execution of applications, allowing each to run for a short period. The basic idea is that CPU time is available, so the OS will try to move applications around to make each application run faster.

Concurrency also exists to add the appearance of "stability". In OSs that don't support concurrency, an application can appear to be 'hung' and not executing properly. You have probably seen this in Windows from time to time when you log into a website and the browser never returns. Many things can cause this, like long waits, a program caught in an endless loop, or even a bad memory access. In any case, the program hangs. Without concurrency, the entire computer would lock up, not just your program, and then you'd spend more time rebooting.

We'll discuss the other two questions throughout the rest of this discussion.

Basics:
Processes [Wiki] are the equivalent of a computer program. These are usually applications, but if you bring up your Activity Monitor (or Task Manager), you'll see a lot of processes running which are not visible applications. These are often things related to networking, the OS, time services, and so on. There are other things called cron jobs, batch processes, and drivers which also take up CPU time but don't always appear in your Activity Monitor. Each process has its own memory, performance characteristics, needs for the file system, and many other characteristics which have the potential to interfere with other processes. The differences between a process and a thread are subtle and beyond the scope of this discussion: just consider them to be the same for now.

Concurrency [Wiki] allows us to take advantage of multiple pieces of hardware when available. When they are not available, all modern OSs allow multi-threading, which is similar. The concept is one of allowing your CPU to execute multiple processes simultaneously. When the hardware is available, such as a dual-core Pentium or the Xbox 360's three cores (six hardware threads), each process can execute on a separate piece of hardware and stay out of the way of other processes, theoretically. The reality is much more complex. When hardware is not available (only one CPU, for example), the OS will stagger execution of programs, making them take turns. Process1 will work for a while, then the OS will interrupt Process1 to allow Process2 to run a little bit. After a short time, the OS will interrupt Process2 and continue running Process1. This is a highly simplified example, but it illustrates the basics of what all modern OSs must do.

Interrupt [Wiki] is a hardware-driven change of direction. Basically, an interrupt is like a telephone: no matter what you're doing, when the phone rings you stop and answer it. An interrupt usually comes from some external hardware and notifies the CPU that something has changed. In modern CPUs, this is most often a timer telling the CPU that a certain amount of time has elapsed, which is often the cue for the CPU to change from one process to another. Even when multiple CPUs are available, the interrupts are going crazy telling the CPUs that a file has finished loading, that a DMA transfer has completed, that a timer has gone off, or that a user has hit a key on the keyboard.

Kernel [Wiki] is the low-level code that decides which process gets a chance to run each time it receives an interrupt. It routes events, suspends processes, and makes timing choices.

Process dispatch or allocation has to do with putting processes on different pieces of hardware (CPUs) for execution. This is a decision made by the OS (OSs make algorithmic choices, not true decisions) and can be overridden by configuration files. In fact, on the Xbox 360, you can specify on which CPU any thread will be created. Many other OSs allow similar dispatch models and configuration capabilities.
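For example, on Linux you might pin a thread to a particular CPU like this (a sketch only; pthread_setaffinity_np is Linux-specific, and Windows offers SetThreadAffinityMask for the same purpose):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Restrict the calling thread to a single CPU.
void PinCurrentThreadToCpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);        // start with an empty set of CPUs
    CPU_SET(cpu, &set);    // allow only the requested CPU
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}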

Parallel computing [Wiki] allows us to set up situations where a single task can be completed by different processes running simultaneously, often on different CPUs or chips. This is typically accomplished through the concept of scatter-and-gather, where a single task is subdivided into multiple smaller tasks, the results are computed on different CPUs, and then the results are pulled together into a final result by one of the CPUs.
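Here is a minimal scatter-and-gather sketch: summing an array on two threads and combining the partial results (it uses the standard C++ thread library for brevity, but the idea is the same with any thread API):

#include <thread>
#include <vector>
#include <numeric>
#include <iostream>

int main()
{
    std::vector<int> data(1000000, 1);
    long sumA = 0, sumB = 0;
    std::vector<int>::iterator middle = data.begin() + data.size() / 2;

    // Scatter: each thread sums its own half of the data.
    std::thread a([&] { sumA = std::accumulate(data.begin(), middle, 0L); });
    std::thread b([&] { sumB = std::accumulate(middle, data.end(), 0L); });

    // Gather: wait for both halves, then combine them.
    a.join();
    b.join();
    std::cout << sumA + sumB << "\n";
}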

Race conditions [Wiki] describe one of the basic problems with threading and concurrency in general. They happen when two processes compete for the same resource and the outcome depends on the order in which they get there; they can lead to deadlock or starvation. In either case, things don't work as planned, and some processes can stall or not execute at all.
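The classic illustration is two threads incrementing a shared counter without any synchronization. The increment is really a read-modify-write, so the threads can overwrite each other's updates (a sketch; the exact final value will vary from run to run):

#include <thread>
#include <iostream>

int counter = 0;

void Increment()
{
    for (int i = 0; i < 100000; ++i)
        ++counter;   // not atomic: read, add one, write back
}

int main()
{
    std::thread a(Increment);
    std::thread b(Increment);
    a.join();
    b.join();
    std::cout << counter << "\n";   // usually less than 200000
}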

Synchronization [Wiki] is the traffic cop that you need to put into place to make all of this work. Basically, synchronization objects work directly with hardware to control who is accessing any given resource at any given time. This can be a piece of memory, a file, or even a section of code that only one process should execute at a time. These synchronization objects are all similar and are built on the concept of atomics, which are CPU-level operations guaranteed to execute indivisibly, so no other CPU can observe them half-finished (and they are very fast).
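Either flavor fixes the counter race shown above: guard the counter with a mutex, or make the counter itself atomic so the increment is indivisible (a sketch using the standard C++ facilities):

#include <atomic>
#include <mutex>

std::mutex counterMutex;
int guardedCounter = 0;

void IncrementWithMutex()
{
    std::lock_guard<std::mutex> lock(counterMutex);  // the traffic cop
    ++guardedCounter;   // only one thread can be here at a time
}

std::atomic<int> atomicCounter(0);

void IncrementAtomically()
{
    ++atomicCounter;    // a single indivisible hardware operation
}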

Memory bus [Wiki] is the pathway the CPUs use to reach RAM and can be a major bottleneck in computers with multiple CPUs.

Why use multi-threading?
Concurrency is hard. Even very experienced programmers can have great difficulty managing 'wait states', synchronization, performance problems, and design. This makes the justification of the effort to do multi-threading even more tenuous. Still, it's hard to argue with the ability to do things twice as fast (or more with more hardware). Also, the appearance of an application that responds to user interaction while opening and reading a file increases the likelihood that consumers will consider your software more professional and stable.

Another consideration is that allowing your application to do two things simultaneously is useful, since a user can only type so quickly. If you consider a word processor, it can look up words as you type, spell check, grammar check, and auto-save. All of this is managed through multi-threading. Gone are the days when you had a separate spell-checking step before printing your document (does anyone but me still remember those days?).

Also, other applications like MSN Messenger, AIM, and others allow you to stay connected to the internet while typing, adding smiley faces, and updating friend-status. In a manner of speaking, all of this is just plain magic.

How does it all fit together?
Most OSs offer a method that looks like this:
bool CreateThread (FunctionPointer, StackSize, PriorityLevel);

This can take a variety of parameters, but often it takes a stack size (or a pointer to RAM to use as the stack), a pointer to a function or some code, and a priority for the thread. That's all there is to the OS support. The rest is up to you. You need to manage the synchronization, but the OS will usually do the rest. Some specialized hardware has special macros or methods to allow you to tell the OS where to put the thread (which CPU), but generally, you don't worry about such things.
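As a concrete example, here is what thread creation looks like with POSIX threads (a sketch; Windows' CreateThread is similar in spirit). The pieces map onto the pseudo-signature above: a function pointer, an optional attribute object carrying things like the stack size, and an argument for the function:

#include <pthread.h>
#include <cstdio>

void* Worker(void* arg)
{
    std::printf("hello from thread %d\n", *(int*)arg);
    return 0;
}

int main()
{
    pthread_t thread;
    int id = 1;
    pthread_create(&thread, 0 /* default attributes */, Worker, &id);
    pthread_join(thread, 0);   // wait for the thread to finish
}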

A critical determination is the priority. Another may be the stack size. Let me explain why these are important. The priority affects the Kernel and helps the OS decide whether to let a process or thread run. If all threads have equal priority, the Kernel simply lets each run in turn in a round-robin manner. Once a thread has higher priority than others, it has the potential to prevent other threads from running very much (called starvation), but it can also mean that a small piece of code that you want to run quickly will receive as much of the CPU as the Kernel will give it. Low-priority threads run "in the background" and can accomplish tasks with little effect on the user experience. For most purposes, you should set your threads to "Below normal" priority unless you are certain that you don't mind affecting the user experience.
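On Windows, that advice translates to a single call (a sketch; other OSs expose similar knobs, for example pthread_setschedparam on POSIX systems):

#include <windows.h>

void RunInBackground()
{
    // Ask the Kernel to schedule the current thread less aggressively.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);

    // ... do the low-priority work here ...
}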

The stack size can be important too, depending on what you are doing. The stack size determines how much memory is available for a series of function calls on the stack, as opposed to the free store. Every time you call a function, the current function must push its data onto the stack, and then some values are pushed onto the stack for use inside the new function. If you are using recursion, this can grow rather quickly. You will probably need a minimum of 4 KB, but plan on needing a lot more, like 16 KB, or maybe as much as a megabyte if you are sorting some large dataset.
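With POSIX threads, requesting a larger stack looks like this (a sketch; most systems also enforce a minimum, PTHREAD_STACK_MIN):

#include <pthread.h>

void* DeeplyRecursiveWork(void* arg)
{
    // ... recursive work that needs a deep stack would go here ...
    return 0;
}

void StartBigStackThread()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 1024 * 1024);   // ask for 1 MB
    pthread_t thread;
    pthread_create(&thread, &attr, DeeplyRecursiveWork, 0);
    pthread_attr_destroy(&attr);
    pthread_join(thread, 0);
}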

There are a few synchronization tricks and foibles to watch for, and I will detail a half dozen of them in a future blog. For now, just remember to use a mutex, use it around every access to a shared resource or variable, and lock it as close to the access as possible rather than locking early or broadly. Code for this often looks like this:

void SomeBigMultithreadedClass::Foo()
{
    m_Mutex.Lock();     // claim exclusive access before touching the resource

    // do something expensive like allocate RAM, access a file, etc.

    m_Mutex.Unlock();   // release it as soon as the work is done
}
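One caution about this pattern: if Foo returns early or throws between Lock and Unlock, the mutex is never released and every other thread waiting on it will stall. A common C++ remedy is a small scope-guard object that unlocks automatically (a sketch, assuming the same Mutex class as above):

class ScopedLock
{
public:
    explicit ScopedLock(Mutex& mutex) : m_Mutex(mutex) { m_Mutex.Lock(); }
    ~ScopedLock() { m_Mutex.Unlock(); }   // runs even on early return or exception
private:
    Mutex& m_Mutex;   // the Mutex class is assumed from the example above
};

void SomeBigMultithreadedClass::Foo()
{
    ScopedLock lock(m_Mutex);   // unlocked automatically at scope exit

    // do something expensive like allocate RAM, access a file, etc.
}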
Many more details will come in my next blog.
