I've been working recently on improving the performance of locking in D (the D programming language). At the moment the synchronized statement in D simply uses the underlying OS facilities: CRITICAL_SECTION on Windows and the pthread mutex on Linux. This incurs a substantial performance hit for multithreaded programming.
I looked at research in the Java community and found some interesting solutions. They are all versions of Thin Locks, first described by Bacon et al. These guys figured out that when a program is entering a synchronized section, in about 80% of the cases the section is not locked by anybody. The next most common case is nested locking: the section is locked by the same thread recursively. Only rarely is there actual contention for a lock. And when that happens, it's very likely that it will happen again and again. Such a contended lock is most likely part of a shared data structure designed to be accessed by multiple threads.
Thin Locks optimize the most common case of no contention. Every Java object has a word in its header that is used as a thin lock. If this field is zero (I'm skipping details here), the object is not locked. When a thread enters a synchronized section, it optimistically assumes that the lock is zero and tries to flip it to a non-zero value. This is done in one atomic operation, Compare And Swap (CAS). Most processors either have such an instruction built in, or provide primitives to implement it.
CAS checks the value in memory, comparing it to the "expected value." If the comparison succeeds, the "new value" is written in its place. In our case, we expect the value of the thin lock to be zero, and the new value we want to put there is the (non-zero) thread ID of the lock taker. If the CAS succeeds, we are done; we own the lock.
If the CAS fails, the lock has already been taken. Again, the most likely case is that our own thread holds it. So we check if our thread ID is stored in the thin lock. If so, we increment the count field of the thin lock (several fields are cleverly mashed together into one word). We’re done!
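In code, the fast path might look something like the following D sketch. The field layout here (owner thread ID in the low bits, recursion count in the high bits) is purely illustrative, not an actual runtime encoding:

    import core.atomic : cas, atomicLoad, atomicOp;

    // Illustrative layout: owner thread ID in the low bits,
    // recursion count in the high bits (not a real encoding).
    enum countShift = 48;
    enum ownerMask = (cast(size_t) 1 << countShift) - 1;

    struct ThinLock
    {
        shared size_t word; // zero means "unlocked"

        // Returns true if one of the two fast paths succeeded.
        bool tryLock(size_t myTid)
        {
            // Most common case: nobody holds the lock. One CAS and we own it.
            if (cas(&word, cast(size_t) 0, myTid))
                return true;

            // Next most common case: we already hold it. Bump the count.
            if ((atomicLoad(word) & ownerMask) == myTid)
            {
                atomicOp!"+="(word, cast(size_t) 1 << countShift);
                return true;
            }

            return false; // real contention: fall through to inflation
        }
    }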
If we don’t find our own thread ID in the thin lock, we know we have contention. We have to inflate the lock–allocate an additional object (the fat lock) that holds the general-purpose OS-based mutex. Of course we also check if the thin lock hasn’t already been inflated, in which case we just lock the fat lock.
The inflation process is a little tricky: we can't just modify the thin lock while it's being held by another thread. Instead, we spin-wait for it to be released, then try to acquire it ourselves, and then inflate it. Once inflated, the lock remains inflated forever, which is usually the right thing to do anyway, since a lock that's been contended once is likely to be contended many times. The one-time spinning is amortized across many accesses.
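Continuing the sketch, the slow path could look roughly like this. The inflation flag, the fat-lock storage, and the memory-ordering details are all simplified, and unlocking is elided entirely:

    import core.atomic : cas, atomicLoad, atomicStore;
    import core.sync.mutex : Mutex;
    import core.thread : Thread;

    // Illustrative flag: assumes thread IDs are even (e.g. aligned
    // pointers), so bit 0 is free to mean "inflated".
    enum inflated = cast(size_t) 1;

    // Called when tryLock() above returns false.
    void lockSlow(ref ThinLock tl, ref Mutex fat, size_t myTid)
    {
        for (;;)
        {
            auto w = atomicLoad(tl.word);
            if (w & inflated)
            {
                fat.lock(); // already inflated: just take the OS mutex
                return;
            }
            // Spin-wait for the owner to release the thin lock, then grab
            // it ourselves so nobody mutates the word while we inflate it.
            if (w == 0 && cas(&tl.word, cast(size_t) 0, myTid))
            {
                if (fat is null)
                    fat = new Mutex; // allocate the fat lock once
                fat.lock();
                // Publish the inflated state; the lock stays fat forever.
                atomicStore(tl.word, inflated);
                return;
            }
            Thread.yield(); // back off before re-checking
        }
    }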
Of course there are details that I've omitted, but I've given you the gist of the algorithm. It makes uncontended locking so cheap (one CAS) that in most cases there is no reason to implement two versions of the same data structure, one for single-threaded and one for multithreaded use. A single-threaded program doesn't have to pay the multithreaded penalty when using general-purpose libraries.
July 24, 2008 at 5:41 pm
So basically you use a CAS spin-lock that switches over to a (hopefully non-spinning) OS mutex when contention is detected. One question: should we switch back to the CAS spin-lock if we detect that contention is no longer happening?
July 24, 2008 at 7:43 pm
Actually, you don't spin on a CAS. You either grab it or not. If you fail, that's either because you already own the lock, and there's no need for further synchronization; or some other thread owns it, in which case you have to inflate it. Even if, in the meanwhile, the other thread releases the lock, you are still going to inflate it.
You only spin the first time you detect contention. This spin is based on a benign race. You keep reading and testing, and if you miss something, you’ll catch up later.
To perform the actual inflation, you have to grab the lock first. If several threads are racing to inflate the lock, one of them will always succeed, and all the other ones will fall back on the fat lock.
Whether the OS mutex is spinning or not is up to the OS, which has more information about what’s optimal. On a multi-core machine I would expect a spin lock with backoff.
July 25, 2008 at 9:07 am
Are you sure this is necessary on Linux? AFAIK pthread_mutex_lock() does this trick (or something similar: in the case of no contention no kernel code is used; it's resolved very fast in user space) by itself.
July 25, 2008 at 9:17 am
There it is; I think this is what futexes are.
More info: man 7 futex
July 25, 2008 at 11:33 am
Ouch! That made my head hurt.
July 26, 2008 at 9:13 am
This is a good question. A futex is a primitive from which one can build an efficient lock. http://people.redhat.com/drepper/futex.pdf shows such an example. The code is similar to the implementation of the user part of the Thin Lock: it does an optimistic CAS, and if it fails, the futex is used to do the inflation part.
One could in principle implement Thin Lock using futex. However, futex is a general purpose primitive and Thin Lock is a very specialized algorithm. We know exactly how monitors are used in Java (and similarly in D), so we can optimize the heck out of it. Futex has essentially two states visible on the outside: zero and non-zero. Thin Lock stores a thread ID and a counter. This is essential for the optimization of the recursive case. For the inflated Fat Lock on Linux we use a pthread lock, which in turn uses a futex, so we do take advantage of its performance.
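For the curious, here is a minimal sketch of a Drepper-style futex lock, written in D. The syscall number is specific to x86_64 Linux, and this is a bare-bones illustration, not production code:

    import core.atomic : cas, atomicExchange;

    extern (C) long syscall(long number, ...);

    enum SYS_futex = 202; // x86_64 Linux
    enum FUTEX_WAIT = 0;
    enum FUTEX_WAKE = 1;

    shared int state; // 0 = free, 1 = locked, 2 = locked with waiters

    void lock()
    {
        // Fast path: an uncontended acquire is a single CAS, just like
        // the thin lock's user-space fast path.
        if (cas(&state, 0, 1))
            return;
        // Slow path: mark the lock contended (2) and sleep in the kernel;
        // the kernel re-checks the value to avoid lost wakeups.
        while (atomicExchange(&state, 2) != 0)
            syscall(SYS_futex, &state, FUTEX_WAIT, 2, null);
    }

    void unlock()
    {
        // If anyone might be sleeping (state was 2), wake one waiter.
        if (atomicExchange(&state, 0) == 2)
            syscall(SYS_futex, &state, FUTEX_WAKE, 1);
    }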
But there’s more to the D thin lock–I’ll describe it in the next post.
September 3, 2008 at 11:55 pm
I’ve been doing concurrent code in Java for some time, and the decision to make every object lockable is something I now regard as regrettable.
Pretty much all of my code now looks like this:
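Roughly this pattern: lock on a private object that no outside code can reach (shown as a D sketch; the Java idiom is the same):

    class Counter
    {
        // Lock on a private object nobody else can see, not on `this`.
        private Object lockObj;
        private int count;

        this() { lockObj = new Object; }

        void increment()
        {
            synchronized (lockObj)
            {
                ++count;
            }
        }
    }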
because if you lock on “this”, other code can mess you around by locking on YOU.
On another note, there were some recent papers about locking improvements to the "thin lock" model in Java 1.6 that centered around making fat locks more efficient.
The idea was that most locks tend to be taken again and again by the same thread, so rather than unlocking each time, they move the fat lock into a "locked but available" mode, which optimises for the common case.
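Very roughly, the reservation idea looks like the sketch below; the names are made up, and revocation (safely taking the reservation away from its owner), which is the genuinely hard part, is elided:

    import core.atomic : cas, atomicLoad;

    struct ReservedLock
    {
        shared size_t owner; // thread ID the lock is currently reserved for

        bool acquire(size_t myTid)
        {
            // Common case: the lock is still "ours" from last time; no
            // atomic read-modify-write needed, just a load and a compare.
            if (atomicLoad(owner) == myTid)
                return true;
            // First acquisition: claim the reservation with a CAS.
            if (cas(&owner, cast(size_t) 0, myTid))
                return true;
            // Someone else holds the reservation: it would have to be
            // revoked safely before we could proceed (not shown).
            return false;
        }
    }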
September 2, 2009 at 6:02 am
You must be careful with thin locks.
In effect, pthread_mutex_lock does not behave like a thin lock.
To explain why, and why this matters, here is the kind of workaround I mean, sketched below:
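Something like the following (a D sketch over druntime's pthread bindings, reconstructing the workaround being described):

    import core.sys.posix.pthread;

    void eagerLock(pthread_mutex_t* m)
    {
        // Fast path: trylock never blocks, so an uncontended acquire is
        // close to a single CAS.
        if (pthread_mutex_trylock(m) == 0)
            return;
        // Slow path: fall back to the fair, possibly-sleeping acquire.
        pthread_mutex_lock(m);
    }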
It is more efficient (I have already tested it), but the behaviour is different.
pthread_mutex_trylock never blocks, so it comes quite close to performing a single CAS operation. But what if a thread was already blocked on the mutex and the owner thread had just left? You stole the lock.
That is what usually happens with thin locks: locks get stolen. That means you may not be fair to all threads. That's the reason pthread_mutex_lock is not as simple as a thin lock: it is fair.
But, luckily, you don't usually need to be fair: so, welcome, efficient thin locks.
December 24, 2010 at 6:15 am
I wonder if these thin locks really bring a major performance improvement on Windows. As far as I know, Windows critical sections are implemented in the same way as you described in the article: EnterCriticalSection() checks whether there is any contention, and only if there is does it use a kernel object for synchronization.
January 25, 2012 at 2:58 am
As rride pointed out, I don't think it will improve performance on Windows, since a critical section does just that (using an event only if needed). It even lets you spin for a while, busy waiting, before using the event. If I remember correctly, a futex works similarly, but I'm not sure.
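For reference, that spin knob is exposed through the Win32 API; a minimal sketch (Windows only, with an arbitrary spin count):

    version (Windows)
    {
        import core.sys.windows.windows;

        void example()
        {
            CRITICAL_SECTION cs;
            // Spin up to 4000 times in user space before falling back
            // to the kernel event; 4000 is an arbitrary count.
            InitializeCriticalSectionAndSpinCount(&cs, 4000);
            EnterCriticalSection(&cs);
            // ... protected work ...
            LeaveCriticalSection(&cs);
            DeleteCriticalSection(&cs);
        }
    }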
Microsoft also got a patent for a smarter, cache-aware version useful for NUMA architectures.
November 13, 2013 at 12:15 pm
[…] The cost of synchronization varies depending on how much contention there is. If there are only a few threads modifying a shared list, collisions are rare and the cost of a counter update is just one CAS (Compare And Swap) or an equivalent atomic operation. The overhead is different on different processors, but the important observation is that it's the same overhead as in an efficient implementation of a mutex in the absence of contention (the so-called thin locks or futexes require just one CAS to enter the critical section — see my blog about thin locks). […]