I’ve been working recently on improving the performance of locking in D (the D programming language). At this moment the synchronized statement in D simply uses the underlying OS facilities–CriticalSection on Windows and pthread  mutex on Linux. This incurs a substantial performance hit for multithreaded programming.

I looked at research in the Java community and found some interesting solutions. They are all versions of Thin Locks, first described by Bacon et al. These guys figured out that when a program is entering a synchronized section, in about 80% of the cases the section is not locked by anybody. The next most common case is nested locking–the section is locked by the same thread recursively. Only rarely there is actual contention for a lock.  And when that happens, it’s very likely that it will happen again and again. Such contended lock is most likely part of a shared data structure designed to be accessed by multiple threads.

Thin Locks optimize the most common case of no contention. Every Java Object has a word in its header that is used as a thin lock. If this field is zero (I’m skipping details here), the object is not locked. When a thread enters a synchronized section, it optimistically assumes that the lock is zero and tries to flip it to a non-zero value. This is done in one atomic operation, Compare And Swap (CAS). Most processor either have such an instruction built in, or provide primitives to implement it.

CAS checks the value in memory, comparing it to the “expected value.” If the comparison succeeds, the “new value” is written in its place. In our case, we expected the value of the thin lock to be zero, and the new value we want to put there is the (non-zero) thread ID of the lock taker. If the CAS succeeds, we are done, we owne the lock.

If the CAS fails, the lock has already been taken. Again, the most likely case is that our own thread holds it. So we check if our thread ID is stored in the thin lock. If so, we increment the count field of the thin lock (several fields are cleverly mashed together into one word). We’re done!

If we don’t find our own thread ID in the thin lock, we know we have contention. We have to inflate the lock–allocate an additional object (the fat lock) that holds the general-purpose OS-based mutex. Of course we also check if the thin lock hasn’t already been inflated, in which case we just lock the fat lock.

The inflation process is a little tricky–we can’t just modify the thin lock while it’s bein held by another thread. Instead we spin wait for it to be released, then try to acquire it ourselves, and then inflate it. Once inflated, the lock remains inflated forever; which usually is the right thing to do anyway, since a lock that’s been contended once, is likely to be contended many times. The one-time spinning is amortized across many accesses.

Of course there are details that I’ve omitted, but I gave you the gist of the algorithm.  It makes un-contended locking so cheap (one CAS) that in most cases there is no reason to implement two versions, one for single- and one for multithreaded use, of the same data structure. A single-threaded program doesn’t have to pay the multithreaded penalty when using general-purpose libraries.