July 2008

Andrei brought up the idea of encoding the sharing of an object between threads in the type of the object. After months of discussions we are still not sure how far we want to go with it. One thing is for sure, letting the programmer mark objects for sharing can help the compiler prevent a lot of concurrency bugs.

One of the common concurrency errors is accidental sharing. Some data structures are designed for multi-threaded access, e.g., objects with synchronized methods; and they usually work just fine (except for deadlocks). The problem is when a chunk of data that was not designed for sharing is accessed by multiple threads. There is no easy way to detect this error since, in general, concurrency bugs are hard to reproduce.

The proposal is to make accidental sharing impossible. This requires that all objects, by default, be thread local. For instance, if you declare a global object and initialize it in one thread, another thread will see a different version of this object. In most cases it will see a null pointer or an unitialized object handle, and you’ll get an easy to reproduce null-reference error.

If you consciously want to share an object, you have to declare it “shared”. This type modifier is transitive (just like const and invariant in the D programming language), so you  can’t have references to non-shared object inside a shared object. It simply won’t compile.

A function may declare its (reference) parameter as “shared”, in which case the compiler won’t let you pass a non-shared object to it. Conversely, if the parameter is declared as non-shared (the default), no shared argument may be passed in its place. There is a guarantee that it will be thread-local. (See however “unsharing”.)

Let me discuss potential objections to this scheme.

The first is performance–not for shared objects, mind you, but for the non-shared ones. Walter tells us that accessing a thread-local static variable adds between 1 to 3 instructions of overhead. That seems quite reasonable. Especially considering that in multi-threaded environment the use of global non-shared variables is rarely correct.

There is also a performance penalty when starting a new thread–all static variables it has access to have to be default-initialized, plus all module constructors have to be called. This might amount to quite a bit. We will recommend not to overuse global variables and module constructors. The way to amortize this cost is to create thread pools.

What about invariant objects (ones that are never modified)? Those can be safely shared, so they must be allocated as not thread-local. It is okay for a shared object to contain references to invariant objects.

Can a shared object be “unshared”? This is a tricky one. There are situations when threads hand over objects to each other. The object is only shared during the handover, but otherwise is accessed by one thread at a time. The currently owning thread should be able to call regular library functions (that don’t expect sharing) with such objects. So we need some kind of share/unshare cast. On the other hand, such cast creates a wormhole into accidental sharing. There is an interesting SharC paper that discusses runtime techniques to make “unsharing” safe. Safe casting from temporarily non-shared to shared is even more tricky. I’ll talk more about it in my next post.

Finally, there is an unexpected bonus from this scheme for the garbage collector. We will be able to use a separate shared heap (which will also store invariant objects), and separate per-thread heaps for non-shared objects. Since there can’t be any references going from the shared/invariant heap to non-shared ones, per-thread garbage collection will be easy. Only occasional collection of the shared heap would require the cooperation of all threads, and even that could be done without stopping the world.

I’ve been working recently on improving the performance of locking in D (the D programming language). At this moment the synchronized statement in D simply uses the underlying OS facilities–CriticalSection on Windows and pthread  mutex on Linux. This incurs a substantial performance hit for multithreaded programming.

I looked at research in the Java community and found some interesting solutions. They are all versions of Thin Locks, first described by Bacon et al. These guys figured out that when a program is entering a synchronized section, in about 80% of the cases the section is not locked by anybody. The next most common case is nested locking–the section is locked by the same thread recursively. Only rarely there is actual contention for a lock.  And when that happens, it’s very likely that it will happen again and again. Such contended lock is most likely part of a shared data structure designed to be accessed by multiple threads.

Thin Locks optimize the most common case of no contention. Every Java Object has a word in its header that is used as a thin lock. If this field is zero (I’m skipping details here), the object is not locked. When a thread enters a synchronized section, it optimistically assumes that the lock is zero and tries to flip it to a non-zero value. This is done in one atomic operation, Compare And Swap (CAS). Most processor either have such an instruction built in, or provide primitives to implement it.

CAS checks the value in memory, comparing it to the “expected value.” If the comparison succeeds, the “new value” is written in its place. In our case, we expected the value of the thin lock to be zero, and the new value we want to put there is the (non-zero) thread ID of the lock taker. If the CAS succeeds, we are done, we owne the lock.

If the CAS fails, the lock has already been taken. Again, the most likely case is that our own thread holds it. So we check if our thread ID is stored in the thin lock. If so, we increment the count field of the thin lock (several fields are cleverly mashed together into one word). We’re done!

If we don’t find our own thread ID in the thin lock, we know we have contention. We have to inflate the lock–allocate an additional object (the fat lock) that holds the general-purpose OS-based mutex. Of course we also check if the thin lock hasn’t already been inflated, in which case we just lock the fat lock.

The inflation process is a little tricky–we can’t just modify the thin lock while it’s bein held by another thread. Instead we spin wait for it to be released, then try to acquire it ourselves, and then inflate it. Once inflated, the lock remains inflated forever; which usually is the right thing to do anyway, since a lock that’s been contended once, is likely to be contended many times. The one-time spinning is amortized across many accesses.

Of course there are details that I’ve omitted, but I gave you the gist of the algorithm.  It makes un-contended locking so cheap (one CAS) that in most cases there is no reason to implement two versions, one for single- and one for multithreaded use, of the same data structure. A single-threaded program doesn’t have to pay the multithreaded penalty when using general-purpose libraries.