
Why const?

With judicious use of const you can catch many bugs and improve code documentation and maintenance. For me the classic example is reading from and writing to files. Look at these two interfaces in stylized C++:

bool Read(File & file, char * buffer, unsigned & size);
bool Write(File & file, char const * buffer, unsigned const & size);

The first one tells me that the buffer may be modified by the function. In fact, the Read function will most likely write characters into it. It is also allowed to modify the parameter size, putting the number of characters read into it. The incorrect call

Read("abc", 3);

will not compile.

The second declaration promises that the buffer will not be modified and neither will size. So the following (correct) call will compile:

Write("abc", 3);

What’s more, when compiling the implementation of Write, the compiler will ensure that the arguments are not modified (unless the programmer subverts the type system using const_cast).

What does const guarantee? Not much!

This optimistic picture is marred by a few gotchas. You’d think that if you call a function taking a const reference, the object you are passing won’t change (assume there’s no use of const_cast or mutable members). Think again!

  1. In a multithreaded program, another thread might change the object concurrently. You may avoid this problem by preventing concurrent access using critical sections.
  2. But even in a single-threaded program there is no guarantee the const object won’t change. That’s because there might be a non-const alias to the argument you’re passing. This alias might be accessible to the function either through a global variable or through another argument. The classic example is copying elements between overlapping ranges. The source for the copy is passed as const, the target as non-const, but if they point at overlapping regions, the source may end up modified. (A short sketch follows this list.)

    There is a stronger version of constness that might guarantee immutability. The qualifier is usually called immutable and it’s available in the D programming language and in some dialects of Java. I’ll come back to it later.

  3. On top of those gotchas, even in a single-threaded C++ program with no aliasing of arguments, a const function argument might get mutated. That’s because constness is shallow–it doesn’t protect indirect data members (pointers or references). See next section.
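To make the aliasing problem from point 2 concrete, here’s a minimal sketch. I’m writing it in D, whose const carries the same caveat, and the function name copyInto is made up for illustration. The parameter src is const, yet the function can watch it change, because dst aliases the same memory:

import std.stdio;

// src is a const view, dst a mutable one; nothing stops them from overlapping
void copyInto(const(int)[] src, int[] dst)
{
    foreach (i; 0 .. dst.length)
    {
        dst[i] = 42;          // a write through the mutable alias...
        writeln(src[i]);      // ...is visible through the const view
    }
}

void main()
{
    auto buf = [1, 2, 3, 4];
    copyInto(buf, buf);       // overlapping (here identical) ranges
}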

Transitivity or “deep” const

Consider a list that is defined recursively:

class List {
public:
  List * GetNext() const { return _next; }
  void SetNext(List * next) { _next = next; }
private:
  List * _next;
};
void clip(List const * list) {
  List * next = list->GetNext();
  if (next)
    next->SetNext(0);
}

Function clip takes a const list and blatantly modifies it. That’s because the indirect member, _next, of a const List is not const. C++ constness is not transitive! This fact is reflected in GetNext being a const method and returning a non-const pointer to _next. So even if this is const inside GetNext, indirect parts of this are open to modification.

I always found the non-transitive definition of const counter-intuitive, and made sure that in my code constness was always transitive. For instance, I would implement GetNext to return a const pointer:

List const * GetNext() const { return _next; }

and maybe provide another non-const method that returns a non-const pointer. Notice the need for code duplication in this approach.

The depths of immutability

I have an even larger problem with non-transitive immutability (as opposed to constness). I expect a const method of an immutable object to leave the object unchanged. In particular I expect such a method to behave like a pure function (not to be confused with a pure virtual function). A pure function returns the same value every time it’s called with the same arguments. I expect the length of an immutable string to never change. Every call to size should return the same value. I would even expect the compiler to eliminate redundant calls to size in such a case. Here’s a hypothetical example:

immutable string("Hello!");
doSomething(string.data(), string.size());
doSomethingElse(string.data(), string.size());

A smart compiler could cache the results of data and size from the first set of calls and reuse them when calling doSomethingElse. However, if immutable is not transitive, such an optimization is incorrect. Just imagine a bizarre implementation of string that stores the length indirectly:

class string {
public:
  string(char const * src) {
    _length = new unsigned;
    *_length = strlen(src);
    ...
  }
  unsigned size() const {
    ++*_length;
    return *_length;
  }
private:
  unsigned * _length;
};

The D-language model

Taking all the above arguments into account, we decided to implement both const and immutable as transitive type modifiers in the D programming language. This enabled a lot of interesting compiler optimizations.

Making const and immutable transitive is essential for D’s support for functional-style programming. It’s important that const methods of an immutable object be pure functions. It’s also important to let the compiler optimize functional-style code, which otherwise tends to be less efficient than its imperative-style equivalent.

Functional programming is one of the safest paradigms for concurrency. Pure functions and immutable objects don’t require locking.

The price of transitivity

Having said that, transitivity of const imposes some constraints on the programming style. It enforces the view that anything reachable from an object is part of the object. When you are given a const linked list, all links in it must necessarily be const. That makes logical sense–the links are part of the list.

The picture gets muddier when the object contains references to some external services. Just by calling them “external,” I’m admitting that they are not part of the object.

Can an object that contains a Windows handle be const? Is it enough that it doesn’t modify the handle itself? Or shouldn’t it also refrain from modifying the window or the file behind the handle? Handles are often implemented as opaque pointers so, strictly speaking, constness should propagate through them. But how do I know which APIs don’t mutate the state of Windows and which do? This is an extreme example, but there are many in-between cases.

In an editor you might have an undo stack that stores editing commands. When you call the undo method of the command object, you may pass it a reference to the document object on which it is to operate. In that case undo may be declared as const. But if you store a reference to the document within each command, the undo method can no longer be const–it modifies the document. The classic Command Pattern uses the latter approach.

It hasn’t been decided yet if and how D will support an escape mechanism from const transitivity (and constness in general). Following C++, D could use the mutable type qualifier for that purpose. This would also solve the problem of memoizing and caching inside a const object.

In the next installment, I’ll describe more complexities of immutability and discuss a dialect of Java that supports immutability constraints.



In the previous two installments I tried to explain the origins of rvalue references in C++. If you were left with the impression that each design step was a reaction to some previously unanticipated problem, you’re right. As it often happens in C++, the design was reactive rather than proactive.

Passing by value

If you think about it long enough, you realize that the process that led to rvalue references started with the way C++ implemented passing by value. In C, passing by value was simple–a shallow copy of data was made. “Shallow” means no pointers were followed in the process of copying. C++ decided to provide hooks for the programmer to override this behavior. If you define a copy constructor, it will be used instead of the shallow copy. A user-defined copy constructor may, for instance, follow pointers or references and clone the pointed-to objects, thus implementing “deep copy”. It may also do bookkeeping–count references, etc.

Whenever the compiler needs to copy an object–and that may happen during return by value or pass by value–it inserts the appropriate call to the copy constructor (if one is not provided by the programmer, the default shallow copy is used).

This is a very general mechanism, which however leads to some awkwardness and inefficiencies. For instance, the writer of a copy constructor must remember to copy all of the shallow data members–something that the default shallow copy does automatically. The inefficiencies show up when copying rvalues.

Return value optimizations

Let’s revisit the example from my previous post:

 auto_ptr<Foo> factory() {
    return auto_ptr<Foo>(new Foo);
 }
 // client code
 auto_ptr<Foo> ap = factory();

Here, a temporary auto_ptr is created inside the function, copied using a copy constructor into the caller’s auto_ptr, and then immediately destroyed.

This is so obviously inefficient that the compiler will often optimize it away by constructing the return auto_ptr directly in the space allocated by the caller (a pointer to which is passed as a hidden argument to the function). This is called return value optimization (RVO) and it eliminates the call to the copy constructor and to the destructor.

But what about more general cases, like this one:

 auto_ptr<Foo> factory() {
    auto_ptr<Foo> src;
    ...
    src.reset(new Foo);
    ...
    return src;
 }
 // client code
 auto_ptr<Foo> ap = factory();

Some compilers will attempt to optimize this code as well, using the named return value optimization (NRVO).

The C++ Standard allows such optimizations, which elide the calls to a copy constructor and a destructor, but it doesn’t mandate them. Moreover, NRVO cannot be applied to functions that have multiple return statements, with different objects being returned from each.

But notice an interesting thing: If (N)RVO were mandated by the language and applicable in every case, we could almost implement auto_ptr without the rvalue reference copy constructor. We wouldn’t need to reach back at the source auto_ptr from the copy constructor to turn off its destruction. The destructor (together with the copy constructor) would be guaranteed to be elided.

In fact, if we don’t insist on using the caller’s space for creating the return value, we could cover the case of multiple return statements. And how would we transfer bits between the two spaces? We’d do a shallow, bitwise copy!

An alternative return-by-value scheme

Here’s the scheme that was proposed for the D programming language (a hand-written lowering is sketched after the list):

  1. The compiler adds a hidden parameter to every function that returns a non-trivial data structure by value. This parameter is the address of the stack space allocated by the caller for the return value (a temporary space for the rvalue returned by the function). This part is in common with the C++ implementation.
  2. The function allocates its local variables on the stack, as usual, including the ones that are (or might be, in the case of multiple return statements) returned. RVO may be used to eliminate some of these allocations. This is still in common with C++.
  3. When executing the return statement, the value to be returned is bit-copied into the caller’s space.
  4. No copy constructor is called. No destructor is called for the variable that’s been returned.
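To make the four steps concrete, here is a hand-written, hypothetical lowering of a by-value return in D. This is only an illustration of what the compiler would generate; the names and the explicit pointer parameter are mine:

struct S { int a, b; }

// step 1: the hidden parameter is the address of the caller's temporary space
void factory(S* retval)
{
    S src = S(1, 2);   // step 2: a regular local on the function's own stack
    *retval = src;     // step 3: the return statement bit-copies into the caller's space
                       // step 4: no copy constructor, no destructor call for src
}

void caller()
{
    S x = void;        // space allocated by the caller
    factory(&x);       // the "hidden" argument, made explicit here
}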

Look ma! No copy constructor!

What is amazing about this scheme is that it may be applied to all situations where the source of the copy is an rvalue. This is also true when passing an rvalue to a function, as in this example:

  struct S;
  void f(S s); // by value
  // caller's code
  f(S(1, 2)); // passing an rvalue

Suddenly there is no need whatsoever for an rvalue-taking copy constructor because it would never be called. Since an rvalue-taking copy constructor was the main motivation for introducing rvalue references to C++, in D we can get away without them.

It takes some getting used to thinking of passing rvalues without calling a copy constructor. Notice however that, even in C++, there is no guarantee that the copy constructor will be called (remember RVO?).

But what about non-trivial copy constructors, like in ref_ptr (a reference-counting smart pointer)? It must be called to maintain the reference count, right? Not when the source is an rvalue! Remember that we are eliding not only the copy constructor (which increments the reference count) but also the destructor of the source (which decrements the reference count). The net result is no change to the reference count.

This is a very important optimization, especially in multithreaded code, when reference counting incurs expensive synchronization overhead.

Location transparency

There is one gotcha in this scheme–we are assuming that a shallow copy will always result in a valid object. This is not true if the object contains a pointer to itself. A shallow copy will duplicate the pointer, and the object itself will move to a new location, but the pointer will still point at the old location. This could be a disaster.

For this and several other reasons D assumes location transparency. An object cannot point at itself (or any of its shallow members), because then it couldn’t be transparently moved to a new location.
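Here’s a small D illustration of the failure mode (the struct and its fields are hypothetical); this is exactly the kind of internal pointer that location transparency rules out:

struct SelfRef
{
    int value;
    int* cached;   // intended to point at this object's own value field
}

void main()
{
    SelfRef a;
    a.value = 42;
    a.cached = &a.value;            // the object now points into itself

    SelfRef b = a;                  // a plain bitwise copy (or move)
    assert(b.cached !is &b.value);  // the pointer still refers to a, not to b
}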

Unfortunately this requirement cannot be enforced statically. A dynamic check during copying is possible, but it’s not clear that the overhead is always worth paying. This is still an open design area.

Notice that, in D, the scope of this problem is narrowed to structs and fixed arrays that contain pointers. There is no direct way to manipulate the location of class objects (they are always accessed indirectly)–except by the garbage collector. A generational garbage collector that moves objects in memory could definitely profit from location transparency.

The binding table

The binding table in D is slightly different than the one in C++:

  • references bind to lvalues (same as C++)
  • const references bind only to lvalues (different)
  • “values” (arguments passed by value) bind to
    • lvalues (same as C++)
    • rvalues (same as C++, except that no copy constructor is called)

Notice that in D we don’t want a const reference to bind to an rvalue. This is an inherently dangerous feature of C++. Consider the following C++ example:

  struct S;
  const S& f(const S& x) {
     return x;
  }
  // caller's code
  const S& y = f(S(1, 2));
  y.access(); // kaboom!

Instead, if you want to have a version of a function that binds to rvalues, you declare its arguments as passed by value, as in this D example:

  S f(S x) {
    return x;
  }
  // caller's code
  auto x = f(S(1, 2));
  x.access; // fine, it's a copy

In the worst case, the overhead of the D solution is a bitwise copy of a struct (which can often be optimized away) against the indirect access of the C++ solution. But considering that it buys us safety, it’s a trade-off well worth making.

I’d like to thank Andrei Alexandrescu for his comments and, in general, for driving the design of pass-by-value in D.


With the multicore explosion in the making, are we going to be running hundreds of thousands of threads in a single program? Erlang programmers would emphatically say, yes! C++ and Java programmers would probably say no.

Why this discrepancy?

The thread model based on heavy-duty OS threads and mutexes has its limitations. You can ask server writers or Google for “thread per connection” to convince yourself. Servers use thread pools exactly because of that.

Thread pools are an admission of defeat for the thread model. Having to pool threads tells you that:

  • Thread creation is not fast enough
  • Threads’ consumption of resources is substantial, so it makes sense to keep their numbers down

Granted, these are technical problems that might be overcome in the future by improvements in operating systems.

The more fundamental problem with threads has its root in memory sharing. It seems like sharing offers great advantage in terms of performance, but sharing requires locking. It’s a well-known fact that locking doesn’t scale (or compose). Between races and deadlocks, it’s also extremely hard to get right.

Here’s what Erlang does

Erlang gives up on sharing!

Threads that don’t share memory are called processes. We tend to think of processes as heavyweight beasts implemented by operating systems. That’s because one needs the operating system to strictly enforce the no-sharing policy (the isolation of address spaces). Only the OS can manage separate address spaces and the passing of data between them.

But that’s not the only way. The isolation might instead be enforced by the language. Erlang is a functional language with strict copy semantics and with no pointers or references. Erlang processes communicate by message passing. Of course, behind the scenes, messages are passed through shared memory, thus avoiding a large performance hit of inter-process communication. But that’s invisible to the client.

Erlang rolls out its own threads!

The Erlang interpreter provides lightweight processes (so lightweight that there’s a benchmark running 20 million Erlang processes).

And there is a bonus: Erlang code that uses lightweight processes will also work with heavyweight processes and in a distributed environment.

Why don’t we all switch to Erlang?

As far as I know there are two main reasons:

  • It’s a weird language. Don’t get me wrong, I love functional programming for its mathematical beauty, terseness, and elegance. But I had to rewire my brain to be able to write pure functional programs. The functional paradigm is as alien to our everyday experience as quantum mechanics. (CS grad students: you’re not typical programmers.)
  • Messages have to be copied. You can’t deep-copy a large data structure without some performance degradation, and not all copying can be optimized away (it requires behind-the-scenes alias analysis). This is why mainstream languages (and I will even include Scala in this category) don’t abandon sharing. Instead they rely on the programmer’s discipline or try to control aliasing.

Conclusions

Native threads are expensive. Interpreted languages, like Erlang, can afford to implement their own lightweight threads and schedulers. I don’t see that happening in compiled languages. The hope is that operating systems will improve their implementations of threads–I hear Linux’s NPTL is already a big improvement in this area. In the meanwhile, we have to rely on thread pools.

The shared-memory concurrency model is the reason why multithreaded programming is so difficult and error-prone. Separating address spaces simplifies programming, but at a performance cost. I believe that some kind of compromise is possible. A message-passing or an actor model can work with sharing, as long as aliasing is under control.

Inspiration for this post

After my last post on thin lock implementation, I was asked why I used such a small number, 1024, for the maximum number of threads. It was actually a number I found in the D compiler runtime. A thin lock could easily work with a much larger number of threads. In fact I’m going to substantially increase the number of bits for thread ID at the expense of recursion count. But this question got me thinking about scalability of threading in general.

Fibers

Fibers (as well as Java’s green threads) are not an alternative to heavyweight threads. Fibers can’t run in parallel, so they have no way to use multiple processors. They are more of a flow-of-control construct, often used interchangeably with coroutines.



Previously I have described a thin lock as a very efficient implementation of an object monitor. It’s time to flesh out the design. Since most of the thin lock implementation is itself lock free, all kinds of multicore subtleties come into play. I did my best to analyze every step, but I’m only human. The code hasn’t yet been written in D, much less tested.

Object Layout

Every object (instance of a class) in the D programming language inherits from Object. Object contains a hidden header consisting of two fields:

  • pointer to vtable
  • pointer-sized thin lock

I’m not adding any new fields, just replacing an existing pointer field with the thin lock. The pointer used to point to a lazily evaluated Monitor object.

The thin lock is a union of a bit field and a pointer to a FatLock struct. The union is discriminated by the two lowest bits–if they are zero, it’s a pointer. A null pointer has special meaning–the object can never be shared (it was created as non-shared). See sharing and unsharing in D.

Here’s the thin lock bit-field layout (the low 32 bits; the whole field is either 32 or 64 bits wide, depending on the native pointer size):

  • thread index (11 bits) – one-based index into the global thread table
  • recursion counter (19 bits) – used for recursive locking by the same thread
  • currently shared (1 bit) – set during the creation of a shared object; it can be turned off and back on with the help of sharing casts
  • created shared (1 bit) – set only if the object was created as shared

The D runtime has a fixed-size global array of threads. An index into this array is used to identify a thread. This is different from the operating-system-defined thread ID.
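Expressed as shift-and-mask constants, the layout might look like this (the identifiers are hypothetical, not the actual druntime names):

enum : uint
{
    CREATED_SHARED   = 1 << 0,            // object was created as shared
    CURRENTLY_SHARED = 1 << 1,            // sharing has not been cast away
    COUNT_SHIFT      = 2,                 // 19-bit recursion counter starts here
    COUNT_INCREMENT  = 1 << COUNT_SHIFT,
    THREAD_SHIFT     = 21,                // 11-bit thread index starts here (2 + 19)
    THREAD_MASK      = 0x7FFu << THREAD_SHIFT,
}

// The compact thread ID used by the locking algorithm below:
// the thread index shifted into place, with both sharing bits set.
uint makeCompactThreadId(uint threadIndex)
{
    return (threadIndex << THREAD_SHIFT) | CURRENTLY_SHARED | CREATED_SHARED;
}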

Sharing policy

To share an object between threads, it must be created as shared (possibly allocated from a separate, shared heap). Any global object that is not declared shared is only accessible through thread-local handles.

A shared object’s thin lock (which hasn’t been inflated to a fat lock, see later) has the two lowest bits set. Sharing can be cast away, at which point the currently-shared bit is cleared, but the created-shared bit is still on. The object can then be cast back to shared; however, if the created-shared bit is not on, such a cast will throw an exception.

This guarantees that an object that was not created for sharing can never be cast to shared.

When sharing is cast away, the casting thread’s identifier is remembered in the thread index field. The object becomes exclusively owned by the current thread (essentially, locked by it). Any attempt by a different thread to lock such an object will result in an exception.
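In code, the round trip described above might look like this (Foo is a placeholder class; the runtime checks and exceptions are part of the proposal, not of the cast syntax itself):

class Foo { int n; }

void handOver()
{
    shared Foo f = new shared(Foo);   // created shared: both low bits set
    Foo g = cast(Foo) f;              // cast sharing away: exclusively owned by this thread
    // ... use g with code that knows nothing about sharing ...
    shared Foo back = cast(shared) g; // cast back: allowed because it was created shared
}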

Locking a never-shared object

An object that was created as non-shared (the default) has zero in its thin lock. This state is permanent and not reachable from any other state.

Originally I thought that the comparison (thinlock == 0) didn’t require any synchronization. I totally spaced out on publication safety (thank you, Andrei, for a reality check!). In general, this test has to be preceded by a read fence. Fortunately, on an x86, publication safety is guaranteed without fencing, so at least on that platform, the check is lightning fast. Because the thin lock is co-located with the rest of the object–next to the vtable pointer–even the overhead of fetching it from main memory is rarely incurred.

Therefore the incentive for writing two versions of a class–one with synchronized, and the other with non-synchronized methods–is practically absent in D.

I mentioned before that the D compiler might be able to elide synchronization of non-shared objects altogether. If that happens, the testing of the two sharing bits in the thin lock would be redundant. There is however a code-size/performance trade-off between the two solutions.

Locking algorithm

- Test thin lock for zero (on an x86, without any synchronization, otherwise precede it with a read fence). If zero, return. This is the most common case: the object was not created for sharing and will never be accessed by another thread.

- Fetch current compact thread ID, which is conveniently stored in thread-local memory. Compact thread ID is pre-calculated for each thread using an 11-bit thread index (see above) shifted left by 21 and OR’ed with 3 (the lowest two bits are set). This is an optimization. Notice that an XOR with a compact thread ID flips the two lowest bits of the thin lock, which makes subsequent tests look logically inverted.

- XOR thin lock with this ID. If the result is 2 (remember the inversion), return. This is the case when the object was created shared, but was cast to non-shared by the current thread. This operation doesn’t require any synchronization, because only the current thread could cast the object back to shared.

if ((thinlock ^ compactThreadId) == 2)
  return;

- If we are past this point, we know that the object is shared, or we are attempting (incorrectly) to lock an object that is exclusive to (cast to non-shared by) another thread.

- Perform a CAS (atomic Compare And Swap operation) on thinlock. The value we are expecting is 3 (two lowest bits set and nothing else), the replacement value is the compact thread ID of the current thread. This operation will succeed in the next most common case–the object is shared but is not currently locked. The resulting thin lock state has the thread index filled with a non-zero value, and the two lowest bits set (see the description of compact thread ID). This state is interpreted as “locked once by a given thread”.
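A hedged sketch of this step, using druntime’s core.atomic.cas; the helper name and the simplified field access are mine, and the real runtime code will differ:

import core.atomic : cas;

bool tryLockUncontended(shared(size_t)* thinlock, size_t compactThreadId)
{
    // expected value 3: both sharing bits set, no owner, no recursion count;
    // replacement: the caller's compact thread ID ("locked once by this thread")
    size_t unlockedShared = 3;
    return cas(thinlock, unlockedShared, compactThreadId);
}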

- If the CAS fails (thinlock didn’t have the expected value and the swap did not occur), we know that the object is locked (or the thin lock has been inflated to a fat lock, see later).

- Try the next most common case–the object is locked by the current thread (recursive locking). First XOR the value of the thin lock with the current compact thread ID to isolate the count field. If the result is less than the maximum count shifted left by 2 and the two lowest bits are zero (meaning they were set before the XOR), then increment the count. These operations don’t require any synchronization, because they only succeed when the lock is owned by the current thread.

uint tmp = thinlock ^ compactThreadId;
if (tmp < MAX_COUNT_MASK && (tmp & 3) == 0)
{
  thinlock += COUNT_INCREMENT;
  return;
}

- Check for the error of trying to lock an object that is owned exclusively by another thread.

if ((thinlock & 3) == 1)
  throw new ExclusiveLockViolation;

- Check if the lock has been inflated (the two lowest bits are zero). If so, interpret the thin lock as a pointer to FatLock and lock it. Fat lock is implemented using the operating system locking primitives and can deal with contention. Return.

- If we reach this point, we know that there is contention for the lock or the count has overflowed. In either case we have to inflate the lock. We have to preserve one invariant–the lock can only be modified by the thread that holds it, otherwise we are open to all kinds of races. Therefore we have to busy-wait for the lock to be released.

while (thinlock != 3 && (thinlock & 3) != 0)
  compiler_fence();

Notice that busy waiting requires that the compiler not optimize this loop–we need a “compiler fence”. However, no processor memory fences are required, because there is no ordering dependence. It’s enough that, when another thread modifies the lock, the new value eventually becomes visible to the spinning thread.

It’s possible to miss the unlocked state and spin longer than necessary. Starvation is theoretically possible if the other thread keeps unlocking and locking without discovering contention, with the current thread repeatedly missing the unlocked state. (This part of the algorithm may be optimized further by introducing exponential backoff.)

- The loop is exited if either the lock has been released (thinlock == 3) or another thread managed to inflate the lock (the two lowest bits are zero, signifying that the value stored is a pointer to FatLock).

- Try to acquire the lock using CAS (the arguments are the same as in the original attempt).

- If it succeeds, allocate the fat lock, lock it, and atomically store the pointer to it in the thin lock. Return. Once the lock has been inflated, it will remain so for the lifetime of the object.

- If we reach this point, we know that either the lock has been inflated, another thread is in the process of inflating it, or another thread has acquired the lock without contention while we were busy waiting.

- Try again: go back to busy waiting.

Unlocking

When unlocking, we have the guarantee that the current thread owns the lock, so we don’t need any additional synchronization. (A sketch of these steps in code follows the list below.)

  • If the thin lock is zero, return. The object was not created for sharing.
  • XOR thin lock with the compact thread ID.
  • If the result is 2, we own the object exclusively. Return.
  • If the result is zero, we hit the next most common case–the lock has been taken once by the current thread. Store 3 in the thin lock and return.
  • If the result of XOR is non-zero and its lowest two bits are clear, decrement the recursion count.
  • Otherwise, the lock has been inflated. Unlock the fat lock.
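Here is the promised sketch of those steps. The constant and helper names are hypothetical, and the fat-lock branch is left as a stub:

enum COUNT_INCREMENT = 1 << 2;     // recursion count lives above the two flag bits

void unlockFatLock(size_t thinlock) { /* delegate to the OS mutex; omitted */ }

void unlock(ref size_t thinlock, size_t compactThreadId)
{
    if (thinlock == 0)
        return;                          // never-shared object
    size_t tmp = thinlock ^ compactThreadId;
    if (tmp == 2)
        return;                          // exclusively owned by this thread
    if (tmp == 0)
    {
        thinlock = 3;                    // locked once by us: back to "shared, unlocked"
        return;
    }
    if ((tmp & 3) == 0)
        thinlock -= COUNT_INCREMENT;     // recursive lock: decrement the count
    else
        unlockFatLock(thinlock);         // inflated: unlock the fat lock
}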

FatLock must also contain a field exclusively-owned-by. This field is filled with a compact thread ID when sharing is cast away. When sharing is re-established, this field goes back to zero. Therefore, after locking the fat lock, additional checking is done: If the field is non-zero and the result of XOR with the current compact thread ID is also non-zero, an exception is thrown (attempt to lock an exclusively owned object).


I already mentioned the paper, SharC: Checking Data Sharing Strategies for Multithreaded C, by Anderson et al. The authors describe a strategy for checking multithreaded correctness of C programs. They were able to classify sharing modes into five categories:

  • Private (to current thread)
  • Read only
  • Shared under a specific lock (the lock is part of the type)
  • Racy (no checking)
  • Dynamically checked (to be either read-only or private)

The programmer makes strategic annotations by adding sharing type qualifiers to declarations of shared data. The SharC tool then derives the rest, and flags inconsistencies. There is also a run-time component for checking dynamically shared data and allowing safe casting between sharing modes.

You might notice some similarities to the D sharing model. For one, SharC assumes that all unannotated sharing is unintended and treats it as an error. In D, if you don’t annotate something as shared, it will be allocated from a thread-private pool and will be invisible to other threads.

The annotation system of SharC was not designed to be very practical. The authors didn’t expect the programmer to precisely annotate all shared variables–it’s clearly too much work. Instead they fell back on global program analysis, which is quite expensive and requires access to all sources.

In D we’ll have to make a few compromises to get maximum benefit from the type system without over-burdening the programmer with tedious annotations. (Making the right trade-offs is the hardest part of language design. You know you got it right when half of the programmers hate you for making it too strict, and the other half for making it too relaxed.) D reduces sharing annotations to just two type qualifiers: shared and invariant. I talked about shared in one of my previous posts. The invariant type modifier is already well defined and in use in D 2.0.

The most interesting part of any sharing scheme based on types is the transitions between modes. A common example where such transitions might be desirable is the producer-consumer queue. Objects may be passed between threads–through the queue–either by value or by reference. When they are passed by reference, they have to be shared–multiple threads may access them concurrently. However, once the consumer gets exclusive access to such an object, she might want to treat it as non-shared. Why? Because she might want to pass it to a library function, which was not designed to deal with shared objects.

For obvious reasons we don’t want conversions between shared and non-shared data to be implicit. That would pretty much defeat the whole scheme. So we are left with explicit casting. Here’s my current thinking (which hasn’t been peer-reviewed yet).

There are two types of casts, unchecked and checked. An unchecked cast always succeeds (assuming it compiles); a checked one might throw an exception. C++ dynamic_cast is checked. So is D’s cast when applied to class objects. There are unchecked casts for numeric types in D, but the standard library provides a checked conversion template, to (in the module std.conv), which checks for over- and underflow.
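For example (std.conv.to is in today’s standard library; the numbers are arbitrary):

import std.conv : to, ConvException;
import std.stdio;

void main()
{
    int big = 70_000;
    auto truncated = cast(short) big;   // unchecked cast: silently wraps to 4464
    writeln(truncated);

    try
    {
        auto checked = big.to!short;    // checked conversion: detects the overflow
    }
    catch (ConvException e)
    {
        writeln("overflow caught: ", e.msg);
    }
}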

How can you check a cast that strips the shared modifier–i.e., privatizes the object? SharC offers one solution–reference counting.

You can be 100% sure that no other thread has access to your shared data when you can prove that you have the only reference to it. SharC’s run-time uses a very clever reference counting scheme borrowed from a garbage collector to accomplish just that. Could we do it in D? We probably could, if we committed to the reference-counting GC, but that’s rather unlikely.

What’s the next best thing? Locking the object! This will only work for class objects, which have a built-in monitor but, at least in the SafeD subset, we expect class objects to play a major role. Locking the object when privatizing it has several advantages.

  • It will fail if the object is already locked by another thread
  • If another thread still has shared access to it after privatization, and tries to lock it, it will fail
  • The reverse operation, casting an object back to shared, can be checked.

Most of this can be easily accomplished by slightly modifying thin locks (see my previous post). I’ll provide more details later.

Now, let’s talk about casting back to shared (I’m using “back” deliberately, as we don’t want to share objects that were created as non-shared). This cast is much trickier, since a lot of bad things might have happened while the object was unshared. We can check for some of them and trust the programmer not to do others.

We have to trust the programmer not to squirrel away non-shared aliases to a temporarily unshared object (or its internals). Such aliases become unprotected back doors to the object when it becomes shared again. Remember that, even if you call a synchronized method on an object that is not declared as shared, the synchronization is statically elided by the compiler (at least that’s the plan).

Another danger is that non-shared objects may be inserted into a temporarily unshared object. We could check for that during the cast if we used a different heap for allocating shared objects. We could ask the garbage collector whether a given pointer points into the shared heap or not. For class objects, this check can be made much faster by testing a special bit in the thin lock.

Checked sharing casts in both directions have to be recursive, since the shared qualifier is transitive. When casting from shared to non-shared, each class object must have its thin lock put in the “exclusively owned by current thread” state. When casting back to shared, each class object must have its thin lock put back in the sharing state, and all pointers must be checked against the shared heap.

Conveniently enough, in D, such checked casts can be implemented in the library using unchecked casts and reflection.


Publication safety is the core issue in the famously non-intuitive Double-Checked Locking Pattern.

What’s publication? In a nutshell, one thread prepares data and publishes it–other threads check if the data has been published and use it. A common scenario is the creation of a shared object (this example is written in the D programming language, but it’s pretty self-explanatory).

shared Foo foo = new shared Foo();

When a thread creates an object, it first runs its constructor (Foo()) and then points the shared handle (foo) to it. Other threads check the handle for non-null and then happily access the object.

if (foo !is null) foo.doSomething();

Naturally, in our naivete, we tend to assume that if the second thread can see a non-null handle, the construction of the object must have completed. That belief is known as publication safety and, guess what, it’s not guaranteed on modern multi-processors that use relaxed memory models.

To understand what’s happening, let’s simplify the problem even further and write it in pseudo assembly. Initially the globals x and ready are zero. R is a thread-local variable (register). Think of writing to x as part of the construction of an object and writing to ready as the publication (the initialization of a shared handle).

Thread 1:
  x = 1
  ready = 1

Thread 2:
  if ready == 1
    R = x

Can Thread 2 see ready == 1 and x == 0? Yes, for two separate reasons. On a relaxed-memory-model multiprocessor

  1. writes to memory can be completed out of order and
  2. reads from memory can be satisfied out of order.

Imagine processors sending e-mail messages to memory. Thread 1 sends a message instructing the memory to write 1 to x. Then it sends another message instructing it to write 1 to ready. It’s perfectly possible on modern processors that the first message gets delayed and the write to ready completes before the write to x.

The way to make sure this doesn’t happen is to separate the two writes by a memory barrier, or fence. Every relaxed-memory-model multiprocessor offers some ways to do it. The x86’s (x > 3) have such instructions (mfence, lfence, and sfence), even though they implement a processor-order memory model.

But beware, even if the writes are ordered by a (write) fence, the reads in Thread 2 may still execute out of order. Imagine that Thread 2 sends two e-mail messages asking for the values of ready and x. The second message arrives first, before any writes by Thread 1 are done. The memory sends back an e-mail with the value 0 for x. Next, the two writes by Thread 1 are committed. Then the first read message (fetch ready) arrives, and the memory responds with the value 1. Thread 2 sees a non-zero value of ready, but a zero (uninitialized) value of x. We’re in trouble!

Notice that the read of x is speculative. The processor issues the read request just in case the branch ready == 1 were taken. If it’s not, it can always abandon the speculation.

Again, the way to ensure that the two reads are satisfied in program order is to put a fence between them. Here’s the pseudocode.

Thread 1:
  x = 1
  write fence
  ready = 1

Thread 2:
  if ready == 1
    read fence
    R = x

Both fences are necessary!

The write fence is easier to remember. In our publication example,  it makes sense to put it at the end of the constructor. It has the connotation of flushing all the writes performed during construction, before the public handle is initialized.

It’s the need for the read fence that is often overlooked. It’s not immediately obvious that every time you access a published shared variable you have to use a fence. It’s the “every time” part that seems excessive, especially if your code initializes the handle only once (as in the double-checked locking pattern). Sure, there are a few cases when a benign race is acceptable, but even the best of us get it wrong half of the time.

Why is this whole low-level discussion relevant? Very few programmers will be inserting (non-portable) fences into their code. Most programmers will use monitors and locks, which have appropriate fences (or their equivalents) built in. Java programmers will mark shared variables volatile, which will tell the compiler to issue memory fences on every access. C++ and D programmers will occasionally use atomics, which are implemented with all the fencing in place.
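For completeness, here is how the earlier publication example might be written with D’s core.atomic module; release/acquire ordering plays the role of the write and read fences. This is a sketch, not the only way to do it, and the thread function names are made up:

import core.atomic;

shared int x;
shared int ready;

void writerThread()
{
    atomicStore!(MemoryOrder.raw)(x, 1);      // the payload write
    atomicStore!(MemoryOrder.rel)(ready, 1);  // release: orders the payload before the flag
}

void readerThread()
{
    if (atomicLoad!(MemoryOrder.acq)(ready) == 1)  // acquire: orders the reads
    {
        int r = atomicLoad!(MemoryOrder.raw)(x);
        assert(r == 1);   // guaranteed to see the published value
    }
}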

But look at it this way: This is a cautionary story for high-level programmers too. Do not elide synchronization even in the simplest, seemingly obvious cases! Don’t try to be clever! The processors (and the compilers) are out there to get you. The slightest slip and they will “optimize” your code in a way that is contrary to your intuitions.


Most modern multi-processors implement relaxed memory models. How much should you care about that?

When you program in a high-level language, you shouldn’t have to worry about your processor’s memory model, as long as you follow some simple rules.

In Java, for instance, you must make sure that access to shared variables (both for reading and writing) is protected by locks (e.g., synchronized sections). If you are adventurous, and want to share variables without locking, you must declare them as volatile. Finally, if you try to share non-volatile variables without locking, you are introducing data races and your program is considered incorrect (it might still work on some processors). This is what the Java memory model is about, in a nutshell.

The C++ memory model proposal takes a similar approach, except that the use of volatile is replaced by the atomics library. Renegade Java programmers beware–C++ volatile has no multi-thread connotations (except in some vendor-specific extensions).

The D programming language tries to bridge the gap between safe and simple (Java-like), and unsafe and efficient (systems) programming. That almost requires two different memory models.

In the SafeD subset, we’d ideally want to ban all races and deadlocks. That could be accomplished only by eliminating lock-based programming. It seems like a drastic step until you consider that (a) you could still use certified libraries that internally use locking, and (b) you’d have futures, message passing, and STM (Software Transactional Memory) at your disposal. There are languages, like Erlang, that base all concurrency on message passing. And there’s a lot of code written in Erlang, despite its idiosyncratic syntax.

On the systems-programming end, D might as well follow the C++ memory model, down to “raw” atomics, which let you introduce low-level data races.

The volatile keyword is reserved in D, but it’s not clear what it should mean. We might simply get rid of it or change its definition. The problem is that volatile means many different things. In Java, it’s both a directive to the compiler to insert appropriate fences (memory barriers) and a directive to the optimizer not to perform code motion or caching. This is also how Microsoft implemented volatile in their C++ compiler. In principle, code motion and fencing are orthogonal issues. Hans Boehm’s implementation of atomics, for instance, uses a low-level primitive, AO_compiler_barrier, which prevents code movement without introducing memory fences (its implementation is very compiler-specific).

Of course, the D atomics library would not be certified for use in SafeD, but some lock-free data structures based on it would be.


Andrei brought up the idea of encoding the sharing of an object between threads in the type of the object. After months of discussions we are still not sure how far we want to go with it. One thing is for sure, letting the programmer mark objects for sharing can help the compiler prevent a lot of concurrency bugs.

One of the common concurrency errors is accidental sharing. Some data structures are designed for multi-threaded access, e.g., objects with synchronized methods; and they usually work just fine (except for deadlocks). The problem is when a chunk of data that was not designed for sharing is accessed by multiple threads. There is no easy way to detect this error since, in general, concurrency bugs are hard to reproduce.

The proposal is to make accidental sharing impossible. This requires that all objects, by default, be thread-local. For instance, if you declare a global object and initialize it in one thread, another thread will see a different version of this object. In most cases it will see a null pointer or an uninitialized object handle, and you’ll get an easy-to-reproduce null-reference error.

If you consciously want to share an object, you have to declare it “shared”. This type modifier is transitive (just like const and invariant in the D programming language), so you can’t have references to non-shared objects inside a shared object. It simply won’t compile.

A function may declare its (reference) parameter as “shared”, in which case the compiler won’t let you pass a non-shared object to it. Conversely, if the parameter is declared as non-shared (the default), no shared argument may be passed in its place. There is a guarantee that it will be thread-local. (See however “unsharing”.)
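A small sketch of how this looks in source code (the types and function names are placeholders, not part of any library):

shared int hits;         // one instance, visible to all threads
int perThreadCount;      // thread-local by default: every thread gets its own copy

class Connection { int fd; }

void serve(shared Connection c) { /* may be called with shared objects only */ }
void log(Connection c)          { /* thread-local objects only */ }

void example()
{
    auto priv = new Connection;
    auto pub  = new shared(Connection);

    log(priv);        // fine
    serve(pub);       // fine
    // serve(priv);   // error: non-shared argument for a shared parameter
    // log(pub);      // error: shared argument for a non-shared parameter
}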

Let me discuss potential objections to this scheme.

The first is performance–not for shared objects, mind you, but for the non-shared ones. Walter tells us that accessing a thread-local static variable adds between one and three instructions of overhead. That seems quite reasonable, especially considering that in a multi-threaded environment the use of global non-shared variables is rarely correct.

There is also a performance penalty when starting a new thread–all static variables it has access to have to be default-initialized, plus all module constructors have to be called. This might amount to quite a bit. We will recommend not overusing global variables and module constructors. The way to amortize this cost is to create thread pools.

What about invariant objects (ones that are never modified)? Those can be safely shared, so they must be allocated as not thread-local. It is okay for a shared object to contain references to invariant objects.

Can a shared object be “unshared”? This is a tricky one. There are situations when threads hand over objects to each other. The object is only shared during the handover, but otherwise is accessed by one thread at a time. The currently owning thread should be able to call regular library functions (that don’t expect sharing) with such objects. So we need some kind of share/unshare cast. On the other hand, such a cast creates a wormhole into accidental sharing. There is an interesting SharC paper that discusses runtime techniques to make “unsharing” safe. Safe casting from temporarily non-shared to shared is even more tricky. I’ll talk more about it in my next post.

Finally, there is an unexpected bonus from this scheme for the garbage collector. We will be able to use a separate shared heap (which will also store invariant objects), and separate per-thread heaps for non-shared objects. Since there can’t be any references going from the shared/invariant heap to non-shared ones, per-thread garbage collection will be easy. Only occasional collection of the shared heap would require the cooperation of all threads, and even that could be done without stopping the world.


I’ve been working recently on improving the performance of locking in D (the D programming language). At this moment the synchronized statement in D simply uses the underlying OS facilities–CriticalSection on Windows and pthread mutex on Linux. This incurs a substantial performance hit for multithreaded programming.

I looked at research in the Java community and found some interesting solutions. They are all versions of Thin Locks, first described by Bacon et al. These guys figured out that when a program is entering a synchronized section, in about 80% of the cases the section is not locked by anybody. The next most common case is nested locking–the section is locked by the same thread recursively. Only rarely is there actual contention for a lock. And when that happens, it’s very likely that it will happen again and again. Such a contended lock is most likely part of a shared data structure designed to be accessed by multiple threads.

Thin Locks optimize the most common case of no contention. Every Java Object has a word in its header that is used as a thin lock. If this field is zero (I’m skipping details here), the object is not locked. When a thread enters a synchronized section, it optimistically assumes that the lock is zero and tries to flip it to a non-zero value. This is done in one atomic operation, Compare And Swap (CAS). Most processors either have such an instruction built in or provide primitives to implement it.

CAS checks the value in memory, comparing it to the “expected value.” If the comparison succeeds, the “new value” is written in its place. In our case, we expected the value of the thin lock to be zero, and the new value we want to put there is the (non-zero) thread ID of the lock taker. If the CAS succeeds, we are done–we own the lock.

If the CAS fails, the lock has already been taken. Again, the most likely case is that our own thread holds it. So we check if our thread ID is stored in the thin lock. If so, we increment the count field of the thin lock (several fields are cleverly mashed together into one word). We’re done!

If we don’t find our own thread ID in the thin lock, we know we have contention. We have to inflate the lock–allocate an additional object (the fat lock) that holds the general-purpose OS-based mutex. Of course we also check if the thin lock hasn’t already been inflated, in which case we just lock the fat lock.

The inflation process is a little tricky–we can’t just modify the thin lock while it’s being held by another thread. Instead we spin-wait for it to be released, then try to acquire it ourselves, and then inflate it. Once inflated, the lock remains inflated forever, which is usually the right thing to do anyway, since a lock that’s been contended once is likely to be contended many times. The one-time spinning is amortized across many accesses.

Of course there are details that I’ve omitted, but I gave you the gist of the algorithm.  It makes un-contended locking so cheap (one CAS) that in most cases there is no reason to implement two versions, one for single- and one for multithreaded use, of the same data structure. A single-threaded program doesn’t have to pay the multithreaded penalty when using general-purpose libraries.
