The question that's been bugging me lately is: How does C++ make the use of atomic variables both portable and efficient?

I knew how Java volatile worked–it enforced sequential consistency, which is not always the most efficient thing to do.

C++0x has atomic variables that also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java's volatile (interestingly, Java's volatile doesn't give you atomic read-modify-write operations; there is an atomic library in Java, though, which does).

However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.

After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, let's say, the Alpha or the PowerPC, should produce the necessary fencing.

To make things more interesting, there are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it's not just a matter of skipping all fencing.

I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?

Specifying memory ordering in C++

The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read so I’m grateful to Hans Boehm for clarifying some of the fine points.

Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:

  atomic<bool> ready(false);
  atomic<int> data(0);

Thread 0:

  data.store(1);
  ready.store(true);

Thread 1:

  if (ready.load())
    assert (data.load() == 1);

As you can see, atomic objects have methods store and load that are used for writing and reading underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate memory fences after each store.
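Just to make the default concrete, here's a small illustration (my own snippet, not taken from the proposal) showing that the bare store/load calls and the overloaded operators on an atomic all amount to sequentially consistent operations:

  #include <atomic>

  int main() {
    std::atomic<int> x(0);

    x.store(1);                             // defaults to memory_order_seq_cst
    x.store(1, std::memory_order_seq_cst);  // the same thing, spelled out
    x = 1;                                  // operator= is seq_cst as well

    int a = x.load();                       // defaults to memory_order_seq_cst
    int b = x;                              // implicit conversion, also seq_cst
    (void)a; (void)b;
  }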

But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?

To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:

  atomic<bool> ready(false);
  atomic<int> data(0);

Thread 0:

  data.store(1, memory_order_release);
  ready.store(true, memory_order_release);

Thread 1:

  if (ready.load(memory_order_acquire))
    assert (data.load(memory_order_acquire) == 1);

What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.

Warning: Even if you know that your program will only ever run on an x86, you can't remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from reordering the operations.
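For completeness, here's one self-contained way to exercise the optimized pattern (the std::thread scaffolding and the spin loop are my additions; the consumer spins on the flag so the assert is always reached):

  #include <atomic>
  #include <cassert>
  #include <thread>

  std::atomic<bool> ready(false);
  std::atomic<int> data(0);

  void publisher() {
    data.store(1, std::memory_order_release);
    ready.store(true, std::memory_order_release);
  }

  void consumer() {
    // Wait until the flag has been published; the acquire load then
    // guarantees that the store to data is visible.
    while (!ready.load(std::memory_order_acquire))
      continue;
    assert(data.load(std::memory_order_acquire) == 1);
  }

  int main() {
    std::thread t0(publisher);
    std::thread t1(consumer);
    t0.join();
    t1.join();
  }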

Fine-tuning memory ordering

Memory order can be specified using the following enumeration:

namespace std {
 typedef enum memory_order {
   memory_order_relaxed,
   memory_order_consume,
   memory_order_acquire,
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
 } memory_order; 
}

The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.

  • memory_order_acquire: guarantees that no loads or stores that follow the current load (in program order) are moved before it.
  • memory_order_release: guarantees that no loads or stores that precede the current store are moved past it.
  • memory_order_acq_rel: combines the two previous guarantees.
  • memory_order_consume: a potentially weaker form of memory_order_acquire that enforces ordering of the current load only with respect to subsequent operations that are data-dependent on it. For instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won't be moved before it (yes, even that is not guaranteed on all platforms!); see the sketch after this list.
  • memory_order_relaxed: all reorderings are okay.
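To make the consume case concrete, here's a minimal sketch of the classic pointer-publication pattern (the Payload type and the function names are mine):

  #include <atomic>

  struct Payload { int value; };

  std::atomic<Payload*> guard(nullptr);

  // Producer: fill the payload, then publish the pointer with release semantics.
  void produce(Payload* p) {
    p->value = 42;
    guard.store(p, std::memory_order_release);
  }

  // Consumer: the consume load orders only the operations that are
  // data-dependent on the loaded pointer, namely the dereference below.
  int consume() {
    Payload* p = guard.load(std::memory_order_consume);
    if (p)
      return p->value; // data-dependent on p, so it is guaranteed to see 42
    return 0;
  }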

As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).

But I have also shown before that Peterson's algorithm won't work on the x86 without fencing. So it's not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.

Now here's where the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here's what I originally wrote:

[begin quote] The correct C++ code in this case is:

Peterson::Peterson() {
   _victim.store(0, memory_order_release);
   _interested[0].store(false, memory_order_release);
   _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 - me; // the other thread
   _interested[me].exchange(true, memory_order_acq_rel);
   _victim.store(me, memory_order_release);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}

The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]

You can read the comments to this post, especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the "acquire" part of memory_order_acq_rel has no meaning unless another thread writes to ("releases") the same variable. Since only thread 0 writes to _interested[0] and only thread 1 writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy's implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony's proof of it).
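I won't try to reproduce Dmitriy's version here. For comparison, though, here is how the lock looks if you simply stay with the default memory_order_seq_cst on every operation (a sketch of mine, with the thread id passed in explicitly instead of the threadID variable above). It is much easier to verify, because the classical sequential-consistency argument applies directly, at the price of full fencing on the x86:

  #include <atomic>

  class Peterson {
  public:
    Peterson() {
      _victim.store(0);
      _interested[0].store(false);
      _interested[1].store(false);
    }
    void lock(int me) {    // me is 0 or 1
      int he = 1 - me;     // the other thread
      _interested[me].store(true);
      _victim.store(me);
      while (_interested[he].load() && _victim.load() == me)
        continue;          // spin
    }
    void unlock(int me) {
      _interested[me].store(false);
    }
  private:
    std::atomic<int> _victim;
    std::atomic<bool> _interested[2];
  };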

Conclusion

What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison, old-fashioned assembly programming looks like a piece of cake.

Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.

Microsoft volatile

No, I'm not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true, because some compiler vendors took the liberty of adding non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). These new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead they guaranteed acquire and release semantics (which, on the x86, didn't result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won't! I thought that was an interesting piece of trivia.
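To make the distinction concrete, here's a sketch of the publication pattern written against Microsoft's extended volatile semantics (a volatile write acting as a release, a volatile read as an acquire). This is my illustration of the vendor extension, not standard C++; under the standard this code has a data race and no ordering guarantees at all:

  #include <cassert>

  // Relies on the non-standard Microsoft semantics of volatile.
  volatile bool ready = false;
  int data = 0; // plain variable, published through 'ready'

  void thread0() {
    data = 1;
    ready = true; // release-like write: the store to data stays before it
  }

  void thread1() {
    if (ready)    // acquire-like read: the read of data stays after it
      assert(data == 1);
  }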