December 2008



In my last post I made the mistake of publishing a piece of code that used C++ weak atomics, which are part of the new C++ Memory Model. I qualified the post by saying that I had no proof of correctness, and that its main purpose was to discourage the use of weak atomics. In a way the post was successful, because it did prove that weak atomics are very tricky indeed. As was pointed out to me by Anthony and Dmitriy, my implementation was incorrect (I modified the post retroactively to make this point clear). Additionally, in his blog, Anthony published a proof of the algorithm posted by Dmitriy. And that’s when the fun started…

Before I describe what happened, let me give you a bird’s eye view of the C++ Memory Model.

The Model

The purpose of a memory model is to define the semantics of concurrent operations at an abstract level–independent of the platform on which the program is running. Essentially, one has to define in what order program statements may be executed, and what memory values may be observed. The basic memory model, the one used by Java, is based on sequential consistency. Program statements are executed in program order on each processor, and the actions of multiple threads are interleaved in some (global) order. A memory load can only see the value that was written last, according to this global order.

The C++ strong atomics (the default) follow this model.
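
For example, here is the classic store-buffering litmus test written with default (sequentially consistent) atomics; the variable names and the setup are my own illustration, not code from any of the posts:

  atomic<int> x(0);
  atomic<int> y(0);

// Thread A
x.store(1);
int r1 = y.load();

// Thread B
y.store(1);
int r2 = x.load();

Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible: whichever store comes first in the global interleaving must be visible to the other thread’s load. Weak atomics are allowed to break exactly this kind of guarantee.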

It’s the weak atomics that are trouble. Weak atomics let the programmer specify memory ordering of loads and stores that is less than sequentially consistent. Unlike the sequentially consistent model, the model for weak atomics cannot be described in terms of a state machine (this was pointed out and criticized by Herb Sutter in his Prism proposal). Instead the model is formulated as a set of global constraints imposed on executions.

In a state-machine formulation, a step the program takes is based on previous history. The step doesn’t have to be uniquely defined–the machine may be non-deterministic. The execution may split into multiple branches. What is important, though, is that the step taken does not depend on the future of the execution.

This is not true in the weak-atomics model. The way to think about it is that, in a non-sequentially-consistent execution, the values seen by memory loads may come either from the past or from the future. A processor may speculate–make a guess at the values that have not been stored yet. A speculation may be discarded if the guess turns out to be wrong, or accepted if it was right. So the acceptability of a given execution can only be determined by looking at the whole execution, from start to end. Understandably, this makes any correctness proofs extremely tricky.

The Proof

The proposed implementation of the Peterson lock using weak atomics can be reduced to the following:

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel);
r1 = oneWants.load(memory_order_acquire);

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel);
r3 = zeroWants.load(memory_order_acquire);

(The memory_order_acquire directives in the two loads are needed for synchronization with the unlock part of the algorithm, which is not included here.)
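
For completeness, here’s my sketch of what the unlock part could look like; it’s consistent with those acquire loads, but it is not code from Dmitriy’s post:

// Thread 0, unlock
zeroWants.store(false, memory_order_release);

// Thread 1, unlock
oneWants.store(false, memory_order_release);

Each release store synchronizes with the acquire load of the same flag in the other thread, publishing the writes made inside the critical section.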

The hardest part of the proof is to show mutual exclusion–it should be impossible for both threads to enter the critical section at the same time. In Peterson’s algorithm, the entrance to the critical section is predicated on the values loaded into r1 and r3. If both values are false, the algorithm is broken.

Let’s consider the situation when both threads try to acquire the lock for the first time. The values of zeroWants and oneWants are initially false, and the value of victim is zero. Is it possible for both loads (of oneWants and zeroWants) to see the initial values, rather than the ones stored by the opposing thread? The short answer is yes, as long as there is no intervening inter-thread synchronization.

In this case, the only synchronization that may happen is between a store-release in one thread and the corresponding load-acquire in the other. For instance, if the load part of the exchange operation on victim in thread 0 sees the value written by the store part of the exchange in thread 1 (that is, r0 is equal to 1), then these two operations are in a “synchronizes-with” relationship. We can then establish the following chronology:

  1. The store of true to oneWants in T1 happens before the store of 1 to victim (part of the exchange) in T1. This is guaranteed by program order.
  2. The store of 1 into victim in T1 happens before the load of victim in T0 (part of exchange). This is because r0 is 1 in this part of the proof.
  3. The load of victim in T0 happens before the load of oneWants in T0. This is guaranteed by program order.

Because the happens-before relation is transitive, we can conclude that the load of oneWants in T0 (item 3) must see the value stored in T1 (item 1), namely true, and therefore T0 won’t enter the critical section. So far so good.

That was the simple case. Here’s the controversial one: Both r0 and r2 are zero. Since the exchange in T1 writes a one, a zero in r0 can only have come from the initial value of victim. What about r2? It could have come either from the exchange in T0 or from the initial value of victim (both are zero!). According to the memory model, we have to consider both cases!

If r2 came from the exchange in T0 then we can use the same reasoning as before to conclude that T1 cannot enter the critical section.

But if the source of the zero in r2 is also the original value of victim then there is no synchronization between T0 and T1. Both threads can enter the critical section!

You might be thinking: Either the exchange in T1 comes first and the exchange in T0 must see its value, or the exchange in T0 comes first and the one in T1 must see its value; they can’t both see the original value. This just shows you how ingrained sequentially consistent thinking is. There is no meaning to “comes first” in the world of weak atomics. Two events may only be ordered if they occur in the same thread, or if there is a synchronization path between them. Here we are still at the stage of trying to establish whether the two exchanges are synchronized or not.

The Controversy

Let me spell it out: Both exchanges see the initial value of victim. Since they didn’t interact, there is no synchronizes-with relationship between them. Therefore we can’t establish any inter-thread chronology, and it’s perfectly possible for both threads to see the original values of zeroWants and oneWants.

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel); // sees initial value, 0
r1 = oneWants.load(memory_order_acquire); // sees initial value, false

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel); // sees initial value, 0
r3 = zeroWants.load(memory_order_acquire); // sees initial value, false

I had concluded that the implementation was broken, but I let Anthony defend his proof.

Here’s the argument he presented, based on the modification order of victim. The C++ memory model enforces a global modification order for every memory location. The victim variable starts at zero and is then modified twice, by the exchanges in T0 and T1. If thread T1 sees the original value, zero, of victim, then its write of one immediately follows that zero in the modification order, and T0’s write of zero comes last. If we assume that an atomic exchange must always see the value immediately preceding its own write in the modification order, then the exchange in T0 must read 1. Therefore it synchronizes with the exchange in T1, and mutual exclusion is assured.
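
Spelled out as a chronology (my annotation, not Anthony’s exact words):

// Modification order of victim in the contested scenario:
//   0 -- initial value
//   1 -- written by the exchange in T1, which read the initial 0
//   0 -- written by the exchange in T0
// If an exchange must read the value immediately preceding its own
// write in this order, the exchange in T0 must read 1, contradicting
// the assumption that r0 is 0.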

And here’s the clincher: the draft C++ Standard did not specify what value must be seen by the atomic exchange (or, more generally, by an atomic read-modify-write, or RMW, operation). I confirmed that with Hans Boehm, who concluded that this apparent hole in the standard must be corrected.

The next draft of the Standard will say that an RMW operation must see the last value in the modification order of the memory object in question.
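
Incidentally, this is the same property that makes atomic counters work. A minimal illustration of the rule (my example, not taken from the Standard):

  atomic<int> counter(0);

// Executed concurrently by two threads:
counter.fetch_add(1, memory_order_relaxed);

// fetch_add is an RMW: it must read the last value in the
// modification order of counter, so the second increment can
// never see the initial 0 and the final value is always 2.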

Only then will the proof of correctness of the weak-atomics implementation of the Peterson lock be complete.

Conclusion

I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!

Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to the top experts, whose number can be counted on the fingers of one hand.

I’m grateful to (in order of appearance) Anthony Williams, Dmitriy V’jukov, Hans Boehm, and others who contributed to this post.


The question that’s been bugging me lately is: How does C++ make the use of atomic variables both portable and efficient?

I knew how Java volatile worked: it enforces sequential consistency, which is not always the most efficient thing to do.

C++0x has atomic variables, which also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java’s volatile. (Interestingly, Java’s volatile doesn’t make compound operations, such as increment, atomic; Java’s atomic library does.)

However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.

After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, say, the Alpha or the PowerPC, should produce the necessary fencing.

To make things more interesting, there are other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just a matter of skipping all fencing.

I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?

Specifying memory ordering in C++

The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read, so I’m grateful to Hans Boehm for clarifying some of the fine points.

Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:

  atomic<bool> ready(false);
  atomic<int> data(0);

Thread 0:

  data.store(1);
  ready.store(true);

Thread 1:

  if (ready.load())
    assert (data.load() == 1);

As you can see, atomic objects have methods store and load that are used for writing and reading the underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate a memory fence after each store.

But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?

To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:

  atomic<bool> ready(false);
  atomic<int> data(0);

Thread 0:

  data.store(1, memory_order_release);
  ready.store(true, memory_order_release);

Thread 1:

  if (ready.load(memory_order_acquire))
    assert (data.load(memory_order_acquire) == 1);

What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.

Warning: Even if you know that your program will only ever run on an x86, you can’t remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from doing the reordering.
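
To see what’s at stake, here is the naive version with plain variables; this is my sketch of the anti-pattern, not code from any proposal:

// BROKEN: plain variables, no atomics, no ordering constraints
bool ready = false;
int data = 0;

// Thread 0 -- the compiler is free to reorder these two stores:
data = 1;
ready = true;

// Thread 1 -- the compiler may cache ready in a register, or hoist
// the load of data above the branch; it is also a data race, which
// the C++0x model leaves undefined:
if (ready)
  assert (data == 1); // may fail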

Fine-tuning memory ordering

Memory order can be specified using the following enumeration:

namespace std {
 typedef enum memory_order {
   memory_order_relaxed,
   memory_order_consume,
   memory_order_acquire,
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
 } memory_order; 
}

The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.

  • memory_order_acquire: guarantees that subsequent loads and stores are not moved before the current load.
  • memory_order_release: guarantees that preceding loads and stores are not moved past the current store.
  • memory_order_acq_rel: combines the two previous guarantees.
  • memory_order_consume: a potentially weaker form of memory_order_acquire that orders only the operations that are data-dependent on the current load. For instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it (yes, even that is not guaranteed on all platforms!). See the sketch after this list.
  • memory_order_relaxed: all reorderings are okay.
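
Here’s what the memory_order_consume case might look like in practice; the Widget type and the values are made up for illustration:

  struct Widget { int value; };
  atomic<Widget*> ptr(nullptr);

// Producer thread
Widget* w = new Widget;
w->value = 42;
ptr.store(w, memory_order_release);

// Consumer thread
Widget* p = ptr.load(memory_order_consume);
if (p)
  assert (p->value == 42); // the dereference is data-dependent on p,
                           // so it won’t be moved before the load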

As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).
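
In rough terms, the typical mapping to x86 instructions looks like this; the exact code is compiler-dependent, so treat it as a sketch:

x.load(memory_order_acquire);        // plain mov: x86 loads are acquire
x.store(1, memory_order_release);    // plain mov: x86 stores are release
x.store(1, memory_order_seq_cst);    // mov + mfence (or xchg): full fence
x.exchange(1, memory_order_acq_rel); // xchg, implicitly locked: full fence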

But I have also shown before that Peterson’s algorithm won’t work on the x86 without fencing. So it’s not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.

Now here’s where the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here’s what I originally wrote:

[begin quote] The correct C++ code in this case is:

Peterson::Peterson() {
   _victim.store(0, memory_order_release);
   _interested[0].store(false, memory_order_release);
   _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 - me; // the other thread
   _interested[me].exchange(true, memory_order_acq_rel);
   _victim.store(me, memory_order_release);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}

The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]

You can read the comments to this post, especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the “acquire” part of memory_order_acq_rel has no meaning unless another thread writes to (“releases”) the same variable. Since only thread 0 writes to _interested[0] and only thread 1 writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy’s implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony’s proof of it).
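
For reference, here is a sketch of the corrected lock, along the lines of the reduced form analyzed at the top of this post, with the synchronization carried by _victim (Dmitriy’s actual code may differ in details):

void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 - me; // the other thread
   _interested[me].store(true, memory_order_relaxed);
   // Both threads exchange on the shared _victim, so the acquire and
   // release halves have a partner to synchronize with (and the RMW
   // produces the full fence the x86 needs).
   _victim.exchange(me, memory_order_acq_rel);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}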

Conclusion

What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison, old-fashioned assembly programming looks like a piece of cake.

Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.

Microsoft volatile

No, I’m not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true, because some compiler vendors took the liberty of adding non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). These new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead they guaranteed acquire and release semantics (which, on the x86, don’t result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won’t! I thought that was an interesting piece of trivia.
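
Under the Microsoft-specific semantics, the publication pattern reduces to ordinary volatile accesses. Here’s a sketch that relies on that non-standard guarantee and is not portable:

volatile bool ready = false;
int data = 0;

// Thread 0
data = 1;
ready = true; // Microsoft volatile write acts as a release store

// Thread 1
if (ready) // Microsoft volatile read acts as an acquire load
  assert (data == 1);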