In my last post I made a mistake of publishing a piece of code that used C++ weak atomics–which are part of the new C++ Memory Model. I qualified the post by saying that I had no proof of correctness, and the main purpose of it was to discourage the use of weak atomics. In a way the post was successful because it did prove that weak atomics are very tricky indeed. As was pointed out to me by Anthony and Dmitriy, my implementation was incorrect (I modified the post retroactively to make this point clear). Additionally, in his blog, Anthony published a proof of the algorithm posted by Dmitriy. And that’s when the fun started…

Before I describe what happened, let me give you a bird’s eye view of the C++ Memory Model.

## The Model

The purpose of a memory model is to define the semantics of concurrent operations at an *abstract level*–independent of the platform on which the program is running. Essentially, one has to define in what order program statements may be executed, and what memory values may be observed. The basic memory model, the one used by Java, is based on sequential consistency. Program statements are executed in program order on each processor, and the actions of multiple threads are interleaved in some (global) order. A memory load can only see the value that was written last, according to this global order.

The C++ strong atomics (the default) follow this model.

It’s the weak atomics that are trouble. Weak atomics let the programmer specify memory ordering of loads and stores that is less than sequentially consistent. Unlike the sequentially consistent model, the model for weak atomics cannot be described in terms of a state machine (this was pointed out and criticized by Herb Sutter in his Prism proposal). Instead the model is formulated as a set of global constraints imposed on executions.

In a state-machine formulation, a step the program takes is based on previous history. The step doesn’t have to be uniquely defined–the machine may be non-deterministic. The execution may split into multiple branches. What is important, though, is that the step taken does not depend on the *future* of the execution.

This is not true in the weak-atomics model. The way to think about it is that, in a non-sequentially-consistent execution, the values seen by memory loads may come either from the past or from the future. A processor may speculate–make a guess at the values that have not been stored yet. A speculation may be discarded if the guess turns out to be wrong, or accepted if it was right. So the acceptability of a given execution can only be determined by looking at the whole execution, from start to end. Understandably, this makes any correctness proofs extremely tricky.

## The Proof

The proposed implementation of the Peterson lock using weak atomics can be reduced to the following:

// Thread 0 zeroWants.save(true, memory_order_relaxed); r0 = victim.exchange(0, memory_order_acq_rel); r1 = oneWants.load(memory_order_acquire); // Thread 1 oneWants.save(true, memory_order_relaxed); r2 = victim.exchange(1, memory_order_acq_rel); r3 = zeroWants.load(memory_order_acquire);

(The `memory_order_acquire` directives in the two loads are needed for synchronization with the unlock part of the algorithm, which is not included here.)

The hardest part of the proof is to show mutual exclusion–it should be impossible for both threads to enter the critical section at the same time. In Peterson’s algorithm, the entrance to the critical section is predicated on the values loaded into r1 and r3. If both values are false, the algorithm is broken.

Let’s consider the situation when both threads try to acquire the lock for the first time. The values of `zeroWants` and `oneWants` are initially false, and the value of `victim` is zero. Is it possible for both loads (zeroWants and oneWants) to see the initial values, rather than the ones saved by the opposing thread? The short answer is *yes*, as long as there is no intervening inter-thread synchronization.

In this case, the only synchronization that may happen is between a store-release in one thread and the corresponding load-acquire in the other. For instance, if the load part of the exchange operation on `victim` in thread 0 sees the value written by the store part of the exchange in thread 1 (that is, r0 is equal to 1), than these two operations are in a “synchronizes-with” relationship. We can then establish the following chronology:

- The save of
`true`to`oneWants`in T1*happens before*the store of 1 to`victim`(part of exchange) in T1. This is guaranteed by program order. - The store of 1 into
`victim`in T1*happens before*the load of`victim`in T0 (part of exchange). This is because r0 is 1 in this part of the proof. - The load of
`victim`in T0*happens before*the load of`oneWants`in T0. This is guaranteed by program order.

Because the happens-before relation is transitive, we can conclude that the load of `oneWants` in T0 (item 3) must see the value stored in T1 (item 1), which is `true`, and therefore T0 won’t enter the critical section. So far so good.

That was the simple case. Here’s the controversial one: Both r0 and r2 are zero. In particular, it means that r0 came from the initial value of `victim`. What about r2? It could have come either from the exchange in T0 or from the initial value of `victim` (both are zero!). According to the memory model, we *have to* consider both cases!

If r0 came from the exchange in T1 then we can use the same reasoning as before to conclude that T1 cannot enter the critical section.

But if the source of the zero in r0 is the original value of `victim` then there is *no* synchronization between T0 and T1. Both threads can enter the critical section!

*You might be thinking: Either the exchange in T1 comes first and the exchange in T2 must see its value, or the exchange in T2 comes first and the one in T1 must see its value–they can’t both see the original value. This just shows you how ingrained sequentially consistent thinking is. There is no meaning to “comes first” in the world of weak atomics. Two events may only be sequenced if they occur in the same thread, or there is a synchronization path between them. Here we are still at the stage of trying to establish whether the two exchanges are synchronized or not.*

## The Controversy

Let me spell it out: Both exchanges see the initial value of `victim`. Since they didn’t interact, there is no synchronizes-with relationship between them. Therefore we can’t establish any inter-thread chronology, and it’s perfectly possible for both threads to see the original values of `zeroWants` and `oneWants`.

// Thread 0 zeroWants.save(true, memory_order_relaxed); r0 = victim.exchange(0, memory_order_acq_rel); // sees initial value, 0 r1 = oneWants.load(memory_order_acquire); // sees initial value, false // Thread 1 oneWants.save(true, memory_order_relaxed); r2 = victim.exchange(1, memory_order_acq_rel); // sees initial value, 0 r3 = zeroWants.load(memory_order_acquire); // sees initial value, false

I had concluded that the implementation was broken, but I let Anthony defend his proof.

Here’s the argument he presented, based on the *modification order* of `victim`. The C++ memory model enforces global modification order for every memory location. The `victim` variable starts at zero and then is modified twice by the exchanges in T0 and T1. If thread T1 sees the original value, zero, of `victim` then the same exchange modifies it to one. The modification order is therefore (T1: 1, T0: 0). If we assume that an atomic exchange must always see the preceding value in the modification order, then we must conclude that the exchange in T0 must read 1. Therefore it synchronizes with the exchange in T1 and mutual exclusion is assured.

And here’s the clincher: The draft C++ Standard did not specify what value must be seen by the atomic exchange (or, more generally, by an atomic Read-Modify-Write operation). I confirmed that with Hans Boehm, who concluded that this apparent hole in the standard must be corrected.

The next draft of the Standard will say that a RMW operation must see the last value in the modification order of the memory object in question.

Only then the proof of correctness of the weak-atomics implementation of Peterson lock will be complete.

## Conclusion

I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!

Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to top experts whose numbers can be counted on the fingers of one hand.

I’m grateful to (in order of appearance) Anthony Williams, Dmitriy V’jukov, Hans Boehm, and others who contributed to this post.