In my last post I made the mistake of publishing a piece of code that used C++ weak atomics–which are part of the new C++ Memory Model. I qualified the post by saying that I had no proof of correctness, and that its main purpose was to discourage the use of weak atomics. In a way the post was successful, because it did prove that weak atomics are very tricky indeed. As was pointed out to me by Anthony and Dmitriy, my implementation was incorrect (I modified the post retroactively to make this point clear). Additionally, in his blog, Anthony published a proof of the algorithm posted by Dmitriy. And that’s when the fun started…
Before I describe what happened, let me give you a bird’s eye view of the C++ Memory Model.
The Model
The purpose of a memory model is to define the semantics of concurrent operations at an abstract level–independent of the platform on which the program is running. Essentially, one has to define in what order program statements may be executed, and what memory values may be observed. The basic memory model–the one Java guarantees for correctly synchronized, data-race-free programs–is sequential consistency. Program statements are executed in program order on each processor, and the actions of multiple threads are interleaved in some (global) order. A memory load can only see the value that was written last, according to this global order.
The C++ strong atomics (the default) follow this model.
It’s the weak atomics that are trouble. Weak atomics let the programmer specify memory ordering of loads and stores that is less than sequentially consistent. Unlike the sequentially consistent model, the model for weak atomics cannot be described in terms of a state machine (this was pointed out and criticized by Herb Sutter in his Prism proposal). Instead the model is formulated as a set of global constraints imposed on executions.
In a state-machine formulation, a step the program takes is based on previous history. The step doesn’t have to be uniquely defined–the machine may be non-deterministic. The execution may split into multiple branches. What is important, though, is that the step taken does not depend on the future of the execution.
This is not true in the weak-atomics model. The way to think about it is that, in a non-sequentially-consistent execution, the values seen by memory loads may come either from the past or from the future. A processor may speculate–make a guess at the values that have not been stored yet. A speculation may be discarded if the guess turns out to be wrong, or accepted if it was right. So the acceptability of a given execution can only be determined by looking at the whole execution, from start to end. Understandably, this makes any correctness proofs extremely tricky.
The Proof
The proposed implementation of the Peterson lock using weak atomics can be reduced to the following:
// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel);
r1 = oneWants.load(memory_order_acquire);

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel);
r3 = zeroWants.load(memory_order_acquire);
(The memory_order_acquire directives in the two loads are needed for synchronization with the unlock part of the algorithm, which is not included here.)
The hardest part of the proof is to show mutual exclusion–it should be impossible for both threads to enter the critical section at the same time. In Peterson’s algorithm, the entrance to the critical section is predicated on the values loaded into r1 and r3. If both values are false, the algorithm is broken.
Let’s consider the situation when both threads try to acquire the lock for the first time. The values of zeroWants and oneWants are initially false, and the value of victim is zero. Is it possible for both loads (zeroWants and oneWants) to see the initial values, rather than the ones saved by the opposing thread? The short answer is yes, as long as there is no intervening inter-thread synchronization.
In this case, the only synchronization that may happen is between a store-release in one thread and the corresponding load-acquire in the other. For instance, if the load part of the exchange operation on victim in thread 0 sees the value written by the store part of the exchange in thread 1 (that is, r0 is equal to 1), then these two operations are in a “synchronizes-with” relationship. We can then establish the following chronology:
- The store of true to oneWants in T1 happens before the store of 1 to victim (part of exchange) in T1. This is guaranteed by program order.
- The store of 1 into victim in T1 happens before the load of victim in T0 (part of exchange). This is because r0 is 1 in this part of the proof.
- The load of victim in T0 happens before the load of oneWants in T0. This is guaranteed by program order.
Because the happens-before relation is transitive, we can conclude that the load of oneWants in T0 (item 3) must see the value stored in T1 (item 1), which is true, and therefore T0 won’t enter the critical section. So far so good.
That was the simple case. Here’s the controversial one: Both r0 and r2 are zero. In particular, it means that r0 came from the initial value of victim. What about r2? It could have come either from the exchange in T0 or from the initial value of victim (both are zero!). According to the memory model, we have to consider both cases!
If r2 came from the exchange in T0, then we can use the same reasoning as before to conclude that T1 cannot enter the critical section.
But if the source of the zero in r2 is the original value of victim, then there is no synchronization between T0 and T1. Both threads can enter the critical section!
You might be thinking: Either the exchange in T0 comes first and the exchange in T1 must see its value, or the exchange in T1 comes first and the one in T0 must see its value–they can’t both see the original value. This just shows you how ingrained sequentially consistent thinking is. There is no meaning to “comes first” in the world of weak atomics. Two events may only be sequenced if they occur in the same thread, or if there is a synchronization path between them. Here we are still at the stage of trying to establish whether the two exchanges are synchronized or not.
The Controversy
Let me spell it out: Both exchanges see the initial value of victim. Since they didn’t interact, there is no synchronizes-with relationship between them. Therefore we can’t establish any inter-thread chronology, and it’s perfectly possible for both threads to see the original values of zeroWants and oneWants.
// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel); // sees initial value, 0
r1 = oneWants.load(memory_order_acquire);      // sees initial value, false

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel); // sees initial value, 0
r3 = zeroWants.load(memory_order_acquire);     // sees initial value, false
I had concluded that the implementation was broken, but I let Anthony defend his proof.
Here’s the argument he presented, based on the modification order of victim. The C++ memory model enforces a global modification order for every memory location. The victim variable starts at zero and is then modified twice, by the exchanges in T0 and T1. If the exchange in T1 sees the original value, zero, of victim, then that same exchange modifies it to one. The modification order is therefore (T1: 1, T0: 0). If we assume that an atomic exchange must always see the immediately preceding value in the modification order, then we must conclude that the exchange in T0 must read 1. Therefore it synchronizes with the exchange in T1, and mutual exclusion is assured.
And here’s the clincher: The draft C++ Standard did not specify what value must be seen by the atomic exchange (or, more generally, by an atomic Read-Modify-Write operation). I confirmed that with Hans Boehm, who concluded that this apparent hole in the standard must be corrected.
The next draft of the Standard will say that a RMW operation must see the last value in the modification order of the memory object in question.
Only then will the proof of correctness of the weak-atomics implementation of the Peterson lock be complete.
Conclusion
I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!
Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to top experts whose numbers can be counted on the fingers of one hand.
I’m grateful to (in order of appearance) Anthony Williams, Dmitriy V’jukov, Hans Boehm, and others who contributed to this post.
December 24, 2008 at 1:36 am
> The basic memory model, the one used by Java, is based on sequential consistency.
Java’s memory model definitely doesn’t guarantee SC, unless you declare all your variables as volatile.
December 24, 2008 at 4:23 pm
I’ve been trying to review the new standard. While there are some really nice things, there are other bits like this which I’m sure are going to become warts that no one will ever understand, with both users and compilers misunderstanding and misimplementing things.
January 16, 2009 at 6:06 am
Re: And here’s the clincher: The draft C++ Standard did not specify what value must be seen by the atomic exchange
Another curious moment: In early drafts it was stated that fence() on an atomic object executes an RMW on the memory location. But it was not stated that the RMW must be the identity, i.e. fence() was allowed to execute basically any RMW as a side effect 🙂
January 16, 2009 at 6:11 am
Re: It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!
This is why I am not trying to do this anymore, since I’ve implemented formal rules in the Relacy Race Detector:
http://groups.google.com/group/relacy
Now, I am unable to beat Relacy in formal verification; however, Relacy can beat me in so many cases. 10^6 verified interleavings per second is quite powerful indeed, and it has also revealed some problems in the C++ drafts.
January 22, 2009 at 6:16 am
Good day!
Have you ever considered using a single 32-bit dword to store all 3 variables?
Shouldn’t writes to the same location then have a total order (given the compiler won’t ‘optimize’ the access)?
However, I cannot find the corresponding section in the x86-64 manual now…
January 22, 2009 at 7:47 am
Good day!
Storing all variables in a single word should work on x86 for now. Read posts #2 and #7 here:
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/62973/
However, I believe that the ‘collocation trick’ (partial store-to-load forwarding) is not a documented feature of the IA-32 and Intel 64 architectures (it’s probably documented for SPARC, because Sun JVM hackers use it).
And the collocation trick is not actually any faster than mfence, so it makes little sense on x86.
And the collocation trick makes no sense at all in the context of the portable C++0x memory model.
—
Dmitriy V’jukov
January 22, 2009 at 7:50 am
Just to make it explicit: the collocation trick does work on SPARC, and it’s actually faster than a #StoreLoad membar. It’s a kind of fine-grained store-load fence. You can find details in QRL-OpLocks-BiasedLocking.pdf (search for “collocation”).
—
Dmitriy V’jukov
January 22, 2009 at 8:32 am
Thanks, I did not know this even has a proper name!
I totally agree with you that it makes little sense in the C++ memory model context; however, on x86 I actually observe differences compared to mfence. Time to bring up the disassembler, I suppose, and find my bug…
Thanks for the materials!
January 23, 2009 at 3:04 am
PS: Just did a few test runs on Xeon-8x and AMD-2x machines:
Peterson with ‘collocation’ is ~30% faster than using an mfence instead, on both machines.
So how did you come to the opposite conclusion?
January 23, 2009 at 3:34 am
I’ve concluded it from post #7 here:
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/62973/
Well, I think that I was too fast in generalizing that collocation is no faster than mfence on x86 in general. The tests were done on an Intel Core i7. However, with every generation of processors Intel reports that locked instructions and fences work faster, so I think now they match.
December 29, 2013 at 2:29 am
Hi, Bartosz
If I understand correctly, the cause is that the value returned by exchange was not specified in the old draft: it could be a stale value, or a value just stored by another thread?
BTW, I think that on the x86 platform a locked `xchg` will always yield the latest value, so there will not be any problem there. (But not on other platforms.) Right?
Thanks!
Xin Huang
December 11, 2014 at 6:42 am
Hi, Bartosz
If RMW operations can load the value set by another thread, does it mean that the following implementation also work?
// Thread 0
zeroWants.store(true, memory_order_release);
r0 = victim.exchange(0, memory_order_relaxed);
r1 = oneWants.fetch_add(0, memory_order_acq_rel);
// Thread 1
oneWants.store(true, memory_order_release);
r2 = victim.exchange(1, memory_order_relaxed);
r3 = zeroWants.fetch_add(0, memory_order_acq_rel);