C++ | Bartosz Milewski's Programming Cafe

December 1, 2008

C++ atomics and memory ordering

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Programming | Tags: Atomics, C++ |
[39] Comments

The question that’s been bugging me lately was: How does C++ make the use atomic variables both portable and efficient.

I knew how Java volatile worked–it enforced sequential consistency, which is not always the most efficient thing to do.

C++0x has atomic variables which also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java’s volatile (interestingly, Java’s volatile doesn’t enforce atomicity–there is an atomic library in Java, though, which does).

However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.

After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, let’s say, the alpha or the Power PC, should produce the necessary fencing.

To make things more interesting, there are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just the matter of skipping all fencing.

I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?

Specifying memory ordering in C++

The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read so I’m grateful to Hans Boehm for clarifying some of the fine points.

Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1);
  ready.store(true);

Thread 1:

  if (ready.load())
    assert (data.load() == 1);

As you can see, atomic objects have methods store and load that are used for writing and reading underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate memory fences after each store.

But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?

To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1, memory_order_release);
  ready.store(true, memory_order_release);

Thread 1:

  if (ready.load(memory_order_acquire))
    assert (data.load(memory_order_acquire) == 1);

What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.

Warning: Even if you know that your program will only ever run on an x86, you can’t remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from doing the reordering.

Fine-tuning memory ordering

Memory order can be specified using the following enumeration:

namespace std {
 typedef enum memory_order {
   memory_order_relaxed,
   memory_order_consume,
   memory_order_acquire,
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
 } memory_order; 
}

The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.

memory_order_acquire: guarantees that subsequent loads are not moved before the current load or any preceding loads.
memory_order_release: preceding stores are not moved past the current store or any subsequent stores.
memory_order_acq_rel: combines the two previous guarantees.
memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it (yes, even that is not guaranteed on all platforms!).
memory_order_relaxed: all reorderings are okay.

As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).

But I have also shown before that the Peterson’s algorithm won’t work on the x86 without fencing. So it’s not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.

Now here the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here’s what I originally wrote:

[begin quote] The correct C++ code in this case is:

Peterson::Peterson() {
   _victim.store(0, memory_order_release);
   _interested[0].store(false, memory_order_release);
   _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 – me; // the other thread
   _interested[me].exchange(true, memory_order_acq_rel);
   _victim.store(me, memory_order_release);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}

The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]

You can read comments to this post–especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the “acquire” part of memory_order_acq_rel has no meaning unless another thread writes to (“releases”) the same variable. Since only thread 0 writes to _interested[0] and only thread 1 one writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy’s implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony’s proof of it).

Conclusion

What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison to that, the old fashioned assembly programming looks like a piece of cake.

Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.

Microsoft volatile

No, I’m not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true because some compiler vendors took the liberty to add non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). This new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead it guaranteed acquire and release semantics (which, on the x86, didn’t result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won’t! I thought that was an interesting piece of trivia.

November 11, 2008

Who ordered sequential consistency?

Posted by Bartosz Milewski under C++, Concurrency, Java, Multicore, Programming | Tags: Atomics, C++, Concurrency, Java, mfence, Multicore, volatile |
[19] Comments

How does Java do it? Motivation for C++ programmers

Things like double-checked locking pattern and Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse Java volatile with C++ volatile–the latter has nothing to do with multithreading.) Obviously the Java compiler knows what kind of fences are necessary on relaxed memory architectures, including the x86. I thought it was worth looking at.

Why the sudden interest in Java? Because Java memory model is very relevant to C++0x. The keyword here is sequential consistency. Java enforces sequential consistency on all access to volatile variables. C++0x introduces atomic objects which, by default, also follow sequential consistency. So C++ atomics will, by default, behave almost exactly like Java volatile variables.

Also, why even bother studying the quirks of the x86? Can’t we just stick to locks and, occasionally, to default atomics and let the compiler/library writers worry about how they translate into fences? Absolutely!

This should be the end of the story, except that a lot of programmers seem to “know better.” They will try to optimize multithreaded algorithms and get into a world of trouble. So if you know anybody who might be tempted to write or optimize lock-free algorithms, read on.

Why the x86? Not only because it’s the most common chip, but also because its memory model is deceptively “normal.” It’s much more likely for programmers to attempt to play games with the x86 than, for instance, with the alpha. The rest of this post should serve as motivation to stay away form such games.

What is Sequential Consistency?

Sequential consistency is how we naturally think about multithreaded programs. It’s also how we think about the world. If A happens before B then it cannot be true that B happens before A. If one processor stores 1 in variable x, and another processor stores 1 in variable y, than either the sequence of events is:

x flips to 1 and then y flips to 1, or
y flips to 1 and then x flips to 1

(assume both are initially zero). Which sequence actually happened could be observed by a third processor, which loads both variables. If, for instance, it sees x == 1 and y == 0, it concludes that the write to x happened earlier. If it sees x == 0 and y == 1, it concludes that the write to y happened earlier (if it sees x == y, it can’t tell).

Now, in a sequentially consistent world, a fourth processor could not see the two events in a different order than what the third processor observed. We assume that, for each particular run of the program, there is a global sequence of events that is seen by all processors.

Of course, on multicore processors many things can happen simultaneously, and it usually doesn’t matter–except when memory access is involved. Sequentially consistent model assumes that there is a single switch between all processors and memory, and only one processor at a time can access it. This imaginary switch serves as the serialization point.

The x86 is not sequentially consistent

Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.

The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory very expensive–hundreds of cycles.

I don’t care, says Java

The Java memory model requires that all access to shared volatile variables be sequentially consistent across all threads, even when they run on different cores. How do they enforce it?

The primary source for this kind of information is Doug Lea’s excellent The JSR-133 Cookbook for Compiler Writers.

Here’s the big picture: When Java is translated into bytecode, special codes are issued around volatile-variable access. These codes tell the Java runtime where memory fences should be inserted. The bytecodes are supposed to be processor independent; the fences are highly processor dependent. I’ll call the special bytecodes memory barriers as opposed to processor-specific memory fences. Memory barriers are inserted conservatively–the compiler assumes the most relaxed memory model (in other words, the bytecode has to run correctly on an alpha, whose memory model is so relaxed you’d think it was conceived in a hot tub).

Fences should go between memory accesses, so Java barriers are named after the kind of accesses (load or store) they separate. There is a LoadLoad barrier between volatile reads, LoadStore barrier between a read and a write, StoreLoad barrier between a write and a read, and StoreStore barrier between writes.

Here’s the first problem: the compiler can often only see one part of the pair. In that case it has to assume the worst and use the most restrictive barrier. Considering that, here are some of the cookbook rules:

Issue a StoreStore barrier before each volatile store.
Issue a StoreLoad barrier after each volatile store.
Issue LoadLoad and LoadStore barriers after each volatile load.

That’s a lot of barriers! Fortunately, the compiler can optimize some of them away using dataflow analysis.

The next step is to run the bytecode on a particular processor (or compile it to native code). If the processor is an alpha, practically all Java barriers translate into some kind of fences. But if it’s an x86, all but the StoreLoad barriers are turned into no-ops. The StoreLoad barrier is either translated into an mfence or a locked instruction (lock xchg).

When executed on an x86, the barrier rules boil down to this one: Issue an mfence after each volatile store (or turn it into lock xchg).

Peterson lock on an x86 in Java

Figuring the details of the Peterson lock in Java was not easy. I’m grateful to Doug Lea for patiently explaining to me some of the gotchas. I am solely responsible for any mistakes though.

In my previous post I came to the conclusion that for Peterson lock to work on an x86, an mfence is needed. Here’s the relevant pseudo-code:

Thread 0	Thread 1
store(zeroWants, 1) mfence store(victim, 0) // Java puts another fence here r0 = load(oneWants) r1 = load(victim)	store(oneWants, 1) mfence store(victim, 1) // Java puts another fence here r0 = load(zeroWants) r1 = load(victim)

In Java, all three variables, zeroWants, oneWants, and victim, would be declared volatile. Using the x86 Java translation rules, that would mean putting an mfence after every store to these variables.

The fences after the writes to zeroWant and oneWant are just what the doctor ordered. Form my previous post we know why they are needed on an x86.

The mfence after the write to victim is something new though. It isn’t strictly necessary on an x86, but to see that you really have to analyze the whole algorithm–something that’s beyond capabilities of modern compilers.

My original (turns out, incorrect) reasoning was that no fence is needed between the write to victim and the subsequent read because, on an x86, reads and writes to the same location are never reordered. Well, that’s not the whole story because…

Intra-processor forwarding is allowed

Consider the following example from the x86 spec. Initially x == y == 0.

Processor 0	Processor 1
mov [_x], 1 mov r1, [_x] mov r2, [_y]	mov [_y], 1 mov r3, [_y] mov r4, [_x]

Processor 0

Processor 1

mov [_x], 1
mov r1, [_x]
mov r2, [_y]

mov [_y], 1

mov r3, [_y]
mov r4, [_x]

The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what–sometimes it doesn’t.

The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processor to see each other’s writes delayed, even though they see their own writes immediately.

This result violates sequential consistency and therefore cannot be tolerated in Java. Hence the need for the mfence between the write to victim and the subsequent read of victim.

Gosh darn it! I want to optimize it!

Having said that, I must point out that, in this particular case, lack of sequential consistency is not a deal breaker. It’s okay for thread 0 of Peterson lock to read stale (local) value of victim, even if “in the meanwhile” thread 1 overwrote it. All it means is that some unnecessary spins might be performed. This doesn’t violate the lock principles and doesn’t lead to a deadlock as long as the new value of victim eventually reaches thread 0.

I’m pretty sure of this reasoning–at least at the time of this writing. However, I wouldn’t be totally surprised is somebody found a fault in it.

The bottom line is that even such an “obvious” optimization is not obviously correct. The proof of correctness of the Peterson lock is based on the assumption of sequential consistency. If we relax this assumption even slightly, we have to redo the whole proof.

Any volunteers?

Bartosz Milewski's Programming Cafe