How does Java do it? Motivation for C++ programmers

Things like the double-checked locking pattern and the Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse Java volatile with C++ volatile–the latter has nothing to do with multithreading.) Obviously the Java compiler knows what kind of fences are necessary on relaxed memory architectures, including the x86. I thought it was worth looking at.
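
For reference, here is a minimal sketch of the double-checked locking pattern as it is usually written in Java. The class and field names are mine, not taken from any particular library:

class Singleton {
    // volatile is what makes double-checked locking correct under the
    // Java memory model; with a plain field the pattern is broken
    private static volatile Singleton instance;

    static Singleton getInstance() {
        Singleton local = instance;            // first check, no lock
        if (local == null) {
            synchronized (Singleton.class) {
                local = instance;              // second check, under the lock
                if (local == null) {
                    instance = local = new Singleton();
                }
            }
        }
        return local;
    }
}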

Why the sudden interest in Java? Because the Java memory model is very relevant to C++0x. The keyword here is sequential consistency. Java enforces sequential consistency on all accesses to volatile variables. C++0x introduces atomic objects which, by default, also follow sequential consistency. So C++ atomics will, by default, behave almost exactly like Java volatile variables.

Also, why even bother studying the quirks of the x86? Can’t we just stick to locks and, occasionally, to default atomics and let the compiler/library writers worry about how they translate into fences? Absolutely!

This should be the end of the story, except that a lot of programmers seem to “know better.” They will try to optimize multithreaded algorithms and get into a world of trouble. So if you know anybody who might be tempted to write or optimize lock-free algorithms, read on.

Why the x86? Not only because it’s the most common chip, but also because its memory model is deceptively “normal.” It’s much more likely for programmers to attempt to play games with the x86 than, for instance, with the alpha. The rest of this post should serve as motivation to stay away from such games.

What is Sequential Consistency?

Sequential consistency is how we naturally think about multithreaded programs. It’s also how we think about the world. If A happens before B then it cannot be true that B happens before A. If one processor stores 1 in variable x, and another processor stores 1 in variable y, then either the sequence of events is:

  • x flips to 1 and then y flips to 1, or
  • y flips to 1 and then x flips to 1

(assume both are initially zero). Which sequence actually happened could be observed by a third processor, which loads both variables. If, for instance, it sees x == 1 and y == 0, it concludes that the write to x happened earlier. If it sees x == 0 and y == 1, it concludes that the write to y happened earlier (if it sees x == y, it can’t tell).

Now, in a sequentially consistent world, a fourth processor could not see the two events in a different order than what the third processor observed. We assume that, for each particular run of the program, there is a global sequence of events that is seen by all processors.
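
To make this concrete, here is a minimal Java sketch of the scenario; the class, field, and method names are mine. Because x and y are volatile, Java guarantees a single global order of the two writes:

class Iriw {
    static volatile int x = 0;   // written by one thread
    static volatile int y = 0;   // written by another thread

    static int r1, r2, r3, r4;   // what the two observers saw

    static void writerX()   { x = 1; }
    static void writerY()   { y = 1; }

    static void observerA() { r1 = x; r2 = y; }   // the "third processor"
    static void observerB() { r3 = y; r4 = x; }   // the "fourth processor"

    // Forbidden outcome under sequential consistency:
    //   r1 == 1 && r2 == 0   (A concludes: x flipped to 1 first)
    //   r3 == 1 && r4 == 0   (B concludes: y flipped to 1 first)
    // With volatile variables Java rules this outcome out.
}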

Of course, on multicore processors many things can happen simultaneously, and it usually doesn’t matter–except when memory access is involved. The sequentially consistent model assumes that there is a single switch between all processors and memory, and that only one processor at a time can access it. This imaginary switch serves as the serialization point.

The x86 is not sequentially consistent

Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.

The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive–hundreds of cycles.

I don’t care, says Java

The Java memory model requires that all access to shared volatile variables be sequentially consistent across all threads, even when they run on different cores. How do they enforce it?

The primary source for this kind of information is Doug Lea’s excellent The JSR-133 Cookbook for Compiler Writers.

Here’s the big picture: When Java is translated into bytecode, special codes are issued around volatile-variable access. These codes tell the Java runtime where memory fences should be inserted. The bytecodes are supposed to be processor independent; the fences are highly processor dependent. I’ll call the special bytecodes memory barriers as opposed to processor-specific memory fences. Memory barriers are inserted conservatively–the compiler assumes the most relaxed memory model (in other words, the bytecode has to run correctly on an alpha, whose memory model is so relaxed you’d think it was conceived in a hot tub).

Fences should go between memory accesses, so Java barriers are named after the kind of accesses (load or store) they separate. There is a LoadLoad barrier between volatile reads, LoadStore barrier between a read and a write, StoreLoad barrier between a write and a read, and StoreStore barrier between writes.

Here’s the first problem: the compiler often sees only one of the two accesses a barrier is supposed to separate. In that case it has to assume the worst and use the most restrictive barrier. With that in mind, here are some of the cookbook rules:

  • Issue a StoreStore barrier before each volatile store.
  • Issue a StoreLoad barrier after each volatile store.
  • Issue LoadLoad and LoadStore barriers after each volatile load.

That’s a lot of barriers! Fortunately, the compiler can optimize some of them away using dataflow analysis.

The next step is to run the bytecode on a particular processor (or compile it to native code). If the processor is an alpha, practically all Java barriers translate into some kind of fences. But if it’s an x86, all but the StoreLoad barriers are turned into no-ops. The StoreLoad barrier is either translated into an mfence or a locked instruction (lock xchg).

When executed on an x86, the barrier rules boil down to this one: Issue an mfence after each volatile store (or turn it into lock xchg).
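
Putting it all together, here is a rough sketch of where the barriers land and what they turn into on an x86. The class and its fields are mine, just an illustration of the rules:

class VolatileAccess {
    int data;            // ordinary field
    volatile int flag;   // volatile field

    void writer() {
        data = 42;
        // StoreStore barrier -> no-op on x86
        flag = 1;        // volatile store
        // StoreLoad barrier  -> mfence (or lock xchg) on x86
    }

    int reader() {
        int f = flag;    // volatile load
        // LoadLoad and LoadStore barriers -> no-ops on x86
        return f == 1 ? data : -1;
    }
}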

Peterson lock on an x86 in Java

Figuring the details of the Peterson lock in Java was not easy. I’m grateful to Doug Lea for patiently explaining to me some of the gotchas. I am solely responsible for any mistakes though.

In my previous post I came to the conclusion that for Peterson lock to work on an x86, an mfence is needed. Here’s the relevant pseudo-code:

Thread 0:
store(zeroWants, 1)
mfence
store(victim, 0)
// Java puts another fence here
r0 = load(oneWants)
r1 = load(victim)

Thread 1:
store(oneWants, 1)
mfence
store(victim, 1)
// Java puts another fence here
r0 = load(zeroWants)
r1 = load(victim)

In Java, all three variables, zeroWants, oneWants, and victim, would be declared volatile. Using the x86 Java translation rules, that would mean putting an mfence after every store to these variables.
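
Here is a sketch of how that might look as a Java class. The lock/unlock structure is mine; the algorithm is the standard two-thread Peterson lock:

class PetersonLock {
    // all shared variables are volatile, so on an x86 the JIT emits an
    // mfence (or a locked instruction) after every store to them
    private volatile boolean zeroWants = false;
    private volatile boolean oneWants  = false;
    private volatile int victim = 0;

    void lock(int id) {            // id is 0 or 1
        if (id == 0) {
            zeroWants = true;      // store + fence
            victim = 0;            // store + fence
            while (oneWants && victim == 0) { /* spin */ }
        } else {
            oneWants = true;       // store + fence
            victim = 1;            // store + fence
            while (zeroWants && victim == 1) { /* spin */ }
        }
    }

    void unlock(int id) {
        if (id == 0) zeroWants = false; else oneWants = false;
    }
}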

The fences after the writes to zeroWants and oneWants are just what the doctor ordered. From my previous post we know why they are needed on an x86.

The mfence after the write to victim is something new though. It isn’t strictly necessary on an x86, but to see that you really have to analyze the whole algorithm–something that’s beyond the capabilities of modern compilers.

My original (turns out, incorrect) reasoning was that no fence is needed between the write to victim and the subsequent read because, on an x86, reads and writes to the same location are never reordered. Well, that’s not the whole story because…

Intra-processor forwarding is allowed

Consider the following example from the x86 spec. Initially x == y == 0.

Processor 0:
mov [_x], 1
mov r1, [_x]
mov r2, [_y]

Processor 1:
mov [_y], 1
mov r3, [_y]
mov r4, [_x]

The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what–sometimes it doesn’t.

The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though they see their own writes immediately.

This result violates sequential consistency and therefore cannot be tolerated in Java. Hence the need for the mfence between the write to victim and the subsequent read of victim.
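
For completeness, here is the same litmus test written in Java (the names are mine). With volatile fields the questionable outcome disappears, because the JIT inserts an mfence between each store and the following load:

class StoreBuffering {
    static volatile int x = 0;
    static volatile int y = 0;

    static int r2, r4;

    static void processor0() {
        x = 1;       // mov [_x], 1  followed by mfence
        r2 = y;      // mov r2, [_y]
    }

    static void processor1() {
        y = 1;       // mov [_y], 1  followed by mfence
        r4 = x;      // mov r4, [_x]
    }

    // With plain (non-volatile) fields, r2 == 0 && r4 == 0 is allowed
    // on an x86; with volatile fields it is not.
}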

Gosh darn it! I want to optimize it!

Having said that, I must point out that, in this particular case, the lack of sequential consistency is not a deal breaker. It’s okay for thread 0 of the Peterson lock to read a stale (local) value of victim, even if “in the meanwhile” thread 1 overwrote it. All it means is that some unnecessary spins might be performed. This doesn’t violate the lock principles and doesn’t lead to a deadlock, as long as the new value of victim eventually reaches thread 0.

I’m pretty sure of this reasoning–at least at the time of this writing. However, I wouldn’t be totally surprised if somebody found a fault in it.

The bottom line is that even such an “obvious” optimization is not obviously correct. The proof of correctness of the Peterson lock is based on the assumption of sequential consistency. If we relax this assumption even slightly, we have to redo the whole proof.

Any volunteers?