How does Java do it? Motivation for C++ programmers
Things like the double-checked locking pattern and the Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse Java volatile with C++ volatile–the latter has nothing to do with multithreading.) Obviously the Java compiler knows what kind of fences are necessary on relaxed memory architectures, including the x86. I thought it was worth looking at.
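For reference, here is a minimal sketch of double-checked locking that is correct in Java precisely because the shared field is volatile (the class and field names are just for illustration):

```java
// A sketch of double-checked locking in Java. It is correct only because
// 'instance' is volatile; with a plain field another thread could observe
// a non-null reference to a not-yet-fully-constructed object.
class Singleton {
    private static volatile Singleton instance;

    static Singleton getInstance() {
        Singleton local = instance;        // first check, no lock
        if (local == null) {
            synchronized (Singleton.class) {
                local = instance;          // second check, under the lock
                if (local == null) {
                    instance = local = new Singleton();
                }
            }
        }
        return local;
    }
}
```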
Why the sudden interest in Java? Because the Java memory model is very relevant to C++0x. The keyword here is sequential consistency. Java enforces sequential consistency on all accesses to volatile variables. C++0x introduces atomic objects which, by default, also follow sequential consistency. So C++ atomics will, by default, behave almost exactly like Java volatile variables.
Also, why even bother studying the quirks of the x86? Can’t we just stick to locks and, occasionally, to default atomics and let the compiler/library writers worry about how they translate into fences? Absolutely!
This should be the end of the story, except that a lot of programmers seem to “know better.” They will try to optimize multithreaded algorithms and get into a world of trouble. So if you know anybody who might be tempted to write or optimize lock-free algorithms, read on.
Why the x86? Not only because it’s the most common chip, but also because its memory model is deceptively “normal.” It’s much more likely for programmers to attempt to play games with the x86 than, for instance, with the alpha. The rest of this post should serve as motivation to stay away from such games.
What is Sequential Consistency?
Sequential consistency is how we naturally think about multithreaded programs. It’s also how we think about the world. If A happens before B then it cannot be true that B happens before A. If one processor stores 1 in variable x, and another processor stores 1 in variable y, then the sequence of events is either:
- x flips to 1 and then y flips to 1, or
- y flips to 1 and then x flips to 1
(assume both are initially zero). Which sequence actually happened could be observed by a third processor, which loads both variables. If, for instance, it sees x == 1 and y == 0, it concludes that the write to x happened earlier. If it sees x == 0 and y == 1, it concludes that the write to y happened earlier (if it sees x == y, it can’t tell).
Now, in a sequentially consistent world, a fourth processor could not see the two events in a different order than what the third processor observed. We assume that, for each particular run of the program, there is a global sequence of events that is seen by all processors.
Of course, on multicore processors many things can happen simultaneously, and it usually doesn’t matter–except when memory access is involved. The sequentially consistent model assumes that there is a single switch between all processors and memory, and that only one processor at a time can access it. This imaginary switch serves as the serialization point.
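As a sketch in Java (where, as mentioned above, volatile accesses are sequentially consistent), the example might look like this; the class and thread structure are just for illustration:

```java
// A sketch of the example above: two writers, one observer. Because x and y
// are volatile, all threads must agree on a single order of the two writes.
class WhoWroteFirst {
    static volatile int x = 0;
    static volatile int y = 0;

    public static void main(String[] args) {
        new Thread(() -> x = 1).start();    // first processor
        new Thread(() -> y = 1).start();    // second processor
        new Thread(() -> {                  // third processor, the observer
            int rx = x;
            int ry = y;
            if (rx == 1 && ry == 0)      System.out.println("x flipped first");
            else if (rx == 0 && ry == 1) System.out.println("y flipped first");
            else                         System.out.println("can't tell");
        }).start();
    }
}
```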
The x86 is not sequentially consistent
Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.
The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive–hundreds of cycles.
I don’t care, says Java
The Java memory model requires that all access to shared volatile variables be sequentially consistent across all threads, even when they run on different cores. How do they enforce it?
The primary source for this kind of information is Doug Lea’s excellent The JSR-133 Cookbook for Compiler Writers.
Here’s the big picture: When Java is translated into bytecode, special codes are issued around volatile-variable access. These codes tell the Java runtime where memory fences should be inserted. The bytecodes are supposed to be processor independent; the fences are highly processor dependent. I’ll call the special bytecodes memory barriers as opposed to processor-specific memory fences. Memory barriers are inserted conservatively–the compiler assumes the most relaxed memory model (in other words, the bytecode has to run correctly on an alpha, whose memory model is so relaxed you’d think it was conceived in a hot tub).
Fences should go between memory accesses, so Java barriers are named after the kind of accesses (load or store) they separate. There is a LoadLoad barrier between volatile reads, LoadStore barrier between a read and a write, StoreLoad barrier between a write and a read, and StoreStore barrier between writes.
Here’s the first problem: the compiler can often only see one part of the pair. In that case it has to assume the worst and use the most restrictive barrier. Considering that, here are some of the cookbook rules:
- Issue a StoreStore barrier before each volatile store.
- Issue a StoreLoad barrier after each volatile store.
- Issue LoadLoad and LoadStore barriers after each volatile load.
That’s a lot of barriers! Fortunately, the compiler can optimize some of them away using dataflow analysis.
The next step is to run the bytecode on a particular processor (or compile it to native code). If the processor is an alpha, practically all Java barriers translate into some kind of fences. But if it’s an x86, all but the StoreLoad barriers are turned into no-ops. The StoreLoad barrier is either translated into an mfence or a locked instruction (lock xchg).
When executed on an x86, the barrier rules boil down to this one: issue an mfence after each volatile store (or turn the store into a lock xchg).
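As a sketch (class and field names are just for illustration), this is where those barriers conceptually land around volatile accesses; the programmer never writes them–the JVM emits the corresponding fences:

```java
// A sketch of where the cookbook barriers conceptually go. The JVM emits
// them; on an x86 only the StoreLoad barrier survives as a real fence.
class Flag {
    int data;                  // ordinary field
    volatile boolean ready;    // volatile field

    void writer() {
        data = 42;
        // StoreStore barrier before the volatile store (no-op on x86)
        ready = true;          // volatile store
        // StoreLoad barrier after the volatile store (mfence or lock xchg on x86)
    }

    void reader() {
        boolean r = ready;     // volatile load
        // LoadLoad and LoadStore barriers after the volatile load (no-ops on x86)
        if (r) {
            int d = data;      // guaranteed to see 42 whenever r is true
        }
    }
}
```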
Peterson lock on an x86 in Java
Figuring out the details of the Peterson lock in Java was not easy. I’m grateful to Doug Lea for patiently explaining some of the gotchas to me. I am solely responsible for any mistakes, though.
In my previous post I came to the conclusion that for Peterson lock to work on an x86, an mfence is needed. Here’s the relevant pseudo-code:
| Thread 0 | Thread 1 |
|---|---|
| store(zeroWants, 1) | store(oneWants, 1) |
| mfence | mfence |
| store(victim, 0) | store(victim, 1) |
| // Java puts another fence here | // Java puts another fence here |
| r0 = load(oneWants) | r0 = load(zeroWants) |
| r1 = load(victim) | r1 = load(victim) |
In Java, all three variables, zeroWants, oneWants, and victim, would be declared volatile. Using the x86 Java translation rules, that would mean putting an mfence after every store to these variables.
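As a sketch, this is roughly what the lock would look like in Java for two threads (the class layout is just for illustration; the field names follow the pseudo-code):

```java
// A sketch of the Peterson lock in Java for two threads (ids 0 and 1).
// All shared fields are volatile, which is what makes the x86 translation
// insert the mfences discussed here.
class PetersonLock {
    private volatile boolean zeroWants = false;
    private volatile boolean oneWants  = false;
    private volatile int victim = 0;

    void lock(int id) {                    // id is 0 or 1
        if (id == 0) {
            zeroWants = true;              // store(zeroWants, 1), fence follows
            victim = 0;                    // store(victim, 0), fence follows
            while (oneWants && victim == 0) { /* spin */ }
        } else {
            oneWants = true;               // store(oneWants, 1), fence follows
            victim = 1;                    // store(victim, 1), fence follows
            while (zeroWants && victim == 1) { /* spin */ }
        }
    }

    void unlock(int id) {
        if (id == 0) zeroWants = false; else oneWants = false;
    }
}
```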
The fences after the writes to zeroWants and oneWants are just what the doctor ordered. From my previous post we know why they are needed on an x86.
The mfence after the write to victim is something new, though. It isn’t strictly necessary on an x86, but to see that you really have to analyze the whole algorithm–something that’s beyond the capabilities of modern compilers.
My original (turns out, incorrect) reasoning was that no fence is needed between the write to victim and the subsequent read because, on an x86, reads and writes to the same location are never reordered. Well, that’s not the whole story because…
Intra-processor forwarding is allowed
Consider the following example from the x86 spec. Initially x == y == 0.
| Processor 0 | Processor 1 |
|---|---|
| mov [_x], 1 | mov [_y], 1 |
| mov r1, [_x] | mov r3, [_y] |
| mov r2, [_y] | mov r4, [_x] |
The result r2 == 0 and r4 == 0 is allowed.
Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.
If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what–sometimes it doesn’t.
The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though they see their own writes immediately.
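As a sketch, here is the same litmus test written in Java with plain (non-volatile) fields; in that form the outcome r2 == 0 and r4 == 0 is still allowed (running it in a loop may even exhibit it), and declaring x and y volatile is exactly what rules it out, as explained next.

```java
// A sketch of the store-buffer litmus test in Java. With plain fields the
// outcome r2 == 0 && r4 == 0 is allowed; making x and y volatile forbids it.
class StoreBuffer {
    static int x = 0, y = 0;
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws InterruptedException {
        Thread t0 = new Thread(() -> { x = 1; r1 = x; r2 = y; });
        Thread t1 = new Thread(() -> { y = 1; r3 = y; r4 = x; });
        t0.start(); t1.start();
        t0.join();  t1.join();
        System.out.println("r1=" + r1 + " r2=" + r2 + " r3=" + r3 + " r4=" + r4);
    }
}
```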
This result violates sequential consistency and therefore cannot be tolerated in Java. Hence the need for the mfence between the write to victim and the subsequent read of victim.
Gosh darn it! I want to optimize it!
Having said that, I must point out that, in this particular case, the lack of sequential consistency is not a deal breaker. It’s okay for thread 0 of the Peterson lock to read a stale (local) value of victim, even if “in the meanwhile” thread 1 overwrote it. All it means is that some unnecessary spins might be performed. This doesn’t violate the lock principles and doesn’t lead to a deadlock, as long as the new value of victim eventually reaches thread 0.
I’m pretty sure of this reasoning–at least at the time of this writing. However, I wouldn’t be totally surprised if somebody found a fault in it.
The bottom line is that even such an “obvious” optimization is not obviously correct. The proof of correctness of the Peterson lock is based on the assumption of sequential consistency. If we relax this assumption even slightly, we have to redo the whole proof.
Any volunteers?
November 13, 2008 at 1:02 pm
Well, one can stay away from such games if it’s OK for him to have a 100x performance degradation.
As an example one can take a look on this hash map implementation:
http://groups.google.com/group/lock-free/browse_frm/thread/a826061a4af06dc0
It’s 50x faster than Intel Threading Building Blocks’ concurrent_hash_map already on a quad-core. TBB’s hash map indeed uses ‘sequential consistency’ to simplify things a bit.
Also one can use tools like Relacy Race Detector:
http://groups.google.com/group/relacy
to kick $hit from algorithms like Peterson’s mutex etc. Relacy Race Detector models nearly 100% of C++0x’s extremely relaxed memory model, so it can compromise an algorithm which fails only on, for example, platforms without Total Store Order (I believe most current x86s are actually TSO).
November 13, 2008 at 5:29 pm
Dmitriy, I have come across your posts in several newsgroups. Let’s face it, you are one of the very few specialists in this area who know what they are talking about. Please tell the readers how many years you have dedicated to studying concurrency and reading and discussing processor specs. I will then say: Anybody with that much experience feel free to play games with memory fences.
November 14, 2008 at 5:00 am
🙂
I think it was 3 or 4 years of just reading/studying/thinking, and only about 1.5 years of active discussions.
Definitely it’s extremely hard stuff… at the beginning. I just wanted to show what it’s all about. I.e. it’s hard, but what one can get if he masters it?
And another interesting consequence is that such low-level details as memory ordering, cache behavior, etc also can give many insights into such things as systems architecture, distributed systems, etc.
November 28, 2008 at 6:13 am
The x86 has one of the least-relaxed memory models of common processors. The POWER memory model and Sun SPARC RMO memory model give the processor(s) much more freedom to reorder operations, and this freedom is reflected in the C++0x memory model.
There are some examples in my recent blog entry on the C++0x memory model:
http://www.justsoftwaresolutions.co.uk/threading/memory_models_and_synchronization.html
December 1, 2008 at 3:01 pm
[…] are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just the matter of skipping all […]
December 13, 2010 at 9:34 pm
Yeah, I know I’m years late with this post… so sue me.
I think all this stuff gets a lot harder to think about because of the language of “reordering.” The processor vendors have to use that language so people can write low-level primitives in assembly language, but in general, you can’t nail down from your C++ code in what order instructions will be emitted, so the whole idea of being confused by the processor reordering them after they’re emitted seems incongruous to me: you never knew what order they’d happen in to begin with. The only guarantees the language gives you about ordering are among events separated by sequence points as they appear to a single thread, or when there’s a synchronization operation in play.
Now, if you’re the Java designers, you apparently think it’s a good idea to sacrifice performance to get less-“surprising” behavior across threads. IMO, however, even with those extra guarantees, to understand what happens across threads without explicit locking you need to be at least sophisticated enough to understand the world where the guarantees are less strict.
October 16, 2011 at 4:06 am
This stuff is fun and all, but isn’t it a big waste of time doing this sort of nano-optimization?
If you find yourself working out how to implement parallelism this fine-grained, you are most likely suffering from a pathological case of premature optimization. Real optimization would be working out how to fix your design so your CPU isn’t blasting through some crazy lock-free stuff 10,000,000 times per second, right? If the CPU *isn’t* doing that, then you absolutely are wasting time trying to write lock free code for it. Maybe restructure your design/algorithms so you don’t have to solve horrible concurrency problems.
I’ve realized that fine-grained parallelism beyond a lock free stack is mostly a massive waste of time. Thread management, like returning threads to the thread pool ASAP, and spreading the load across a few locks (and a few separate containers for example) instead of bottlenecking on one, will go miles with perf gain without spending a week figuring out how I’m going to do something without a lock.
October 16, 2011 at 12:47 pm
@Doug: I agree with you: In 99% of the cases there are other, better ways of improving the performance of your program. However, this kind of low-level optimization is used in systems programming, and it makes a big difference.
There is a niche for lock-free algorithms, and it’s not in protecting massively contested resources (the 10 mln times per second scenario you mentioned). Lock-free algorithms are by nature optimistic. They perform well in low-contention scenarios and are often inferior to locks in highly contested scenarios.
November 18, 2011 at 4:34 am
Hi,
I would have a question more general about the problem you’re pointing out here, and the way x86 CPUs ensure cache coherency. I don’t have any precise knowledge about how CPUs work, but I’m trying to understand, so do not hesitate to correct me if I’m wrong.
From what I’ve read, modern x86 CPUs use cache-coherency algorithms such as MESI (http://en.wikipedia.org/wiki/MESI_protocol) to ensure coherent read/writes on different cores.
You say:
> The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processor to see each other’s writes delayed, even though they see their own writes immediately.
The thing is that, according to the MESI protocol, any core that has done a write to a cache line has it in the Modified state, and should snoop any read request from any other core for this cache line and provide it with the modified value. Do the MESI protocol and cache-coherent systems ensure that we will never get the result r2 == 0 and r4 == 0?
Is the propagation time you are talking about something that is still possible with cache-coherency protocols, or something that only happens on systems that do not ensure cache coherency? Am I wrong in thinking that x86 CPUs all use MESI and ensure cache coherency?
November 19, 2011 at 3:20 pm
What MESI guarantees is that all writes to a given location appear in the same order to all processors. But notice that examples of weak ordering always involve more than one location. MESI doesn’t order writes to different locations.
For the x86 the writes might be modeled using TSO, which makes it look as if the writes were buffered; this is not necessarily true at the transistor level. It is just a convenient model that lets you reason about program execution.
November 19, 2011 at 4:18 pm
Alright, if I understand correctly and after having read about that a little bit more, the cache coherency problem comes after the phenomenon you are talking about here.
It is the use of write buffers in processors that is causing the race condition here:
- P0 may have issued a (store x, 1) operation that is still in its write buffer, then loaded x directly from this buffer, and y from its cache (== 0).
- then, when P1 executes, it does the same with y, but when it loads x from the cache it gets 0, since even on P0 the cache still holds 0 as the value of x (assuming P0’s write buffer hasn’t been flushed yet), and even the MESI snooping mechanism can’t do anything about it.
Is that right?
November 19, 2011 at 4:50 pm
The write-buffer/cache interaction is quite complex, depending on whether the write is a cache hit or a cache miss, but the general idea is correct.
If you’re interested in learning more, there is a book by Jim Handy: The Cache Memory Book, http://www.amazon.com/Memory-Second-Kaufmann-Computer-Architecture/dp/0123229804 .
November 19, 2011 at 5:04 pm
I wasn’t aware of such store buffers before (and I presume there are others for loads), and I can start to foresee what issues this leads to, regarding to the outcome of a multithreaded scenario.
I’m used to more high level programming, but I think this is an interesting and important subject and high performance multithreading won’t go without a clear understanding of how things are executed by the processor. Thanks, and keep up the good work!
February 7, 2012 at 1:53 am
[…] like this one and this one raises concerns about sequential consistency (SC) even after JDK 1.5 but both concludes that […]
April 13, 2012 at 2:56 pm
Hi Bartosz,
A minor question – You say that “When Java is translated into bytecode, special codes are issued around volatile-variable access”. Is this really true? I couldn’t find any bytecode that looked like a StoreStore, for example, when I looked here http://en.wikipedia.org/wiki/Java_bytecode_instruction_listings, so I assumed these fences are only seen inside the JVM (as it should know from the info in the class file when it’s accessing volatile fields).
Thanks
April 13, 2012 at 4:15 pm
I think you’re right: the JVM knows which loads/stores require fences from the type of the variable.
April 7, 2013 at 11:22 pm
[…] the authors of the C++ specification have followed Java’s lead, and defined the language to be sequentially consistent for data race free programs – sequential consistency meaning that the result of running code […]
October 13, 2015 at 10:09 am
[…] [8] Bartosz Milewski. Who ordered sequential consistency? […]