Multiprocessors with relaxed memory models can be very confusing. Writes can be seen out of order, reads can be speculative and return values from the future–what a mess! In order to impose some kind of consistency you have to use memory fences, and there are several kinds of them.
The x86 seems to be an oasis in the perilous landscape of relaxed memory multicores. The x86 memory model, detailed in the Intel 64 Architecture Memory Ordering White Paper and in the AMD64 Architecture Programmer's Manual, gives a long list of memory ordering guarantees, among them:
- Loads are not reordered with other loads.
- Stores are not reordered with other stores.
- Stores are not reordered with older loads.
- In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
- In a multiprocessor system, stores to the same location have a total order.
- In a multiprocessor system, locked instructions have a total order.
- Loads and stores are not reordered with locked instructions.
The x86 also has (expensive–on the order of 100 cycles) memory fence instructions, mfence, lfence, and sfence; but, considering all those guarantees, why would anyone use them? The famous double-checked locking pattern, on the x86, works just fine without any fences.
This is a very important question, because you don’t want the compiler to over-fence your code and bring it to a crawl. On the other hand, you don’t want it to emit incorrect code. I decided to look for some answers.
There is one important non-guarantee listed in the x86 spec:
Loads may be reordered with older stores to different locations.
Here's an example from the x86 spec: x and y are in shared memory, both initialized to zero; r1 and r2 are processor registers.

Thread 0 | Thread 1
---|---
mov [_x], 1 | mov [_y], 1
mov r1, [_y] | mov r2, [_x]
When the two threads execute on different cores, the non-intuitive result r1 == 0 and r2 == 0 is allowed. Notice that such a result is consistent with the processor executing the loads before the stores (which access different locations).
How important is this example?
I needed to find an algorithm that would be broken by this relaxation. I’ve seen Dekker’s algorithm mentioned in this context. There is a more modern version of it used in the Peterson lock. It’s a mutual exclusion algorithm that works for two threads only. It assumes that each thread has access to a thread-local ID, which is 0 for one thread and 1 for the other. Here’s the version adapted from The Art of Multiprocessor Programming:
class Peterson {
private:
    // indexed by thread ID, 0 or 1
    bool _interested[2];
    // who's yielding priority?
    int _victim;
public:
    Peterson() {
        _victim = 0;
        _interested[0] = false;
        _interested[1] = false;
    }
    void lock() {
        // threadID is thread local,
        // initialized either to zero or one
        int me = threadID;
        int he = 1 - me;
        _interested[me] = true;
        _victim = me;
        while (_interested[he] && _victim == me)
            continue;
    }
    void unlock() {
        int me = threadID;
        _interested[me] = false;
    }
};
To explain how it works, let me impersonate one of the threads. When I (the thread) want to take a lock, I set my _interested slot to true. I am the only writer of this slot, although the other guy can read it. I also toggle the _victim switch to point to myself. Now I check if the other guy is also interested. As long as he is, and I am the victim, I spin. But once he becomes uninterested or turns himself into a victim, the lock is mine. When I’m done executing critical code, I reset my _interested slot, thus potentially releasing the other guy.
Let me simplify this a little, and partition the code between the two threads. Instead of an array _interested, there are two variables zeroWants and oneWants corresponding to the two slots.
zeroWants = false; oneWants = false; victim = 0;
Thread 0 | Thread 1
---|---
zeroWants = true; | oneWants = true;
victim = 0; | victim = 1;
while (oneWants && victim == 0) continue; | while (zeroWants && victim == 1) continue;
// critical code | // critical code
zeroWants = false; | oneWants = false;
Finally, let me rewrite the initial part of the execution in pseudo assembly.
Thread 0 | Thread 1
---|---
store(zeroWants, 1) | store(oneWants, 1)
store(victim, 0) | store(victim, 1)
r0 = load(oneWants) | r0 = load(zeroWants)
r1 = load(victim) | r1 = load(victim)
Now look at the loads and stores to zeroWants and oneWants. They follow the same pattern as in the x86 reordering example. The processor is free to move the read of oneWants before the write to zeroWants (and to victim). Similarly, it can move the read of zeroWants before the write to oneWants. We may end up with the following execution:
Thread 0 | Thread 1
---|---
r0 = load(oneWants) | r0 = load(zeroWants)
store(zeroWants, 1) | store(oneWants, 1)
store(victim, 0) | store(victim, 1)
r1 = load(victim) | r1 = load(victim)
Initially, both zeroWants and oneWants are zero, so the load in each thread may return zero: r0 ends up zero in both threads. In that case, neither spin loop is ever entered and both threads march into the critical section. The Peterson lock on the x86 is broken!
Of course, there is a way to fix it. An mfence will force the correct ordering. It can be put anywhere between the store of zeroWants and the load of oneWants in one thread, and between the store of oneWants and the load of zeroWants in the other thread.
So here we have a real-life example that requires an actual fence on an x86. The problem is real!
November 5, 2008 at 5:33 pm
> In a multiprocessor system, stores to the same location have a total order.
Does that mean that stores to different addresses can happen at the same time? BTW, how do you define a ‘total order’? There is certainly some (arbitrary) total order for any of the operations happening in a real computer.
Anyway, interesting post.
November 5, 2008 at 6:40 pm
Interesting article, as always.
Since fences are so expensive, couldn’t you add a dummy assembly instruction that prevents the X86 from re-ordering? So the pseudo assembly for thread 0 might become:
store(zeroWants, 1)
store(victim, 0)
r0 = load(victim) ; NEW DUMMY INSTRUCTION
r0 = load(oneWants)
r1 = load(victim)
Since the dummy instruction is a load, the X86 can’t reorder the other loads before it. Also, since it loads from “victim”, the X86 can’t move it before the store to “victim”.
If you do this to both threads, does that solve the problem?
November 5, 2008 at 8:11 pm
Typo:
oneWants = true;
victim = 1;
// this should be …victim == 1
while (zeroWants && victim == 0)
continue;
// critical code
oneWants = false;
November 5, 2008 at 9:04 pm
To ondra: (I have deleted my first response because it was wrong!). Yes, stores to different locations can happen simultaneously. They might be seen in different order on different processors.
Total order means that for any two events you can tell which one happened before the other in a given execution. So, for instance, if processor 0 stores value a and processor 1 stores value b in the same location, the other processors will either all see a happening before b, or all see b happening before a. This is not guaranteed if the two locations are different.
November 5, 2008 at 9:22 pm
To Ross C: This is a tough question. On the face of it, it should work, but I wouldn’t bet on it. There are special rules about visibility of stores. They become immediately visible on the same processor, but they don’t propagate immediately to other processors.
Intel “guarantees” are more rules of thumb than strict theorems. The AMD spec, for instance, doesn’t mention those guarantees. So the only thing you should believe is code samples they publish. This is not only my opinion.
November 5, 2008 at 9:27 pm
To dhasenan: Fixed, thank you. (Another victim of copy and paste)
November 6, 2008 at 3:09 am
Ross C: I have experimented a lot with this kind of “soft” memory barriers. I’m not quite sure about your specific example right now but it looks like it should work. The problem is that these are just as slow as “real” barriers, so we don’t need to do tricky things.
November 6, 2008 at 5:06 pm
This makes sense. The cost of a memory barrier is totally dominated by the cost of rearranging memory accesses across cores. If the presence of a dummy instruction has the same net effect, it must cause a similar rearranging of memory accesses. All these ordering guarantees on the x86 don't come free.
November 7, 2008 at 5:04 pm
[…] Who ordered memory fences on an x86? Multiprocessors with relaxed memory models can be very confusing. Writes can be seen out of order, reads can be […]
November 10, 2008 at 2:16 am
This is yet another blog entry that reinforces my belief that Michael Jackson was right:
“The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization (for experts only!): Don’t do it yet.”
a) Focusing on x86 only prevents your code from being moved onto something else (like the ARM in your mobile phone).
b) Dekker's and Peterson's algorithms are irrelevant in practice, as multiprocessor systems have much more effective ways of enforcing mutual exclusion (spin locks using processor primitives not available to standard C).
The whole point of well-designed scheduling libraries (e.g., mutex/condition variables in the monitor tradition of MESA, Modula-3, and Pthreads) is that they can hide the processor-dependent part of synchronization in an efficient way — i.e., uncontested locks/empty condition variables queues don’t require a user-space/kernel transition.
Don’t try to improve on this unless you really know what you are doing *AND* you know that this part of your code is really slowing you down *AND* you are willing to give up portability of your code (but then at least provide a portable fall back).
November 10, 2008 at 12:04 pm
I couldn’t agree with you more. I am very far from encouraging low-level programming and hand optimization of concurrency. Just the opposite! My goal is to show that even the simplest multiprocessor, the x86, is fraught with concurrency perils and trying to ignore them or outsmart it is a way to disaster.
Wait for my next post where I explain this.
November 11, 2008 at 6:04 pm
[…] like double-checked locking pattern and Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse […]
November 13, 2008 at 1:20 pm
To Ross C:
Re: couldn’t you add a dummy assembly instruction…
No, we can't. Mutual exclusion on the x86 is not possible without a store-load fence (mfence), no matter how many instructions one adds, and how industriously we try to cheat the hardware. mfence: mutual exclusion. No mfence: no mutual exclusion.
It's not that mutual exclusion requires mfence and mfence just happens to be heavy. It's that it is impossible to achieve consensus (mutual exclusion) in a distributed system (a multi-core chip) in a lightweight way. It just has to be heavyweight.
November 13, 2008 at 4:43 pm
To Dmitriy: Are you talking about the theorem that a mutual exclusion algorithm cannot be built solely from reads and writes? But that theorem is only true when the number of threads is unbounded (and storage is finite). The Peterson lock has to work only for two threads.
You might be right that, on an x86, you can't get away without an mfence (or equivalent). That was my intuition too, but I don't have a proof. And Ross makes a good point–adding a dummy instruction might not be lightweight. It might be equivalent to an mfence and give you the same performance degradation. After all, the x86 must use the moral equivalent of acquire and release fences to enforce processor ordering. I would really like to know more about it.
November 21, 2008 at 12:56 pm
I am talking about the fact that mutual exclusion can’t be lightweight. It can be built with stores and loads (Peterson’s algo), but with heavy mfences in between.
I believe that 'traditional' mutexes can't be built without mfence or atomic RMW (which has a built-in mfence anyway). But it's possible to do interesting things, like moving all the overhead from readers to writers in a reader-writer mutex, i.e., readers are entirely free of mfences or atomic RMWs. Here is an example:
http://groups.google.com/group/lock-free/browse_frm/thread/1efdc652571c6137
November 21, 2008 at 4:00 pm
Interesting link. It seems like you can remove a fence from one thread (the fast thread), but only with the help of the operating system: interrupts or page faults.
December 1, 2008 at 11:34 am
[…] studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work […]
May 19, 2009 at 5:59 am
I am confused: Microsoft says memory alignment is required for InterlockedExchange; however, Intel documentation says that memory alignment is not required for LOCK. Am I missing something, or what is the problem?
Thanks
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads InterlockedExchange
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from Intel Software Developer’s Manual;
LOCK instruction Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
- Any boundary for an 8-bit access (locked or otherwise).
- 16-bit boundary for locked word accesses.
- 32-bit boundary for locked doubleword accesses.
- 64-bit boundary for locked quadword accesses.
August 22, 2009 at 6:30 pm
Hi,
You said:
Loads are not reordered with other loads.
Stores are not reordered with other stores.
Stores are not reordered with older loads.
But what you said is different from the link said:
http://www.linuxjournal.com/article/8211
Thanks.
-Yuan
August 23, 2009 at 2:32 pm
I got my information directly from the Intel spec. There are some caveats as far as write/write reordering goes, but I’m not sure they are relevant outside of low-level systems programming. Here’s the exact quote:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with the exception of
writes executed with the CLFLUSH instruction,
streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD),
string operations (see Section 8.2.4.1).
September 11, 2009 at 3:36 pm
[…] CPU의 경우 이러한 문제점을 방지하기 위한 몇가지 규칙을 가지고 있는데 Who ordered memory fences on an x86?에 […]
October 22, 2009 at 3:43 am
Hi Bartosz,
thank you very much for this great article! I knew about memory visibility problem, but I was looking for a concrete example that shows it on mainstream processor architectures like x86.
Based on the material of your article, I myself derived another example to stress the importance of using mutex. The article can be found at:
http://www.domaigne.com/blog/computing/mutex-and-memory-visibility/
October 22, 2009 at 3:44 am
[…] Bartosz Milewski. Who ordered memory fences on an x86?. Bartosz’s blog programming cafe has very interesting articles about thread programming, […]
December 4, 2009 at 7:55 pm
Somewhat late to the party, but I believe I can explain why this reasoning was flawed:
“store(zeroWants, 1)
store(victim, 0)
r0 = load(victim) ; NEW DUMMY INSTRUCTION
r0 = load(oneWants)
r1 = load(victim)
Since the dummy instruction is a load, the X86 can’t reorder the other loads before it. Also, since it loads from “victim”, the X86 can’t move it before the store to “victim”.”
The dummy instruction will not serve the intended purpose because it will retire quickly due to intra-processor (store-to-load) forwarding, while the store of 'victim' is still in the store buffer. After that, as far as the issuing CPU is concerned, the pipeline becomes
store(zeroWants, 1)
store(victim, 0)
r0 = load(oneWants)
r1 = load(victim)
and the CPU will not be inconsistent if it decided to reorder 2nd and 3rd lines. I.e. the dummy instruction accomplishes nothing.
February 25, 2010 at 10:38 am
This may be a dumb question, but I haven’t been able to find the answer anywhere else – x86 provides these guarantees, but is it still possible for the compiler to reorder, violating these guarantees? I’m specifically interested in gcc – does any reordering gcc may do uphold these guarantees? If not, is there any way to force the compiler to not reorder but still not use a fence?
Thanks.
February 25, 2010 at 10:56 am
You are right, the compiler is free to make such re-orderings. C++0x (the new standard, which is partially implemented in latest releases of gcc) gives you more control over reorderings. You have to use atomics, but the compiler should be smart enough not to insert fences when processor guarantees are strong enough. In that case atomics just turn off particular reorderings.
However, to really understand the consequences of using the weak atomics (the ones that don’t guarantee sequential consistency), even a Ph.D. in CS is often not enough 😉
April 8, 2010 at 1:05 am
Ross: One option to avoid a full memory barrier is to use atomic operations for the stores and/or loads (but mind diminishing returns).
Davig: No, but there are serious performance implications for this. However, cmpxchg{8,16}b do require {8,16}-byte alignment.
Adam: Yes, the compiler may violate the multi-processor guarantees (think of aliasing). It is even possible that your stores and/or loads never actually hit memory! The compiler is free to schedule instructions as it pleases as long as sequential consistency is maintained. Careful use of the volatile qualifier can avoid this (but be wary, as not all compilers will respect the vague specification of volatile). This has some nasty performance implications for volatile objects and is still a gamble. The other option is to make use of inline assembly for stores and loads (atomic-ptr, atomic-ops and OS-specific atomic interfaces provide these); this is a good fail-safe.
April 8, 2010 at 1:12 am
Adam: Also, it is possible to cast to volatile-qualified type, see Linux kernel’s ACCESS_ONCE for an example interface (linux/compiler.h). I avoided this because I saw no reference in GCC documentation guaranteeing atomicity of volatile loads and stores.
May 23, 2010 at 11:08 am
Adam: I believe that the following will prevent compiler reordering in gcc:
__asm__ __volatile__("" : : : "memory");
According to wikipedia:
this directive forbids the GCC compiler from reordering read and write commands around it.
December 13, 2011 at 6:41 am
This problem can be easily solved by using volatile variables. That would stop the compiler from reordering this code. Not a good case for memory fences if you ask me.
December 13, 2011 at 10:26 am
@Judas: In Java, yes. But not in C++ or assembly.
January 10, 2012 at 6:23 am
[…] Who ordered memory fences on an x86 (Bartosz Milewski); speziell hier werden die Folgen für den Peterson-Algorithmus angesprochen […]
February 26, 2012 at 9:56 am
[…] Peterson’s algorithm is like a spinlock for two threads. A neat trick, but largely useless on today’s platforms. I find it noteworthy because Bartosz Milewski used this algorithm as a case study while discussing the finer points of the x86 memory model. […]
May 17, 2013 at 1:09 pm
@GregOlson and @JudasPriest The discussion pertains _specifically_ to x86 CPU architecture. Compilers, let alone volatile, have nothing to do with that.
The point is that the actual CPU instructions loading/storing might get executed out-of-order by the actual hardware.
This is why a compiler will have to add fences in specific circumstances (even in the case of volatile variables), on multi-core (true concurrent) architectures.
Otherwise the result would be undefined/inconsistent regardless of whether the compiler honoured _its own_ promise to not reorder execution (at a higher level of abstraction).
May 17, 2013 at 1:53 pm
@GregOlson @JudasPriest http://preshing.com/20120515/memory-reordering-caught-in-the-act describes precisely the distinction, with largely the same example.
Be sure to check the source code:
October 15, 2013 at 6:58 am
[…] [2] https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ […]
October 31, 2014 at 11:01 am
So in the Peterson lock example given here, a SFENCE would have worked as well — is that right?
November 8, 2014 at 12:37 am
[…] статье рассказывает про перестановки операций LOAD/STORE на x86 […]
February 3, 2015 at 11:05 pm
@alexdowad: nope, an mfence is needed. Empirically, sfence seems to work on a Pentium DualCore E5200, but not on an Intel(R) Core(TM) i5-3470 CPU, which does work with an mfence. I believe there is some difference in how they implement sfence (store buffer drain? vs just preventing store reordering inside the store buffer?)
November 16, 2015 at 5:13 am
[…] Fences […]
March 6, 2016 at 10:47 am
Suppose that _interested[2] has type uint8_t and the read has been changed to *(uint16_t*)_interested. Intuition says that will prevent the CPU from reordering the reads, doesn't it?
March 6, 2016 at 12:30 pm
In my experience, intuition has nothing to do with reasoning about C++ concurrency.
March 6, 2016 at 1:30 pm
It seems that even if it stops the reordering, there's another issue:
“8.2.3.5 Intra-Processor Forwarding Is Allowed
The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors; specifically, each processor may perceive its own store occurring before that of the other.”
that will break the algorithm in the example.
p.s. it’s more about x86 concurrency itself, not C++.
August 29, 2016 at 10:16 pm
[…] To put it differently, a single threaded program will always see memory reads and writes in program order and only loads are reordered with stores to different locations. The following rules hold true (as laid out by Bartosz here): […]
July 20, 2017 at 4:25 pm
The implementation as described is subtly faulty because it victimises the wrong thread. After a thread signals its interest it must give the priority away (victim = other), not keep it for itself – then in the condition it should then test that the priority hasn’t been given back, i.e (interested[other] && victim == other)
September 26, 2017 at 3:57 pm
Here’s a question I haven’t been able to find a good answer to: C++11 defines memory fence operations that prevent the compiler from re-ordering memory accesses in a way that would break your program. On some processors, it may also be necessary to output a memory fence instruction to prevent the processor from doing a similar re-ordering. However, on other processors, like the x86, this would be unnecessary as the processor provides sufficient guarantees about memory access order. Is there any platform-independent memory fence instruction that would provide appropriate semantics to the compiler while also preventing it from outputting a potentially expensive memory fence instruction to the processor when it would be unnecessary?
September 27, 2017 at 11:56 am
The platform independent way is to use atomics, including relaxed memory atomics. The compiler for a specific platform will only output memory fence instructions that are strictly necessary to guarantee the correct semantics. If the processor provides stronger guarantees, like the x86, the compiler will not use unnecessary memory fences.
June 10, 2019 at 10:48 am
[…] Who ordered memory fences on an x86? (2008) 8 by luu | 2 comments on Hacker News. […]
June 10, 2019 at 11:18 am
[…] Source: bartoszmilewski.com […]