The question that’s been bugging me lately was: How does C++ make the use of atomic variables both portable and efficient?
I knew how Java volatile worked–it enforced sequential consistency, which is not always the most efficient thing to do.
C++0x has atomic variables which also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java’s volatile (interestingly, Java’s volatile doesn’t enforce atomicity–there is an atomic library in Java, though, which does).
However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.
After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, let’s say, the Alpha or the PowerPC, should produce the necessary fencing.
To make things more interesting, there are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just a matter of skipping all fencing.
I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?
Specifying memory ordering in C++
The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read so I’m grateful to Hans Boehm for clarifying some of the fine points.
Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:
atomic<bool> ready = false;
atomic<int> data = 0;
Thread 0:
data.store(1);
ready.store(true);
Thread 1:
if (ready.load())
    assert(data.load() == 1);
As you can see, atomic objects have methods store and load that are used for writing and reading underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate memory fences after each store.
But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?
To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:
atomic<bool> ready = false;
atomic<int> data = 0;
Thread 0:
data.store(1, memory_order_release);
ready.store(true, memory_order_release);
Thread 1:
if (ready.load(memory_order_acquire))
    assert(data.load(memory_order_acquire) == 1);
What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.
Warning: Even if you know that your program will only ever run on an x86, you can’t remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from doing the reordering.
Fine-tuning memory ordering
Memory order can be specified using the following enumeration:
namespace std {
    typedef enum memory_order {
        memory_order_relaxed,
        memory_order_consume,
        memory_order_acquire,
        memory_order_release,
        memory_order_acq_rel,
        memory_order_seq_cst
    } memory_order;
}
The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.
- memory_order_acquire: guarantees that subsequent loads are not moved before the current load or any preceding loads.
- memory_order_release: preceding stores are not moved past the current store or any subsequent stores.
- memory_order_acq_rel: combines the two previous guarantees.
- memory_order_consume: a potentially weaker form of memory_order_acquire that enforces ordering of the current load only with respect to operations that are data-dependent on it. For instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it (yes, even that is not guaranteed on all platforms!).
- memory_order_relaxed: all reorderings are okay.
As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).
But I have also shown before that Peterson’s algorithm won’t work on the x86 without fencing. So it’s not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.
Now here the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here’s what I originally wrote:
[begin quote] The correct C++ code in this case is:
Peterson::Peterson() {
    _victim.store(0, memory_order_release);
    _interested[0].store(false, memory_order_release);
    _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
    int me = threadID; // either 0 or 1
    int he = 1 - me;   // the other thread
    _interested[me].exchange(true, memory_order_acq_rel);
    _victim.store(me, memory_order_release);
    while (_interested[he].load(memory_order_acquire)
           && _victim.load(memory_order_acquire) == me)
        continue; // spin
}
The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]
You can read the comments to this post–especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the “acquire” part of memory_order_acq_rel has no meaning unless another thread writes to (“releases”) the same variable. Since only thread 0 writes to _interested[0] and only thread 1 writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy’s implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony’s proof of it).
Conclusion
What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison to that, the old fashioned assembly programming looks like a piece of cake.
Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.
Microsoft volatile
No, I’m not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true, because some compiler vendors took the liberty of adding non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). These new semantics, at least in the case of the Microsoft compilers, did not include sequential consistency. Instead they guaranteed acquire and release semantics (which, on the x86, don’t result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won’t! I thought that was an interesting piece of trivia.
December 1, 2008 at 2:38 pm
You probably meant “atomic<bool> ready” instead of “atomic<ready>”, etc. … Anyhow, I guess I won’t be using orderings other than the default one. It’s more voodoo than I can handle. 🙂
December 1, 2008 at 2:41 pm
“atomic<bool> ready” instead of atomic<ready>
December 1, 2008 at 3:02 pm
Fixed! Thanks.
December 2, 2008 at 4:40 am
You are right that if you want to order stores and loads then you need memory_order_acq_rel. Unfortunately, you can’t apply this to a store as you need a load for the acquire part, so rather than
“_interested[me].store(true, memory_order_acq_rel);”
you need to use a read-modify-write operation such as exchange:
“_interested[me].exchange(true, memory_order_acq_rel);”
Incidentally, I have a C++0x thread library for MSVC 2008 currently in beta at http://www.stdthread.co.uk which includes the atomic operations.
December 2, 2008 at 12:38 pm
Anthony, you’re absolutely right–my mistake. The draft standard states in 29.4.6 that acq_rel cannot be used with store. I’m correcting my post.
December 3, 2008 at 4:55 am
Memory ordering in your implementation of Peterson’s algo is both insufficient and excessive at the same time: it both permits races and contains unnecessary fences.
If you really want to get such things right, and really understand how it must be and why it must be that way, I strongly suggest using Relacy Race Detector (even peer reviews won’t help here); the distribution already contains several Peterson’s algo implementations.
Several correct implementations are possible, but here is my favorite:
void thread(unsigned index)
{
if (0 == index)
{
flag0($).store(1, rl::memory_order_relaxed);
turn($).exchange(1, rl::memory_order_acq_rel);
while (flag1($).load(rl::memory_order_acquire)
&& 1 == turn($).load(rl::memory_order_relaxed))
rl::yield($);
data($) = 1;
flag0($).store(0, rl::memory_order_release);
}
else
{
flag1($).store(1, rl::memory_order_relaxed);
turn($).exchange(0, rl::memory_order_acq_rel);
while (flag0($).load(rl::memory_order_acquire)
&& 0 == turn($).load(rl::memory_order_relaxed))
rl::yield($);
data($) = 2;
flag1($).store(0, rl::memory_order_release);
}
}
Btw, you can use relaxed stores in class constructors, because initially the object is not shared; the sharing/publication mechanism will provide all the necessary synchronization.
December 3, 2008 at 11:49 am
The only relevant difference that I can see between the two implementations is your use of relaxed ordering in a few places. That would suggest that my implementation is at least correct. Can you point out a specific error?
Whether it is overconstrained is not as black and white as one could think. I believe your code is correct for your particular implementation of atomics. As I discussed before, fences really should go _between_ memory operations. Atomics can only see a single operation at a time, so they have to decide whether to always put the fence before or after a given operation. Your implementation apparently puts the fence _before_ the exchange operation, so additional fencing of the preceding store operation is not necessary. But the Standard never specifies where the fences go, so I can’t rely on it in my implementation.
December 4, 2008 at 2:49 pm
> As I discussed before, fences really should go _between_ memory operations. Atomics can only see a single operation at a time, so they have to decide whether to always put the fence before or after a given operation.
That’s not true. Your code above need not have *any* fences on an x86, because the exchange is an XCHG instruction, which is serializing. See my post about the C++0x memory model on x86 CPUs. http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html
On some architectures, memory_order_acq_rel will require a fence both before AND after the instruction (possibly of different sorts).
> But the Standard never specifies where the fences go, so I can’t rely on it in my implementation.
The Standard doesn’t specify the implementation details, that is true. It does however specify the required ordering constraints in terms of relationships called “happens-before” and “synchronizes-with”.
Incidentally, if you feel that you can reason better with fences, you could write them explicitly: use memory_order_relaxed for all the loads and stores and then use explicit atomic_thread_fence() calls to put the fences in.
December 4, 2008 at 3:52 pm
I realize that, on the x86, the use of “lock xchg” is preferable for anything stronger than release/acquire–I mentioned that in my previous posts.
In general, however, you can’t make assumptions about how atomic operations translate into fences or locked instructions when you don’t know the underlying architecture. The only guarantees you have are, as you mentioned, the semantics of atomics in terms of happens-before and synchronizes-with relations. Notice, however, that very few people frame their arguments in those terms.
The obvious rules of discussing algorithms written in terms of atomics are: (1) it’s enough to show that it doesn’t work on one processor to prove its incorrectness, and (2) to prove its correctness you can’t use any processor-specific arguments. Of course that’s much harder!
So even though I don’t have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy’s implementation might also be correct, but it’s much harder to prove.
December 5, 2008 at 7:09 am
I have written a post about this over on my blog: http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
In summary, Dmitriy’s implementation is correct, yours is broken. The key is that he uses exchange on turn (_victim), whereas you use exchange on _interested[me] and a plain store on _victim. You need the store to _victim to be an exchange in order for everything to be guaranteed.
December 6, 2008 at 2:17 pm
I’m glad we are having this exchange. It shows how hard it is to reason about relaxed memory models. I don’t think Anthony’s proof is correct (we are exchanging email about it), but he raised some very good points. When all the dust settles, I’ll write a new blog post about proofs.
[edit] Correction, after a long exchange with experts, and giving up on Java-specific proof techniques, I am convinced now that both Anthony and Dmitriy were right. I’ll try to explain it in my next post.
December 24, 2008 at 11:36 am
> atomic<bool> ready = false;
> atomic<int> data = 0;
In Java only the ready flag has to be declared volatile; data can be a plain int.
http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
December 24, 2008 at 11:57 am
You’re right. If data changes only this once and then remains immutable, it doesn’t have to be atomic or volatile. Similarly, if data is subsequently mutated only under a lock.
May 24, 2012 at 12:06 am
[…] The best “Plain English” explanation I’ve found for the various memory orderings is Bartoz Milewski’s article on relaxed atomics: https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/ […]
September 14, 2012 at 8:21 am
You write:
Why include the additional barriers, marked by /**/, above?
I would think that memory_order_relaxed could be used in both since you don’t care about other loads and stores not already fenced by the acquire and release barriers on ready.
September 16, 2012 at 3:28 pm
Yes, you’re right. The easiest way to see it is by realizing that the variable ready with its acquire/release semantics behaves like a mutex. In this sense variable data is “protected” by the mutex and doesn’t even have to be atomic, if nobody else is accessing it.
June 19, 2013 at 7:01 am
Thank you! Best explanation of the topic.
November 28, 2013 at 1:09 am
You may have a serious performance problem using your array “_interested”: because of the MESI protocol, accesses to memory locations that are closer together than 128 bytes are fully serialized between L2 cores, by hardware.
December 10, 2016 at 2:30 pm
So according to the order definitions, to take care of ordering in the above publication-safety-pattern, all you need is
atomic<bool> ready = false;
int data = 0;
data = 1;
ready.store(true, memory_order_release);
if (ready.load(memory_order_acquire))
assert(data == 1);
correct?
April 20, 2018 at 2:32 am
Microsoft discourages using the extended semantics for “volatile”, by the way, they can be disabled with the “/volatile:iso” option, and they were never implemented on ARM. See: https://msdn.microsoft.com/en-us/library/12a04hfd.aspx
I like how you point out that it was a decision based on the needs at the time – when there wasn’t even a true memory model in the standard, nor any notion of threads. There is certainly no evidence of bad intent.
August 9, 2019 at 2:26 pm
just came across your post, very helpful.
however i got a question when i started reading Anthony Williams’s “C++ Concurrency in Action: Practical Multithreading”.
in his book, he has such an example:
where sync2 is supposed to execute after sync1 returns true.
how is that ensured? i remember you said that memory_order_release only ensures earlier stores finish, so the compiler could move it before sync1?
thanks
May 1, 2020 at 12:09 pm
[…] With the C++11 memory model, the programmer specifies the needed ordering constraints precisely. The compiler can then optimize the program very aggressively, as long as it meets those constraints. For example, acquire and release semantics (the basis of publication safety) will compile to explicit memory fence instructions on ARM, but to nothing on x86. A good writeup is the blog post C++ atomics and memory ordering. […]
August 2, 2021 at 10:40 pm
No doubt, millions of man hours poured into FIXING dependencies, ad infinitum .. template policies to replace elements by wrapper classes, dealing such abstraction as to collect strings and token positions in multiple states ..that issue amazingly solved .. “suggest downgrading until this is fixed, as having to maintain patches against generated code is usually more trouble than it’s worth ” How many million workarounds do we need to prove Hardware needs to trap more runtime error (or avoid such by better memory management ..at least for leakage /collisions /overflows) more to point, improve compiler workflow by relaxing complex nested structures , into hardware checked boundaries /write permissions / et al..
Burton Smith, sums it up nicely..(apart from lacking answer of language to code MS Office ..heh, heh, how about Cobol ?
https://www.cct.lsu.edu/~estrabd/LACSI2006/Smith.pdf (meawhile BlueScope Piccolo for IoT class 32bit is great, yet RISC also lacks basic hardware blocks that would greatly simplify million hour PER DAY coding resources, wasted today. Code without predictable behavior, no matter what language it’s written , esp C++ Unless using strictly defined atomic types and memory barriers (dying breed of coder)
https://stackoverflow.com/questions/6319146/c11-introduced-a-standardized-memory-model-what-does-it-mean-and-how-is-it-g
“…on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here…” Coders don’t know, or care, about wires, and tie themselves up in nested abstractions in Mutex land.
August 2, 2021 at 10:57 pm
Bart, linked thru your excellent Stack Overflow C++11 rant..
pls email, your input into several new hardware blocks into CPUs to keep track of shared memory, once we can gain a flat address partition without interference from that nasty HEAP. Intel has M controller, emulated by AMD, but not IBM.. At least the P8 is an available transactional mem platform, have 2 sitting over in the US, apart from 40 Gb backplane switches, for a Kraken style 30channel 1Gb/Sec correlator, to get away from pesky OS and hopeless VonNeumann Arch .. :- )
August 3, 2021 at 2:52 am
For the worth of BlueSpec SoC simulators, appears we have the tools to synthesize, any required hardware block (with its call instruction, hopefully)
jus thought to include the links below.. only use email or phone
Only prob , building this into a more primitive RISC compiler, as standards such as OpenMP would require capability of ensuring instructions hold fences / barriers / locks, transaction sync, in nested constructions. i may be too close to hardware, here.. Not having used the toolset, yet.
1st relaxed approach is dual core with its own workspace register set, as in emulation, DRAM is used for registers, and if profile timing is available it does not gain from the host processor/ cache/ or suffer from spin locks.
When we move a register set by indexing a workspace pointer, there is no need to load/ store, on interrupts, or Kernel / User switch (ghastly.. we make the kernel aware, as being done in Rust !! who has extensive kernel nous?
only want to add some clever features to the core (some used in commercial design to avoid lengthy routine delays.. even interprocess communication)
Not sure, how timing of synthesized core emulation, (obviously much slower that RT, does it reflect a preset clockspeed for eventual target FPGA ?
as my hardware design background, suggests. Or if there is such timing available, outside a code generator, in this case, compiler.
.
As in all simulation, the speed-up offered by new hardware blocks, may conflict with compiler output (that is mystery how timing is achieved from a compiled high level abstraction that runs sooo slow compared to hardware state machines that replace dimensional trigonometric expressions, for eg.
Coders waste 1mill man hours / day fixing bugs in memory unsafe C/ C++
.. and can’t be bothered to think outside the little box of x86. Any influencers?
https://github.com/B-lang-org/documentation
https://github.com/bluespec/Flute
https://github.com/bluespec/Piccolo
July 13, 2022 at 7:52 am
It seems that atomics and memory ordering can express almost all kinds of thread synchronization, and I don’t understand why standalone memory barriers are needed.
July 13, 2022 at 11:20 am
To squeeze the last bit of performance.