Publication safety is the core issue in the famously non-intuitive Double-Checked Locking Pattern.
What’s publication? In a nutshell, one thread prepares data and publishes it; other threads check whether the data has been published and use it. A common scenario is the creation of a shared object (this example is written in the D programming language, but it’s pretty self-explanatory).
shared Foo foo = new shared Foo();
When a thread creates an object, it first runs its constructor (Foo()) and then points the shared handle (foo) to it. Other threads check the handle for non-null and then happily access the object.
if (foo !is null) foo.doSomething();
Naturally, in our naivete, we tend to assume that if the second thread can see a non-null handle, the construction of the object must have completed. That belief is known as publication safety and, guess what, it’s not guaranteed on modern multiprocessors that use relaxed memory models.
To understand what’s happening, let’s simplify the problem even further and write it in pseudo assembly. Initially the globals x and ready are zero. R is a thread-local variable (a register). Think of writing to x as part of the construction of an object and of writing to ready as the publication (the initialization of a shared handle).
Thread 1 | Thread 2
---|---
x = 1 | if ready == 1
ready = 1 | R = x
Can Thread 2 see ready == 1 and x == 0? Yes, for two separate reasons. On a relaxed-memory-model multiprocessor:
- writes to memory can be completed out of order and
- reads from memory can be satisfied out of order.
Imagine processors sending e-mail messages to memory. Thread 1 sends a message instructing the memory to write 1 to x. Then it sends another message instructing it to write 1 to ready. It’s perfectly possible on modern processors that the first message gets delayed and the write to ready completes before the write to x.
The way to make sure this doesn’t happen is to separate the two writes with a memory barrier, or fence. Every relaxed-memory-model multiprocessor offers some way to do it. Even the x86s (x > 3) have such instructions (mfence, lfence, and sfence), even though they implement a processor-order memory model.
But beware: even if the writes are ordered by a (write) fence, the reads in Thread 2 may still execute out of order. Imagine that Thread 2 sends two e-mail messages asking for the values of ready and x. The second message arrives first, before any writes by Thread 1 are done. The memory sends back an e-mail with the value 0 for x. Next, the two writes by Thread 1 are committed. Then the first read message (fetch ready) arrives, and the memory responds with the value 1. Thread 2 sees a non-zero value of ready, but a zero (uninitialized) value of x. We’re in trouble!
Notice that the read of x is speculative. The processor issues the read request just in case the branch ready == 1 is taken. If it’s not, it can always abandon the speculation.
Again, the way to ensure that the two reads are satisfied in program order is to put a fence between them. Here’s the pseudocode.
Thread 1 | Thread 2
---|---
x = 1 | if ready == 1
write fence | read fence
ready = 1 | R = x
Both fences are necessary!
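Here is a minimal C++11 sketch of the complete pattern, with std::atomic_thread_fence standing in for the write and read fences (the variables are made atomic only so the compiler knows they are shared; the relaxed accesses themselves impose no ordering):

```cpp
#include <atomic>

std::atomic<int> x{0};      // stands for the object under construction
std::atomic<int> ready{0};  // stands for the published handle

void thread1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // the write fence
    ready.store(1, std::memory_order_relaxed);
}

void thread2() {
    if (ready.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);  // the read fence
        int R = x.load(std::memory_order_relaxed);  // guaranteed to see 1
        (void)R;
    }
}
```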
The write fence is easier to remember. In our publication example, it makes sense to put it at the end of the constructor. It has the connotation of flushing all the writes performed during construction, before the public handle is initialized.
It’s the need for the read fence that is often overlooked. It’s not immediately obvious that every time you access a published shared variable you have to use a fence. It’s the “every time” part that seems excessive, especially if your code initializes the handle only once (as in the double-checked locking pattern). Sure, there are a few cases when a benign race is acceptable, but even the best of us get it wrong half of the time.
Why is this whole low-level discussion relevant? Very few programmers will be inserting (non-portable) fences into their code. Most programmers will use monitors and locks, which have the appropriate fences (or their equivalents) built in. Java programmers will mark shared variables volatile, which tells the compiler to issue memory fences on every access. C++ and D programmers will occasionally use atomics, which are implemented with all the fencing in place.
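For instance, the D example from the top of the post could be written with a C++ atomic handle; in this sketch the default (sequentially consistent) ordering on the store and the load supplies all the required fencing:

```cpp
#include <atomic>

struct Foo {
    int data;
    Foo() : data(42) {}     // the construction that must complete first
    void doSomething() {}
};

std::atomic<Foo*> foo{nullptr};  // the shared handle

void creator() {
    foo.store(new Foo());   // construction happens-before the publication
}

void user() {
    Foo* p = foo.load();    // the load brings the read fencing with it
    if (p != nullptr)
        p->doSomething();   // safe: p points to a fully constructed Foo
}
```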
But look at it this way: This is a cautionary story for high-level programmers too. Do not elide synchronization even in the simplest, seemingly obvious cases! Don’t try to be clever! The processors (and the compilers) are out there to get you. The slightest slip and they will “optimize” your code in a way that is contrary to your intuitions.
August 6, 2008 at 10:44 am
What a coincidence! Herb Sutter has just published (August 5) an article in which he deconstructs a lock-free implementation of a producer/consumer queue that appeared in DDJ. One of the problems he points out is a publication race. In general, the implementation totally ignored the need for fencing. The moral of the story is, if you’re trying to implement any multithreaded code without using locks, consult Herb first.
August 7, 2008 at 9:32 pm
Great post. Very interesting. Thank you. I always wondered how out-of-order reads/writes translated in high-level languages. You explain it very well.
December 25, 2011 at 9:27 pm
Very readable and easy to follow explanation of a complex topic – thanks!
January 20, 2012 at 7:00 am
I would like to see clearly in this DCLP story: I have searched the web, but I cannot find a definitive answer as to whether this pattern is safe or not.
Here is my problem: I work on a project where this pattern is often used. I know it is bad in principle because it is not portable (I have been bitten by DCLP on AIX, for instance, in the past); however, the project I am working on at the moment exclusively uses Solaris on SPARC and Linux on x86, which are both TSO platforms. I would like to be certain that DCLP is safe on these platforms.
Let’s illustrate this with an example. DCLP typically goes along the lines of:
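Something along these lines (a sketch, with pthreads standing in for our actual lock):

```cpp
#include <pthread.h>

int  resource = 0;     // the shared data; in practice often a pointer
bool ready    = false; // the "published" flag
pthread_mutex_t mutex  = PTHREAD_MUTEX_INITIALIZER;

void init() {
    if (!ready) {                      // first, unlocked check
        pthread_mutex_lock(&mutex);
        if (!ready) {                  // second check, under the lock
            resource = 42;             // the store to resource
            ready = true;              // the store to ready
        }
        pthread_mutex_unlock(&mutex);
    }
}
```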
resource being an integer here, although in practice it is often a pointer to a type T and resource is initialised with resource = new T(). Nevertheless, I think the code above is general enough to illustrate the problems.
(Also I know that instead of going through mind-boggling hardware considerations, it would be just as easy not to use this pattern. Point is that I work on a large code base where this pattern is used in many places and I don’t want to change this pattern everywhere if it is safe in the first place).
This being said, the init() function above would approximately translate to the following pseudo-code:
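Roughly this (a sketch; r0 and r1 are registers, and the numbers on the left are the line numbers I refer to below):

```
1:  r0 = load ready          ; the unlocked check of the flag
2:  if r0 == 1 goto done
3:  call lock
4:  r0 = load ready          ; the second check, under the lock
5:  if r0 == 1 goto out
6:  r1 = 42
7:  r0 = 1
8:  store resource, r1       ; the store to resource
9:  store ready, r0          ; the store to ready
10: out:  call unlock
11: done: return
```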
Now, when it comes to thread-safety, as far as I understand, several issues come into play. To name those I am interested in:
1) compiler reordering instructions,
2) hardware reordering instructions,
3) cache coherency
In other words, keeping in mind this code will only ever run on a TSO platform:
1) the stores on lines 8 and 9 cannot be reordered by the hardware, right?
2) a C++ compiler may decide to reorder them, however, I think. Correct?
3) if a thread A running on processor P0 executes init() up to line 9, and a thread B running on processor P1 starts executing line 1, is it possible that thread B sees the updated value of the flag ready? In other words, is it possible that the value of “ready” held in register r0 on P0 is flushed to main memory (and to P1’s cache) before the call to unlock()?
4) even if neither the hardware nor the C++ compiler reorders the stores on lines 8 and 9, is it guaranteed that processor P1 will see the updates in the same order as they occurred on processor P0? In other words, is this Total Store Order defined on a single processor, or across all processors in a system?
Can someone shed some light on this?
January 20, 2012 at 12:25 pm
@Vincent: Lots of good questions. Let me try to answer them.
1. Correct. You have the TSO guarantee, so all processors will see them in the same order.
2. Correct. The compiler is free to reorder these instructions and format your disk in the process because you have a data race, which might result in undefined behavior. The compiler assumes that variables ready and resource are only visible to the current thread.
3. Correct. The write buffer may be flushed at any time. (I assume the store is to “ready”, not to r0.)
4. Correct, total store order means all processors see the same order of writes.
In general, I wouldn’t worry about cache coherency so much as about the presence of the write buffer between the processor and the cache (maybe that’s what you mean). But the real problem here is with the compiler, which has no idea that the variables are shared. Your translation to assembly is probably correct in the debug mode, but all bets are off when you turn on optimizations. Even if this code works with your current compiler, it might break when you upgrade.
January 23, 2012 at 2:21 am
Thanks Bartosz for confirming my fears 🙂
Because, as far as I understand, this means that this DCLP is not safe, not even on a TSO system. An example scenario where it may fail: if the compiler decides to reorder the two stores on lines 8 and 9, and the store to ready is flushed before the store to resource, then another thread may see ready as true while not yet seeing the initialised resource.
I know this is a lot of ifs, I know that our compilers (Sun CC and g++) seem not to reorder these stores, but as you say, I have no way to make sure that this will stay true for all occurrences of this pattern, and for all future versions of these compilers.
One extra question, though: in this very case, the bug is due to compilers being allowed to reorder the two stores, because the C++03 memory model is basically single-threaded. How is this changed or improved by C++0x? Is using a C++0x-compliant compiler enough to forbid this reordering? Would I have to declare the variables ready and resource as atomic? Or does it just not make any difference, with no way around this except using proper synchronisation around all shared data?
January 23, 2012 at 10:23 am
In C++11 you would make “ready” atomic (no need to make “resource” atomic). If you make the load std::memory_order_acquire and the store std::memory_order_release, you will not only prevent the compiler from reordering, but also introduce no barriers on the x86, so you won’t sacrifice efficiency. As a bonus, the code will be portable.
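A sketch of what that would look like, using the resource and ready from your example:

```cpp
#include <atomic>
#include <mutex>

int resource = 0;                // stays a plain variable
std::atomic<bool> ready{false};  // only the flag becomes atomic
std::mutex mtx;

void init() {
    if (!ready.load(std::memory_order_acquire)) {          // a plain mov on x86
        std::lock_guard<std::mutex> guard(mtx);
        if (!ready.load(std::memory_order_acquire)) {
            resource = 42;                                  // initialize first...
            ready.store(true, std::memory_order_release);   // ...then publish
        }
    }
}
```

On the x86, both the acquire load and the release store compile to ordinary moves; what they buy you is that the compiler can no longer reorder them, and the code stays correct on more weakly ordered hardware.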
April 19, 2016 at 12:39 pm
Would asm volatile("" ::: "memory"); between lines 8 and 9 fix the compiler reordering issue, with TSO guaranteeing the load order? So is an atomic ready really necessary?