Publication safety is the core issue in the famously non-intuitive Double-Checked Locking Pattern.

What’s publication? In a nutshell, one thread prepares data and publishes it; other threads check whether the data has been published and then use it. A common scenario is the creation of a shared object (this example is written in the D programming language, but it’s pretty self-explanatory).

shared Foo foo = new shared Foo();

When a thread creates an object, it first runs its constructor (Foo()) and then points the shared handle (foo) to it. Other threads check the handle for non-null and then happily access the object.

if (foo !is null) foo.doSomething();
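
To make this concrete, here’s how the whole racy exchange might look in D (a sketch; the body of Foo, the thread setup, and all the names below are illustrative assumptions, and it relies on classic D semantics, under which direct reads and writes of shared data compile):

import core.thread : Thread;

class Foo {
    int answer;
    this() shared { answer = 42; }                      // "construction": writes to the object
    void doSomething() shared { assert(answer == 42); } // may fire if those writes aren't visible yet
}

shared Foo foo; // the public handle, null until published

void main() {
    auto t1 = new Thread({ foo = new shared Foo(); });              // construct, then publish
    auto t2 = new Thread({ if (foo !is null) foo.doSomething(); }); // check, then use
    t1.start(); t2.start();
    t1.join();  t2.join();
}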

Naturally, in our naivete, we tend to assume that if the second thread can see a non-null handle, the construction of the object must have completed. That belief is known as publication safety and, guess what, it’s not guaranteed on modern multiprocessors that use relaxed memory models.

To understand what’s happening, let’s simplify the problem even further and write it in pseudo assembly. Initially the globals x and ready are zero. R is a thread-local variable (register). Think of writing to x as part of the construction of an object and writing to ready as the publication (the initialization of a shared handle).

Thread 1          Thread 2
x = 1
ready = 1
                  if ready == 1
                      R = x

Can Thread 2 see ready == 1 and x == 0? Yes, for two separate reasons. On a relaxed-memory-model multiprocessor:

  1. writes to memory can be completed out of order and
  2. reads from memory can be satisfied out of order.

Imagine processors sending e-mail messages to memory. Thread 1 sends a message instructing the memory to write 1 to x. Then it sends another message instructing it to write 1 to ready. It’s perfectly possible on modern processors that the first message gets delayed and the write to ready completes before the write to x.

The way to make sure this doesn’t happen is to separate the two writes by a memory barrier, or fence. Every relaxed-memory-model multiprocessor offers some way to do it. The x86’s (x > 3) have such instructions (mfence, lfence, and sfence), even though they implement a processor-order memory model.
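
In D, for instance, a portable full fence is available through core.atomic (atomicFence is the current runtime-library name; on x86 it typically lowers to an mfence). A sketch of the writing side, with the same caveat about direct shared access as before:

import core.atomic : atomicFence;

shared int x, ready;

void thread1() {
    x = 1;
    atomicFence(); // full fence: the write to x completes before...
    ready = 1;     // ...the write to ready becomes visible
}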

But beware, even if the writes are ordered by a (write) fence, the reads in Thread 2 may still execute out of order. Imagine that Thread 2 sends two e-mail messages asking for the values of ready and x. The second message arrives first, before any writes by Thread 1 are done. The memory sends back an e-mail with the value 0 for x. Next, the two writes by Thread 1 are committed. Then the first read message (fetch ready) arrives, and the memory responds with the value 1. Thread 2 sees a non-zero value of ready, but a zero (uninitialized) value of x. We’re in trouble!

Notice that the read of x is speculative. The processor issues the read request just in case the branch ready == 1 is taken. If it isn’t, it can always abandon the speculation.

Again, the way to ensure that the two reads are satisfied in program order is to put a fence between them. Here’s the pseudocode.

Thread 1          Thread 2
x = 1
write fence
ready = 1
                  if ready == 1
                      read fence
                      R = x

Both fences are necessary!
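
Completing the earlier sketch, the reading side gets its own fence (core.atomic’s full fence again stands in for a dedicated read fence):

import core.atomic : atomicFence;

shared int x, ready;

void thread2() {
    if (ready == 1) {
        atomicFence(); // keeps the read of x from being satisfied before the read of ready
        int R = x;     // now guaranteed to see x == 1
    }
}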

The write fence is easier to remember. In our publication example, it makes sense to put it at the end of the constructor. It has the connotation of flushing all the writes performed during construction before the public handle is initialized.

It’s the need for the read fence that is often overlooked. It’s not immediately obvious that every time you access a published shared variable you have to use a fence. It’s the “every time” part that seems excessive, especially if your code initializes the handle only once (as in the double-checked locking pattern). Sure, there are a few cases when a benign race is acceptable, but even the best of us get it wrong half of the time.

Why is this whole low-level discussion relevant? Very few programmers will be inserting (non-portable) fences into their code. Most programmers will use monitors and locks, which have appropriate fences (or their equivalents) built in. Java programmers will mark shared variables volatile, which will tell the compiler to issue memory fences on every access. C++ and D programmers will occasionally use atomics, which are implemented with all the fencing in place.
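
For instance, the publication pattern from the top of this post can be written in D with core.atomic. This is a sketch, where the release store and acquire load play the roles of the write and read fences (atomicStore, atomicLoad, and the MemoryOrder names are from the current runtime library):

import core.atomic : atomicLoad, atomicStore, MemoryOrder;

shared Foo foo;

void publisher() {
    auto f = new shared Foo();             // run the constructor first
    atomicStore!(MemoryOrder.rel)(foo, f); // release store: the "write fence"
}

void consumer() {
    auto f = atomicLoad!(MemoryOrder.acq)(foo); // acquire load: the "read fence"
    if (f !is null) f.doSomething();
}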

But look at it this way: this is a cautionary tale for high-level programmers too. Do not elide synchronization even in the simplest, seemingly obvious cases! Don’t try to be clever! The processors (and the compilers) are out there to get you. The slightest slip and they will “optimize” your code in a way that is contrary to your intuitions.