Multiprocessors with relaxed memory models can be very confusing. Writes can be seen out of order, reads can be speculative and return values from the future–what a mess! In order to impose some kind of consistency you have to use memory fences, and there are several kinds of them.

The x86 seems to be an oasis in the perilous landscape of relaxed memory multicores. The x86 memory model, detailed in the Intel 64 Architecture Memory Ordering White Paper and in the AMD spec, the AMD64 Architecture Programmer’s Manual, lists a lot of memory ordering guarantees, among them:

  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • Stores are not reordered with older loads.
  • In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  • In a multiprocessor system, stores to the same location have a total order.
  • In a multiprocessor system, locked instructions have a total order.
  • Loads and stores are not reordered with locked instructions.

The x86 also has (expensive, on the order of 100 cycles) memory fence instructions: mfence, lfence, and sfence. But, considering all those guarantees, why would anyone use them? The famous double-checked locking pattern, for instance, works just fine on the x86 without any fences.
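Here is a minimal sketch of that pattern with today's C++ atomics (the Singleton class and its members are mine, just for illustration). On the x86, the acquire load and the release store both compile to plain mov instructions, so no fence instruction is emitted:

#include <atomic>
#include <mutex>

class Singleton
{
public:
    static Singleton * instance()
    {
        // First check, without taking the lock.
        Singleton * p = _instance.load(std::memory_order_acquire);
        if (p == nullptr)
        {
            std::lock_guard<std::mutex> guard(_mutex);
            // Second check, under the lock.
            p = _instance.load(std::memory_order_relaxed);
            if (p == nullptr)
            {
                p = new Singleton();
                _instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    static std::atomic<Singleton *> _instance;
    static std::mutex _mutex;
};

std::atomic<Singleton *> Singleton::_instance(nullptr);
std::mutex Singleton::_mutex;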

So why the fences? This is a very important question: you don't want the compiler to over-fence your code and bring it to a crawl, but on the other hand you don't want it to emit incorrect code either. I decided to look for some answers.

There is one important non-guarantee listed in the x86 spec:

Loads may be reordered with older stores to different locations

Here’s an example from the x86 spec: x and y are in shared memory, both initialized to zero; r1 and r2 are processor registers.

Thread 0            Thread 1
mov [_x], 1         mov [_y], 1
mov r1, [_y]        mov r2, [_x]

When the two threads execute on different cores, the non-intuitive result r1 == 0 and r2 == 0 is allowed. Notice that such a result is consistent with each processor executing its load before its store (the two instructions access different locations).
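If you'd like to see this on real hardware, here's a sketch of a litmus test using relaxed C++ atomics; relaxed loads and stores compile to plain movs on the x86, so the hardware is left free to reorder them. Be warned that creating and joining threads adds a lot of incidental synchronization, so the interesting outcome can be rare; dedicated litmus harnesses keep both threads spinning instead.

#include <atomic>
#include <cstdio>
#include <thread>

// x and y play the roles of _x and _y from the example above.
std::atomic<int> x(0), y(0);
std::atomic<int> r1(0), r2(0);

int main()
{
    for (int i = 0; i < 1000000; ++i)
    {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);

        std::thread t0([]
        {
            x.store(1, std::memory_order_relaxed);        // mov [_x], 1
            r1.store(y.load(std::memory_order_relaxed),   // mov r1, [_y]
                     std::memory_order_relaxed);
        });
        std::thread t1([]
        {
            y.store(1, std::memory_order_relaxed);        // mov [_y], 1
            r2.store(x.load(std::memory_order_relaxed),   // mov r2, [_x]
                     std::memory_order_relaxed);
        });
        t0.join();
        t1.join();

        if (r1.load() == 0 && r2.load() == 0)
        {
            std::printf("store-load reordering observed in iteration %d\n", i);
            return 0;
        }
    }
    std::printf("no reordering observed\n");
    return 0;
}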

How important is this example?

I needed to find an algorithm that would be broken by this relaxation. I’ve seen Dekker’s algorithm mentioned in this context. There is a more modern version of it used in the Peterson lock. It’s a mutual exclusion algorithm that works for two threads only. It assumes that each thread has access to a thread-local ID, which is 0 for one thread and 1 for the other. Here’s the version adapted from The Art of Multiprocessor Programming:

class Peterson
{
private:
    // indexed by thread ID, 0 or 1
    bool _interested[2];
    // who's yielding priority?
    int _victim;
public:
    Peterson()
    {
        _victim = 0;
        _interested[0] = false;
        _interested[1] = false;
    }
    void lock()
    {
        // threadID is thread local,
        // initialized either to zero or one
        int me = threadID;
        int he = 1 - me;
        _interested[me] = true;
        _victim = me;
        while (_interested[he] && _victim == me)
            continue;
    }
    void unlock()
    {
        int me = threadID;
        _interested[me] = false;
    }
};

To explain how it works, let me impersonate one of the threads. When I (the thread) want to take a lock, I set my _interested slot to true. I am the only writer of this slot, although the other guy can read it. I also toggle the _victim switch to point to myself. Now I check if the other guy is also interested. As long as he is, and I am the victim, I spin. But once he becomes uninterested or turns himself into a victim, the lock is mine. When I’m done executing critical code, I reset my _interested slot, thus potentially releasing the other guy.
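To make the threadID assumption concrete, here's a hypothetical usage sketch (the thread_local declaration and the worker function are my additions, not part of the book's version). Bear in mind that, quite apart from the hardware issue discussed below, the fields would have to be volatile or atomic to keep the compiler from optimizing the spin loop; this sketch glosses over that.

#include <iostream>
#include <thread>

// Assumed to be declared before the Peterson class, so lock() can see it.
thread_local int threadID;

Peterson mtx;
int counter = 0;    // shared state the lock is meant to protect

void worker(int id)
{
    threadID = id;  // 0 for one thread, 1 for the other
    for (int i = 0; i < 100000; ++i)
    {
        mtx.lock();
        ++counter;  // critical section
        mtx.unlock();
    }
}

int main()
{
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    // With a correct lock this prints 200000.
    std::cout << counter << std::endl;
}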

Let me simplify this a little and partition the code between the two threads. Instead of the array _interested, there are two variables, zeroWants and oneWants, corresponding to the two slots.

zeroWants = false;
oneWants = false;
victim = 0;

Thread 0                            Thread 1
zeroWants = true;                   oneWants = true;
victim = 0;                         victim = 1;
while (oneWants && victim == 0)     while (zeroWants && victim == 1)
    continue;                           continue;
// critical code                    // critical code
zeroWants = false;                  oneWants = false;

Finally, let me rewrite the initial part of the execution in pseudo assembly.

Thread 0                Thread 1
store(zeroWants, 1)     store(oneWants, 1)
store(victim, 0)        store(victim, 1)
r0 = load(oneWants)     r0 = load(zeroWants)
r1 = load(victim)       r1 = load(victim)

Now look at the loads and stores to zeroWants and oneWants. They follow the same pattern as in the x86 reordering example. The processor is free to move the read of oneWants before the write to zeroWants (and to victim). Similarly, it can move the read of zeroWants before the write to oneWants. We may end up with the following execution:

Thread 0                Thread 1
r0 = load(oneWants)     r0 = load(zeroWants)
store(zeroWants, 1)     store(oneWants, 1)
store(victim, 0)        store(victim, 1)
r1 = load(victim)       r1 = load(victim)

Initially, both zeroWants and oneWants are false, so in both threads r0 may end up zero. In that case, neither spin loop executes and both threads march into the critical section. The Peterson lock on the x86 is broken!

Of course, there is a way to fix it. An mfence will force the correct ordering. It can be put anywhere between the store of zeroWants and the load of oneWants in one thread, and between the store of oneWants and the load of zeroWants in the other thread.
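Here is a sketch of what the fixed lock would look like. I'm using the _mm_mfence intrinsic to emit the fence (a locked instruction such as xchg would also do), and I'm still glossing over the fact that the fields should be volatile or atomic to keep the compiler itself from reordering or caching them:

#include <emmintrin.h>   // _mm_mfence

void lock()
{
    // threadID is thread local, initialized either to zero or one
    int me = threadID;
    int he = 1 - me;
    _interested[me] = true;
    _victim = me;
    // Full fence: make the stores above globally visible
    // before the loads in the spin loop are performed.
    _mm_mfence();
    while (_interested[he] && _victim == me)
        continue;
}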

So here we have a real-life example that requires an actual fence on an x86. The problem is real!