Concurrency



Can a good type system prevent concurrency errors? Or is this a quest for the Holy Grail?

There are two parts to this question, corresponding to two major types of concurrency errors:

  1. Preventing data races
  2. Preventing deadlocks

I’ll start with the first one.

Data races occur only when memory is shared between threads. Disallow sharing and data races are gone! In fact there is a name for threads that don’t share memory: processes. It’s perfectly feasible to have a concurrent language that disallows sharing–Erlang is one (see my post about Erlang). The trick is to always pass data between threads by value. This is especially easy in functional languages.

Non-functional languages like C++, Java, or D (I was told by Walter Bright, the creator of D, to always use the full name, “the D programming language,” so that search engines can index it properly) tend to share data almost by default (see, however, this post).

In Java, all non-primitive types have reference semantics. When you pass a Java object between threads, you’re only passing a handle; the object behind it becomes accessible from both threads.

C++ at least has means to express passing by value and move semantics for user-defined types. Still, it’s up to the programmer to use them.

Who ordered Guava?

For every type-system idea there are one or more dialects of Java that implement it. I’ll start with an older attempt at data-race-free Java called Guava, as it illustrates some of the basic premises.

-Explicit sharing

The most important step–if we don’t want to completely ban the sharing of data–is to regulate it. Let the programmer explicitly mark the data destined for sharing as such. The corollary is that the data that is not marked for sharing cannot be shared. This can be accomplished, for instance, by making all objects thread-local by default, or by using type modifiers that prevent references to such objects from escaping.

In Guava, the data type designed for sharing is called a Monitor. As the name suggests, all access to a Monitor is automatically synchronized by the Monitor’s lock. This, incidentally, eliminates the need for the synchronized keyword, which is absent from Guava.

The non-shared data types are either Objects or Values.

Objects are just like regular Java Objects, except that they don’t have a built-in lock, since they can never be shared between threads.

Values are either primitive values, like integers, or user-defined types with value semantics.

Monitors, Objects, and Values are collectively called instances.

-Value semantics

When you pass a Value, the compiler will make a deep copy of it (well, sort of–the monitors embedded in a Value are not deep copied). Since deep copying might be expensive, Guava defines operator “move”, which nulls the source. The syntax is:

  v2 = v1--;

The value v1 becomes null after the move to v2. This is similar to C++ unique_ptr and std::move.
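
Standard Java has no move semantics, but the idea is easy to sketch by hand: transfer the payload and null out the source. The Holder class below is a hypothetical illustration of that idea, not part of Guava:

public class Holder<T> {
    private T _value;

    public Holder(T value) { _value = value; }

    // The equivalent of v2 = v1--: return the payload and leave the source empty.
    public T move() {
        T tmp = _value;
        _value = null;
        return tmp;
    }
}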

-Ownership

The biggest problem in lock-based concurrency is making sure that the correct lock (or locks) is taken when accessing shared data. In Guava, all of a Monitor’s data is protected by that Monitor’s lock. As long as the data stays inside the Monitor, nothing bad can happen to it from the point of view of concurrency.

Values stored inside a Monitor are never accessible outside of the Monitor–only their copies may be passed out.

The same is not true about Objects. Since Objects have reference semantics, there is a real danger that Objects’ references might escape the Monitor that protects them. Imagine a situation where two Monitors have references to the same Object. It is possible then that two threads may operate on that Object at the same time–one entering through one Monitor, the other through the other Monitor. We have a data race!

Therefore it is important that every Object have an owner at all times. The Object’s owner is either a Value or a Monitor. (The exception is a fresh Object that has just been allocated–it has no owner until it is assigned to a variable.) Since an Object may be owned by at most one Monitor, it is that Monitor that protects it from simultaneous (racy) access.

-Regions

All Objects that are owned by a particular Monitor or Value form a region. Equivalently, assigning a monitored region to an object specifies what lock must be held when accessing it.

All instances may contain (references to) monitors, but monitors are not “owned” by anybody. References to the same monitor may appear in multiple regions and may be freely passed around. It is thus up to programmers to define an ordering scheme for their monitors in order to avoid deadlocks.

How can we protect Objects from moving between regions and acquiring multiple owners? We need a way to control aliasing.

Here are some Guava rules for passing Objects. A method may declare its Object parameter as either kept or lent. (By default, parameters to Object methods are kept and to Monitor methods are lent.) If the parameter is kept it must belong to the same region as this, and there are no limitations on its use. If, however, the parameter is lent, the method may not store a reference to it in this, nor may it store this inside a lent Object. No cross-aliasing is allowed.

A method may also be marked new if it returns a freshly created object, which has no owner yet. Constructors are considered new unless they accept kept parameters.

Notice that you may have multiple references to the same Object, but they will all be within the same region. The only instances that may be passed between threads are Monitors and Values.

-Constructors

Guava final fields may be initialized either inside a constructor or in a private method that is only called from inside a constructor. (BTW, in D a private method is not private within a module, so the compiler would have to analyze the whole module to establish the latter condition.) [Note to self: The same scheme might be useful in the construction of immutable objects in D.]

Partially constructed Objects must not be visible outside constructors. The compiler must verify that constructors don’t pass this to other methods, and don’t store this inside other instances (no alias cross-contamination).

-Optimizations

Copying Values around may be expensive. I already mentioned one optimization, the use of the “move” operator. The other optimization is related to immutability. If a Value is immutable, it may be safely passed by reference. Guava defines immutable classes as ones that have no update methods. Any method that may modify this must be marked update. The update notation is also extended to method parameters–by default parameters are immutable.

There is a bonus advantage to separating update methods from the rest. In a Monitor, a non-update method may safely use a readers’ lock, which allows multiple readers to access data simultaneously, to increase concurrency.
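
Outside of Guava you can get the same effect by hand. Here is a minimal plain-Java sketch using java.util.concurrent.locks.ReentrantReadWriteLock, with the read-only (non-update) method taking the shared lock and the update method taking the exclusive one:

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class Stats {
    private final ReentrantReadWriteLock _rw = new ReentrantReadWriteLock();
    private int _count;

    // "non-update" method: multiple readers may hold the lock at once
    public int count() {
        _rw.readLock().lock();
        try { return _count; }
        finally { _rw.readLock().unlock(); }
    }
    // "update" method: requires the exclusive writers' lock
    public void increment() {
        _rw.writeLock().lock();
        try { ++_count; }
        finally { _rw.writeLock().unlock(); }
    }
}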

-Global and local methods

A method is considered local if its effects cannot be observed or influenced by other threads. All Object and Value methods are by default local. A local method is immune to races thus allowing single-thread optimizations.

Conversely, all Monitor methods are considered global, since operations on Monitors may be visible or influenced by other threads.

These defaults may be overridden. For instance, an Object may contain a reference to a Monitor. The methods that call this Monitor’s methods must be marked global. Moreover, Object or Value methods that access Monitors that are passed to them as arguments must also be marked global. So touching a Monitor (which is equivalent to using its lock) taints the whole callers’ tree with the global annotation.

This is similar to the way update taints the callers tree, except the update annotation of a method only pertains to this, not to its arguments. However, when global and update are combined, they have the same tainting power as global. In particular, a method that invokes a global update method on its argument becomes tainted with global update.

Methods that are global update cannot be invoked from non-update methods, even if they only global-update their arguments.

Note: This part of the paper is not very clear to me. The authors never explain the importance of global update methods (other than optimization opportunities).

-Limitations

Guava implements a relatively simple and somewhat limited system to eliminate races. It punts on the problems of ownership transfer and lock-free programming. Even with those limitations the Guava type system is not simple.

The idea that safe multithreading may be achieved with a few simple modifications to the type system seems to be untenable. However, as long as special type annotations are limited to the code that does actual multithreading, I think they’re worth the added complexity.


In my last post I criticized the C++0x implementation of futures for their lack of composability. I now feel obliged to show what I consider “the right way” to do futures.

I use the event library of Concurrent Caml (actually, its version distributed with OCaml) as my model. The ideas behind it are described in J. H. Reppy’s Ph.D. thesis, Higher-order concurrency.

Events and Futures

CML channels and events are more general than futures, but here I’ll concentrate on the subset of their functionality that can be used to implement futures. Below is a useful translation table between CML and C++. (I should add that CML channels, as opposed to C++ futures, are synchronous. In practice it doesn’t matter that much, since promise::set_value–corresponding to channel send–is usually called right before the thread exits.)

CML                                    C++
channel                                promise
event                                  future
evt = receive chan                     ftr = prom.get_future()
sync (send chan msg)  (synchronous)    prom.set_value(msg)  (asynchronous)
msg = sync evt                         msg = ftr.get()
choose evt_lst                         ???
evt’ = wrap evt fun                    ???

As you can see, C++ has no equivalents of choose and wrap. I will discuss those next.

choose

CML choose is a function that takes a list of events and creates a new composite event. When the program syncs on this event, it blocks until one of the listed events completes. The combination sync and choose is similar to Unix select or Windows WaitForMultipleObjects.

An example of the usefulness of choose is a situation where there is more than one algorithm to calculate a result, and the program runs them in parallel until the fastest one completes. Linear search, for instance, is faster for small sets than binary search. Instead of figuring out which one to use for each particular case, it might be cheaper (on a multicore) to run both and stop when the first one is done.

What would choose look like in C++? One possibility is to implement it as a container object. You would push individual futures into it and then call get_future to obtain a composite future. Let me call this object future_or since it does the equivalent of logical OR for futures (the equivalent of AND is called a thread barrier).

  unique_future<int> ftr1 = promise1.get_future();
  unique_future<int> ftr2 = promise2.get_future();
  future_or<int> orf;
  orf.push_back(std::move(ftr1));
  orf.push_back(std::move(ftr2));
  unique_future<int> compositeFut = orf.get_future();
  ...
  int result = compositeFut.get();

Another possible implementation, in C++0x, would be to define a variadic template function future_choose. It would take a variable number of futures and return the composite future. The advantage of such an approach would be to let the compiler infer the types and the number of arguments automatically.

You probably noticed that future_or can only accept futures of one type (the same would be true about the function future_choose). For instance, to create a composite future that returns a string upon get, you must build it from other futures that also return a string (and not an int, for instance).

In principle, heterogeneous futures could be combined if they all returned different types through the same variant. Then, after a composite of variant-returning futures returns a variant, program logic would fork depending on the actual type inside the variant. This approach bypasses the static type system and is error prone. CML provides a more elegant solution using wrap.

wrap

The trick is to combine (wrap) a particular future with a function that processes its output. A future that returns an int is combined with a function that takes an int, etc. The result of this combination is a new future usually returning void.

When a wrapped future is synchronized (e.g., by calling wait), it executes the associated function, passing it the result returned by the original future. Just look at this example:

  void handler(Foo const & foo);
  unique_future<Foo> fooFuture = prom.get_future();
  // combine the future with a handler function
  unique_future<void> wrappedFuture = wrap_future(fooFuture, &handler);
  ...
  wrappedFuture.wait();
  // at this point fooFuture has returned a Foo object 
  // and the handler has been executed with that object.

There’s one thing worth mentioning: the compiler will assure type correctness of the arguments to wrap_future: the return type of the future must be compatible with the argument type of the function. Mismatch errors will be discovered at compile time. Compare this with the implementation using variants, which does not have this property.

Of course the usefulness of wrap really comes into play when you try to combine heterogeneous futures using future_choose. You just wrap each future with its own post-processing function (or callable object) and pass them all to future_choose. Wrapped futures are homogeneous–they all return void.

Here’s an example that uses C++0x lambdas (one of them being a closure):

  unique_future<Foo> fooFuture = prom1.get_future();
  unique_future<int> intFuture = prom2.get_future();
  int sum = 0;
  future<void> composite = future_choose(
      wrap_future(fooFuture, [](Foo const & foo){foo.Commit();}),
      wrap_future(intFuture, [&sum](int i){sum += i;}));
  ...
  composite.wait();

Wrapped futures would be useful, e.g., in implementing a server that listens to several communication channels.
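
For comparison, Java’s CompletableFuture (JDK 8) offers the same pair of combinators off the shelf: thenAccept plays the role of wrap and anyOf the role of choose. Here is a sketch of the linear-vs-binary-search race from earlier; the two search methods are hypothetical stand-ins:

import java.util.concurrent.CompletableFuture;

public class ChooseAndWrap {
    public static void main(String[] args) {
        // "wrap": attach a handler to each future; the wrapped futures
        // are homogeneous (they complete with no value, like future<void>)
        CompletableFuture<Void> fast =
            CompletableFuture.supplyAsync(() -> linearSearch())
                             .thenAccept(r -> System.out.println("linear: " + r));
        CompletableFuture<Void> smart =
            CompletableFuture.supplyAsync(() -> binarySearch())
                             .thenAccept(r -> System.out.println("binary: " + r));
        // "choose": block until whichever wrapped future completes first
        CompletableFuture.anyOf(fast, smart).join();
    }
    // stand-ins for the two search strategies
    static int linearSearch() { return 42; }
    static int binarySearch() { return 42; }
}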

What about Haskell?

As I described in my earlier post, unlike CML, Haskell implements asynchronous message passing. The original Concurrent Haskell paper argued against choose. Later however Avik Chaudhuri was able to show that the CML event library can be implemented on top of asynchronous MVars. So there are no theoretical obstacles to asynchronous implementation of futures.

Final remarks

The equivalent of choose called future_select has been proposed for Boost and C++0x, but it was turned down.

Interestingly enough, Microsoft’s Task Parallel Library for .NET implements higher-order concurrency. They have ContinueWith for wrapping futures and WaitAny for choice. I can only assume that it’s easier to implement those on a virtual machine.

I also wonder whether it would be more productive to implement the equivalent of the CML channels and events in C++, and treat futures as a special case.


Bibliography

  1. J. H. Reppy. Higher-order concurrency. PhD thesis, Cornell University, 1992. Technical Report 92-1852.
  2. Avik Chaudhuri. A Concurrent ML Library in Concurrent Haskell.

I am frankly disappointed with C++0x futures. They lack a major feature–composability. It’s like having Booleans without OR or AND.

But let me first explain what futures are and how they’re implemented in C++0x.

What is a future?

In a nutshell, a future is an asynchronous function call. You call a function, but don’t wait until it completes–the body of the function executes in a separate thread. Instead you get a future, a promise to deliver the result at some later point in time. Only when your program desperately needs that value do you synchronize on the future. If the calculation hasn’t been completed by then, you block until it has. Otherwise you get the result of the calculation.
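
The same idea already exists in Java’s executor framework, which may be a useful reference point; a minimal sketch:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args)
            throws InterruptedException, ExecutionException {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        // the call runs in another thread; submit returns immediately
        Future<Integer> fut = exec.submit(new Callable<Integer>() {
            public Integer call() { return 6 * 7; } // the asynchronous "function body"
        });
        // ... do other work ...
        int result = fut.get(); // blocks only if the result is not ready yet
        System.out.println(result);
        exec.shutdown();
    }
}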

C++0x futures

C++ splits the implementation of futures into a set of small blocks–it almost creates an assembly language of futures. Here are the primitives.

-promise

A promise is a vehicle for passing the return value (or an exception) from the thread executing a function to the thread that cashes in on the function future. The function (or a callable object) that is executed in a separate thread must have access to the promise, and it explicitly sets the return value.

You can think of a promise as a primitive channel; the function in this case is the sender. The equivalent of the “send” method is called promise::set_value (to pass an exception to the caller, the callee calls promise::set_exception).

There is no corresponding “receive” method–the receiving is abstracted into the future object.

Here’s some code that illustrates the use of promise. Notice how the function to be called asynchronously must be aware of the promise.

void asyncFun(promise<int> intPromise)
{
    int result;
    try {
        // calculate the result
        intPromise.set_value(result);
    } catch (MyException const & e) {
        intPromise.set_exception(std::copy_exception(e));
    }
}

The calling thread creates the promise and passes it to the worker thread executing asyncFun (I’ll show the details later).

-future

A future is the synchronization object constructed around the receiving end of the promise channel. The calling thread obtains a future using promise::get_future.

The most important method of the future is get. If the result is not ready, get will block. When get completes, it either returns a value or throws an exception. The return value or the exception has been set by the called function through the promise associated with the future (see above).

get can also be split into its more basic parts: wait, optionally followed by has_value (or has_exception), and then a call to get that is guaranteed not to block. The advantage of the latter approach is that one can use the timed versions of wait: wait_for and wait_until.

There is also a non-blocking method, is_ready, that can be used for polling rather than blocking.

There are two separate implementations of future: the regular future (which used to be called unique_future in the Draft Standard), which works like a unique_ptr as far as passing values goes, and shared_future, which works more like a shared_ptr. For instance, the method get of future cannot be called twice, because it transfers the ownership of the result.

Here’s an example of how a future could be used in the caller’s code:

promise<int> intPromise;
future<int> intFuture = intPromise.get_future();
std::thread t(asyncFun, std::move(intPromise));
// do some other stuff
int result = intFuture.get(); // may throw MyException

-packaged_task

In most cases the use of promises can be hidden from the programmer. There is a template class, packaged_task, that takes any function (or callable object) and instruments it for use with futures. In essence, it creates a promise and calls the function from inside a try/catch block. If the function returns normally, packaged_task stores the value in the promise; if it throws, it stores the exception. A packaged_task is itself a callable object (it has the function call operator() defined) and can be passed directly to a thread.

Here’s a more complete example:

vector<int> primesUpTo(int n);

int main() {
    packaged_task<vector<int>(int)> task(&primesUpTo);
    future<vector<int>> fut = task.get_future();
    thread t(move(task), 100);
    t.detach();
    // do some other work while the primes are computing
    vector<int> primes = fut.get();
    // print primes
    return 0;
}

Composability

So far so good. The creation of futures might be a bit verbose in C++, but it can be easily adapted to concrete environments. The futures standard library should be treated as a set of primitive building blocks from which to build higher abstractions.

Notice how C++ separated the channel (promise) from the synchronization mechanism (future); and thread creation from communication and synchronization. You may, for instance, use futures with thread pools or create threads on the spot.

What was omitted from the standard, however, was the ability to compose futures. Suppose you start several threads to perform calculations or retrieve data in parallel. You want to communicate with those threads using futures. And here’s the problem: you may block on any individual future but not on all of them. While you are blocked on one, other futures might become ready. Instead of spending your precious time servicing those other futures, you might be blocked on the most time-consuming one. The only option to process futures in order of completion is by polling (calling is_ready on consecutive futures in a loop) and burning the processor.

The need for composability of synchronization events has been widely recognized. Windows has WaitForMultipleObjects, Unix has select, and several higher order languages have explicit constructs that serve the same purpose. But that’s a topic for my next installment.

Acknowledgments

I’d like to thank Anthony Williams for helpful comments. Among other things, he is the author of one of the futures proposals for Boost. A stripped down version of it will become part of the threadpool library. Unfortunately, the stripping got rid of composability features.


What is the most basic building block for asynchronous message passing? I’ve found the answer in Concurrent Haskell. Not many people are fluent in Haskell, especially when monads are involved, so MVars may sound foreign to most. Not to worry–I’ll translate them into Java (as I did with synchronous CML channels in my previous post).

An MVar is an object with two methods, put and take. The sender calls put with a message and continues (it doesn’t block–that’s why this message-passing model is called “asynchronous”). The receiver, in another thread, calls take to remove the message from MVar. If there is no message, the receiver blocks until there is one.

An MVar can store a maximum of one message, so it’s an error to call put when there is an unclaimed message inside it. An MVar can thus be only in one of two states: empty or full.

Here’s a simple-minded implementation of MVar written in Java. (A lock-free implementation is possible–and closer to how Haskell implements it–but it’s harder to reason about.)

public class MVar<T> {
    private T _obj;
    private boolean _full;

    public MVar(){
        _full = false;
    }
    // put: asynchronous (non-blocking)
    // Precondition: MVar must be empty
    public synchronized void put(T obj) {
        assert !_full;
        assert _obj == null;
        _obj = obj;
        _full = true;
        notify();
    }
    // take: if empty, blocks until full.
    // Removes the object and switches to empty
    public synchronized T take() throws InterruptedException {
        while (!_full)
            wait(); // may throw!

        T ret = _obj;
        _obj = null;
        _full = false;
        return ret;
    }
}

You might think that it would be difficult to implement anything useful with MVars. After all it looks like a message queue of length one, which bombs when you try to put a second message in. Yet it is the simplest building block for more sophisticated structures. It is the atom of asynchronous message passing.
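
Here is a minimal usage sketch: one thread blocks in take until another thread puts a value (the class name and the timing are just for illustration):

public class MVarDemo {
    public static void main(String[] args) throws InterruptedException {
        final MVar<String> box = new MVar<String>();
        // the consumer blocks in take() until the producer calls put()
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    System.out.println("Got: " + box.take());
                } catch (InterruptedException e) {}
            }
        });
        consumer.start();
        Thread.sleep(100);    // let the consumer block first
        box.put("hello");     // asynchronous: returns immediately
        consumer.join();
    }
}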

We’ve seen before how to implement asynchronous message passing using synchronous building blocks by spawning a thread. Now let’s see how we can implement a simple synchronous channel variable using asynchronous MVars. Remember, in a synchronous channel, if there is no receiver already waiting on the other side, a call to send (or write, in this example) will block.

The basic channel variable, or a channel of length one, is called a CVar in Haskell, and it contains two MVars, one to store the message, and the other for the acknowledgment. The acknowledgment MVar doesn’t really have to store anything–we are only interested in its state: empty or full. The reader will acknowledge the receipt of the message by setting it to full.

public class CVar<T> {
    private MVar<T> _data;
    private MVar<Object> _ack;
    
    public CVar(){
        _data = new MVar<T>();
        _ack = new MVar<Object>();
        _ack.put(null); // make _ack full
    }
    public void write(T obj) throws InterruptedException {
        _ack.take(); // make _ack empty
        _data.put(obj);
    }
    public T read() throws InterruptedException {
        T data = _data.take();
        _ack.put(null); // make _ack full
        return data;
    }
}

MVars can also be used to build a more useful asynchronous buffered channel that can store more than one message at a time. I will show you the construction, but it’s far from trivial. You might notice that it closely resembles a lock-free FIFO queue (although I chose to use locks to implement the MVar). As with all lock-free data structures, one has to be very careful when reasoning about their correctness. I’ll leave it as an exercise to the reader 😉 .

Item and Stream are mutually recursive data structures forming a linked list. An Item points to a Stream–the tail of the list. A Stream is an MVar containing an Item. You may look at a Stream as a sequence of alternating MVars and Items. An empty MVar may serve as a sentinel.

class Item<T> {
    private T _val;
    private Stream<T> _tail;
    
    public Item(T val, Stream<T> tail){
        _val = val;
        _tail = tail;
    }
    T value(){
        return _val;
    }
    Stream<T> tail(){
        return _tail;
    }
}

// It's just a typedef for an MVar that stores an Item
class Stream<T> extends MVar<Item<T>> {}

A Channel contains the Stream linked list stored in an MVar, which is called _read because the head of the list is where we read messages. The other MVar points to the end of the list (really an empty sentinel Stream). This is the write end of the list. The methods put and get (which are not synchronized!) perform some intricate manipulations characteristic of lock-free algorithms, making sure that at every step the queue is in a consistent state for concurrent access. Notice that put will only block if another put is in progress. Otherwise it’s asynchronous–there is no waiting for a receiver.

public class Channel<T> {
    private MVar<Stream<T>> _read;
    private MVar<Stream<T>> _write;
    
    public Channel() {
        _read = new MVar<Stream<T>>();
        _write = new MVar<Stream<T>>();
        Stream<T> hole = new Stream<T>();
        _read.put(hole);
        _write.put(hole);
    }
    public void put(T val)throws InterruptedException {
        Stream<T> newHole = new Stream<T>();
        Stream<T> oldHole = _write.take();
        _write.put(newHole);
        oldHole.put(new Item<T>(val, newHole));
    }
    public T get()throws InterruptedException {
        Stream<T> cts = _read.take();
        Item<T> item = cts.take();
        _read.put(item.tail());
        return item.value();
    }
}
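
A short usage sketch: the sender can put several messages without waiting for anybody, and the receiver drains them in order (the demo class is just for illustration):

public class ChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        final Channel<Integer> ch = new Channel<Integer>();
        // asynchronous sends: none of these block waiting for a receiver
        ch.put(1);
        ch.put(2);
        ch.put(3);
        Thread receiver = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 3; ++i)
                        System.out.println("Received: " + ch.get());
                } catch (InterruptedException e) {}
            }
        });
        receiver.start();
        receiver.join();
    }
}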

In the next few installments I’m planning to talk a little about “choice” and then tackle the actor-based message-passing paradigm (Erlang and Scala).


What’s the lowest level primitive that can be used to build a concurrent message passing system? As I discussed in my previous post, there are two message passing paradigms, synchronous and asynchronous.

Let’s start with the atom of synchronous message passing. In its purest form it is implemented in Concurrent ML (CML) in the form of a channel. A channel has two methods, send and recv. A thread that calls send blocks until another thread calls recv on the same channel. Similarly, a thread that calls recv will block if there is no blocked send to rendezvous with.

CML channels are typed, i.e., a channel for passing integers has a different type than a channel for passing characters.

Since not everybody is familiar with ML, I decided to implement an equivalent of CML channels in Java. I considered C++ and D, but the Java implementation is the simplest. In C++ I would have to use a C++0x compiler, which I don’t have; in the D programming language, condition variables are not as easy to use as in Java (although that might change in the future). This implementation works only for point-to-point communications–it does not support multiple senders or receivers. [Following readers’ comments I modified the implementation to take into account the so-called spurious wakeups–exits from wait not caused by notify.]

public class Channel<T> {
    private boolean _dataReady;
    private boolean _received;
    T _msg;

    Channel(){
        _dataReady = false;
        _received = false;
    }
    public synchronized void send(T msg) throws InterruptedException {
    	while (_dataReady)
    		wait();
        _msg = msg;
        _dataReady = true;
        notify();
        while (!_received)
            wait();
        _received = false;
    }
    public synchronized T recv() throws InterruptedException {
        while (!_dataReady)
            wait();
        T msg = _msg;
        _dataReady = false;
        _received = true;
        notify();
        return msg;
    }
}

This is probably not the simplest implementation, but it illustrates the principle. Notice that internally the message is passed through shared memory (the data member _msg). This is standard for intra-process, and often inter-process, message passing.

Synchronous channels can be used as building blocks in the implementation of asynchronous message passing. The asynchronous sender simply creates a worker thread that calls send and possibly blocks. The calling thread is free to proceed immediately. The efficiency of this sort of implementation depends heavily on how cheap thread creation is (although worker threads can be cached in thread pools). Any language runtime that uses native threads (including Java) is handicapped in this respect. Interestingly enough, Erlang and Haskell, which use the asynchronous model, have cheap threads; the synchronous CML, paradoxically, uses native threads.
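
Here is what that construction might look like on top of the synchronous Channel above (a sketch only; a production version would reuse worker threads from a pool):

public class AsyncSender<T> {
    private final Channel<T> _ch;

    public AsyncSender(Channel<T> ch) { _ch = ch; }

    // fire-and-forget: the worker thread does the (possibly blocking) rendezvous
    public void asyncSend(final T msg) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    _ch.send(msg); // may block until a receiver shows up
                } catch (InterruptedException e) {}
            }
        }).start();
    }
}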

In the next installment, I’ll describe the atom of asynchronous message passing that can be found in Haskell (don’t worry, I’ll translate it into Java).

For completeness, here’s the program I used to test my channels:

public class RecvThread extends Thread {
    private Channel<String> _ch;
    long _waitTime;

    public RecvThread(Channel<String> ch, long waitTime){
        _ch = ch;
        _waitTime = waitTime;
    }
    public void run()
    {
        try {
            Thread.sleep(_waitTime);
            String msg = _ch.recv();
            System.out.println("Message received: " + msg);
        } catch (InterruptedException e) {}
    }
}

public class TestChannel {
    public static void main(String[] args) {
        try {
            testReceiver();
            testSender();
        } catch (InterruptedException e) {
            System.out.println("Interrupted Exception");
        }
    }
    public static void testReceiver() throws InterruptedException 
    {
        Channel<String> ch = new Channel<String>();
        // send will happen first
        RecvThread t = new RecvThread(ch, 1000);
        t.start();
        ch.send("Hello before rendezvous");
    }
    public static void testSender() throws InterruptedException 
    {
        Channel<String> ch = new Channel<String>();
        RecvThread t = new RecvThread(ch, 0);
        t.start();
        Thread.sleep(1000);
        ch.send("Hello after sleep");
    }
}

In the first test, the sender blocks; in the second, the receiver blocks.


Recently I’ve been looking into message passing as a model for concurrency. It turns out that there are two warring camps in the message passing community. One believes that synchronous message passing is more fundamental, the other believes that asynchronous message passing is more basic. You might think that the discussion is moot, since one can be emulated by the other, but it’s not as simple as that.

Let me first explain the concepts. In message passing you have a sender and a receiver–they usually run in separate threads. The sender sends a message and the receiver receives it. All issues of memory sharing and concurrent access are hidden inside the communication channel. The client does no low level synchronization actions like locking. Which is a good thing–no low level races or deadlocks!

The receiver usually has a choice of synchronous or asynchronous access. It can block until a message is available, or it can peek to see if there is a message waiting. Usually both interfaces are available.

The major choice of paradigms is on the sender’s side.

  • In the synchronous model, the sender blocks until the receiver picks up the message. The two have to rendezvous.
  • In the asynchronous model, the sender drops the message and continues with its own business.

The synchronous model is used, among others, in CSP (Communicating Sequential Processes) and CML (Concurrent ML). The asynchronous one is used in Concurrent Haskell and in actor-based languages like Erlang or Scala.

Here’s the main argument of the synchronous camp: After calling send the sender has the guarantee that the message has been received by the receiver. The code after send may safely make this assumption. If you’re a believer in RPC (Remote Procedure Call) you’ll love this. Sending a message is just like making a function call, only the work is done in a separate thread.

To which the asynchronous camp answers: Yeah, but what about distribution? In a distributed system, the receiver may be in a different process on a different machine. There may be many reasons why the message doesn’t arrive at the recipient (the recipient is dead? a bulldozer cut the network cable?). In that case the sender will hang forever. The code after send may still assume safe delivery, but it will never be executed.

There is another, more subtle problem with synchronous message passing between machines–a network protocol might deliver messages in different order than they were sent. If the receiver’s code expects a certain sequence of messages, it might block forever if the messages are permuted.

All these problems may be overcome by queuing messages, building sophisticated protocols, and spawning helper threads. But is it really worth the trouble?

Yes, say the synchronous camp, if you want to formally prove the correctness of your programs. With synchronous message passing you can use the theoretical machinery of C.A.R. Hoare’s CSP. That’s an important advantage if you are writing your Ph.D. thesis, but maybe less so if you’re maintaining millions of lines of Erlang code.

For my next installment I’m already working on translating Haskell’s MVars to Java. This will be fun!



In my last post I made the mistake of publishing a piece of code that used C++ weak atomics–which are part of the new C++ Memory Model. I qualified the post by saying that I had no proof of correctness, and that its main purpose was to discourage the use of weak atomics. In a way the post was successful, because it did prove that weak atomics are very tricky indeed. As was pointed out to me by Anthony and Dmitriy, my implementation was incorrect (I modified the post retroactively to make this point clear). Additionally, in his blog, Anthony published a proof of the algorithm posted by Dmitriy. And that’s when the fun started…

Before I describe what happened, let me give you a bird’s eye view of the C++ Memory Model.

The Model

The purpose of a memory model is to define the semantics of concurrent operations at an abstract level–independent of the platform on which the program is running. Essentially, one has to define in what order program statements may be executed, and what memory values may be observed. The basic memory model, the one used by Java, is based on sequential consistency. Program statements are executed in program order on each processor, and the actions of multiple threads are interleaved in some (global) order. A memory load can only see the value that was written last, according to this global order.

The C++ strong atomics (the default) follow this model.

It’s the weak atomics that are trouble. Weak atomics let the programmer specify memory ordering of loads and stores that is less than sequentially consistent. Unlike the sequentially consistent model, the model for weak atomics cannot be described in terms of a state machine (this was pointed out and criticized by Herb Sutter in his Prism proposal). Instead the model is formulated as a set of global constraints imposed on executions.

In a state-machine formulation, a step the program takes is based on previous history. The step doesn’t have to be uniquely defined–the machine may be non-deterministic. The execution may split into multiple branches. What is important, though, is that the step taken does not depend on the future of the execution.

This is not true in the weak-atomics model. The way to think about it is that, in a non-sequentially-consistent execution, the values seen by memory loads may come either from the past or from the future. A processor may speculate–make a guess at the values that have not been stored yet. A speculation may be discarded if the guess turns out to be wrong, or accepted if it was right. So the acceptability of a given execution can only be determined by looking at the whole execution, from start to end. Understandably, this makes any correctness proofs extremely tricky.

The Proof

The proposed implementation of the Peterson lock using weak atomics can be reduced to the following:

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel);
r1 = oneWants.load(memory_order_acquire);

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel);
r3 = zeroWants.load(memory_order_acquire);

(The memory_order_acquire directives in the two loads are needed for synchronization with the unlock part of the algorithm, which is not included here.)

The hardest part of the proof is to show mutual exclusion–it should be impossible for both threads to enter the critical section at the same time. In Peterson’s algorithm, the entrance to the critical section is predicated on the values loaded into r1 and r3. If both values are false, the algorithm is broken.

Let’s consider the situation when both threads try to acquire the lock for the first time. The values of zeroWants and oneWants are initially false, and the value of victim is zero. Is it possible for both loads (zeroWants and oneWants) to see the initial values, rather than the ones saved by the opposing thread? The short answer is yes, as long as there is no intervening inter-thread synchronization.

In this case, the only synchronization that may happen is between a store-release in one thread and the corresponding load-acquire in the other. For instance, if the load part of the exchange operation on victim in thread 0 sees the value written by the store part of the exchange in thread 1 (that is, r0 is equal to 1), then these two operations are in a “synchronizes-with” relationship. We can then establish the following chronology:

  1. The store of true to oneWants in T1 happens before the store of 1 to victim (part of the exchange) in T1. This is guaranteed by program order.
  2. The store of 1 into victim in T1 happens before the load of victim in T0 (part of exchange). This is because r0 is 1 in this part of the proof.
  3. The load of victim in T0 happens before the load of oneWants in T0. This is guaranteed by program order.

Because the happens-before relation is transitive, we can conclude that the load of oneWants in T0 (item 3) must see the value stored in T1 (item 1), which is true, and therefore T0 won’t enter the critical section. So far so good.

That was the simple case. Here’s the controversial one: Both r0 and r2 are zero. In particular, it means that r0 came from the initial value of victim. What about r2? It could have come either from the exchange in T0 or from the initial value of victim (both are zero!). According to the memory model, we have to consider both cases!

If r2 came from the exchange in T0, then we can use the same reasoning as before (with the roles of the threads reversed) to conclude that T1 cannot enter the critical section.

But if the source of the zero in r2 is the original value of victim, then there is no synchronization between T0 and T1. Both threads can enter the critical section!

You might be thinking: Either the exchange in T0 comes first and the exchange in T1 must see its value, or the exchange in T1 comes first and the one in T0 must see its value–they can’t both see the original value. This just shows you how ingrained sequentially consistent thinking is. There is no meaning to “comes first” in the world of weak atomics. Two events may only be sequenced if they occur in the same thread, or if there is a synchronization path between them. Here we are still at the stage of trying to establish whether the two exchanges are synchronized or not.

The Controversy

Let me spell it out: Both exchanges see the initial value of victim. Since they didn’t interact, there is no synchronizes-with relationship between them. Therefore we can’t establish any inter-thread chronology, and it’s perfectly possible for both threads to see the original values of zeroWants and oneWants.

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel); // sees initial value, 0
r1 = oneWants.load(memory_order_acquire); // sees initial value, false

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel); // sees initial value, 0
r3 = zeroWants.load(memory_order_acquire); // sees initial value, false

I had concluded that the implementation was broken, but I let Anthony defend his proof.

Here’s the argument he presented, based on the modification order of victim. The C++ memory model enforces global modification order for every memory location. The victim variable starts at zero and then is modified twice by the exchanges in T0 and T1. If thread T1 sees the original value, zero, of victim then the same exchange modifies it to one. The modification order is therefore (T1: 1, T0: 0). If we assume that an atomic exchange must always see the preceding value in the modification order, then we must conclude that the exchange in T0 must read 1. Therefore it synchronizes with the exchange in T1 and mutual exclusion is assured.

And here’s the clincher: The draft C++ Standard did not specify what value must be seen by the atomic exchange (or, more generally, by an atomic Read-Modify-Write operation). I confirmed that with Hans Boehm, who concluded that this apparent hole in the standard must be corrected.

The next draft of the Standard will say that a RMW operation must see the last value in the modification order of the memory object in question.

Only then will the proof of correctness of the weak-atomics implementation of the Peterson lock be complete.

Conclusion

I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!

Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to top experts whose numbers can be counted on the fingers of one hand.

I’m grateful to (in order of appearance) Anthony Williams, Dmitriy V’jukov, Hans Boehm, and others who contributed to this post.


The question that’s been bugging me lately was: How does C++ make the use of atomic variables both portable and efficient?

I knew how Java volatile worked–it enforced sequential consistency, which is not always the most efficient thing to do.

C++0x has atomic variables which also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java’s volatile (interestingly, Java’s volatile doesn’t enforce atomicity–there is an atomic library in Java, though, which does).

However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.

After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, let’s say, the alpha or the Power PC, should produce the necessary fencing.

To make things more interesting, there are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just the matter of skipping all fencing.

I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?

Specifying memory ordering in C++

The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read so I’m grateful to Hans Boehm for clarifying some of the fine points.

Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1);
  ready.store(true);

Thread 1:

  if (ready.load())
    assert (data.load() == 1);

As you can see, atomic objects have methods store and load that are used for writing and reading underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate memory fences after each store.

But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?

To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1, memory_order_release);
  ready.store(true, memory_order_release);

Thread 1:

  if (ready.load(memory_order_acquire))
    assert (data.load(memory_order_acquire) == 1);

What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.

Warning: Even if you know that your program will only ever run on an x86, you can’t remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from doing the reordering.

Fine-tuning memory ordering

Memory order can be specified using the following enumeration:

namespace std {
 typedef enum memory_order {
   memory_order_relaxed,
   memory_order_consume,
   memory_order_acquire,
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
 } memory_order; 
}

The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.

  • memory_order_acquire: guarantees that no loads or stores that follow the current load (in program order) will be moved before it.
  • memory_order_release: guarantees that no loads or stores that precede the current store (in program order) will be moved after it.
  • memory_order_acq_rel: combines the two previous guarantees.
  • memory_order_consume: a potentially weaker form of memory_order_acquire that only orders the current load with respect to subsequent operations that are data-dependent on it. For instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it (yes, even that is not guaranteed on all platforms!).
  • memory_order_relaxed: all reorderings are okay.

As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).

But I have also shown before that Peterson’s algorithm won’t work on the x86 without fencing. So it’s not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.

Now here the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here’s what I originally wrote:

[begin quote] The correct C++ code in this case is:

Peterson::Peterson() {
   _victim.store(0, memory_order_release);
   _interested[0].store(false, memory_order_release);
   _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 - me; // the other thread
   _interested[me].exchange(true, memory_order_acq_rel);
   _victim.store(me, memory_order_release);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}

The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]

You can read the comments to this post–especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the “acquire” part of memory_order_acq_rel has no meaning unless another thread writes to (“releases”) the same variable. Since only thread 0 writes to _interested[0] and only thread 1 writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy’s implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony’s proof of it).

Conclusion

What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison to that, the old fashioned assembly programming looks like a piece of cake.

Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.

Microsoft volatile

No, I’m not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true because some compiler vendors took the liberty to add non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). This new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead it guaranteed acquire and release semantics (which, on the x86, didn’t result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won’t! I thought that was an interesting piece of trivia.


How does Java do it?

Motivation for C++ programmers

Things like double-checked locking pattern and Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse Java volatile with C++ volatile–the latter has nothing to do with multithreading.) Obviously the Java compiler knows what kind of fences are necessary on relaxed memory architectures, including the x86. I thought it was worth looking at.

Why the sudden interest in Java? Because the Java memory model is very relevant to C++0x. The keyword here is sequential consistency. Java enforces sequential consistency on all accesses to volatile variables. C++0x introduces atomic objects which, by default, also follow sequential consistency. So C++ atomics will, by default, behave almost exactly like Java volatile variables.
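
For instance, here is the publication half of the double-checked locking pattern written with Java volatile variables (a minimal sketch); sequential consistency guarantees that a reader that sees ready == true also sees data == 1:

public class Publication {
    static volatile int data = 0;
    static volatile boolean ready = false;

    static void publisher() {   // Thread 0
        data = 1;
        ready = true;
    }
    static void reader() {      // Thread 1
        if (ready)
            assert data == 1;   // cannot fail under the Java memory model
    }
}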

Also, why even bother studying the quirks of the x86? Can’t we just stick to locks and, occasionally, to default atomics and let the compiler/library writers worry about how they translate into fences? Absolutely!

This should be the end of the story, except that a lot of programmers seem to “know better.” They will try to optimize multithreaded algorithms and get into a world of trouble. So if you know anybody who might be tempted to write or optimize lock-free algorithms, read on.

Why the x86? Not only because it’s the most common chip, but also because its memory model is deceptively “normal.” It’s much more likely for programmers to attempt to play games with the x86 than, for instance, with the alpha. The rest of this post should serve as motivation to stay away from such games.

What is Sequential Consistency?

Sequential consistency is how we naturally think about multithreaded programs. It’s also how we think about the world. If A happens before B then it cannot be true that B happens before A. If one processor stores 1 in variable x, and another processor stores 1 in variable y, then the sequence of events is either:

  • x flips to 1 and then y flips to 1, or
  • y flips to 1 and then x flips to 1

(assume both are initially zero). Which sequence actually happened could be observed by a third processor, which loads both variables. If, for instance, it sees x == 1 and y == 0, it concludes that the write to x happened earlier. If it sees x == 0 and y == 1, it concludes that the write to y happened earlier (if it sees x == y, it can’t tell).

Now, in a sequentially consistent world, a fourth processor could not see the two events in a different order than what the third processor observed. We assume that, for each particular run of the program, there is a global sequence of events that is seen by all processors.

Of course, on multicore processors many things can happen simultaneously, and it usually doesn’t matter–except when memory access is involved. The sequentially consistent model assumes that there is a single switch between all processors and memory, and that only one processor at a time can access it. This imaginary switch serves as the serialization point.
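
A sketch of the same thought experiment in Java (volatile keeps the stores and loads sequentially consistent, so all observers agree on a single interleaving of the two writes):

public class SeqCstDemo {
    static volatile int x = 0;
    static volatile int y = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread writerX = new Thread(new Runnable() { public void run() { x = 1; } });
        Thread writerY = new Thread(new Runnable() { public void run() { y = 1; } });
        // the observer sees the two writes in some global order;
        // another observer could never see them in the opposite order
        Thread observer = new Thread(new Runnable() {
            public void run() { System.out.println("x = " + x + ", y = " + y); }
        });
        writerX.start(); writerY.start(); observer.start();
        writerX.join(); writerY.join(); observer.join();
    }
}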

The x86 is not sequentially consistent

Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.

The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive–hundreds of cycles.

I don’t care, says Java

The Java memory model requires that all access to shared volatile variables be sequentially consistent across all threads, even when they run on different cores. How do they enforce it?

The primary source for this kind of information is Doug Lea’s excellent The JSR-133 Cookbook for Compiler Writers.

Here’s the big picture: When Java is translated into bytecode, special codes are issued around volatile-variable access. These codes tell the Java runtime where memory fences should be inserted. The bytecodes are supposed to be processor independent; the fences are highly processor dependent. I’ll call the special bytecodes memory barriers as opposed to processor-specific memory fences. Memory barriers are inserted conservatively–the compiler assumes the most relaxed memory model (in other words, the bytecode has to run correctly on an alpha, whose memory model is so relaxed you’d think it was conceived in a hot tub).

Fences should go between memory accesses, so Java barriers are named after the kind of accesses (load or store) they separate. There is a LoadLoad barrier between volatile reads, LoadStore barrier between a read and a write, StoreLoad barrier between a write and a read, and StoreStore barrier between writes.

Here’s the first problem: the compiler often sees only one access of the pair. In that case it has to assume the worst and use the most restrictive barrier. With that in mind, here are some of the cookbook rules:

  • Issue a StoreStore barrier before each volatile store.
  • Issue a StoreLoad barrier after each volatile store.
  • Issue LoadLoad and LoadStore barriers after each volatile load.

That’s a lot of barriers! Fortunately, the compiler can optimize some of them away using dataflow analysis.
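Here is a minimal sketch of where those barriers land around a volatile field; the class and field names are mine, and the barriers are shown as comments.

// A sketch of the cookbook rules above. At this stage the barriers are abstract
// bytecode-level markers, not yet processor-specific fences.
class Published
{
    int data;            // ordinary field
    volatile int ready;  // volatile field

    void writer()
    {
        data = 42;
        // StoreStore barrier: the store to data cannot drift below...
        ready = 1;       // ...this volatile store
        // StoreLoad barrier: nor can this store be reordered with later loads
    }

    void reader()
    {
        int r = ready;   // volatile load
        // LoadLoad + LoadStore barriers: later loads and stores stay below it
        int d = data;
        System.out.println("ready=" + r + ", data=" + d);
    }
}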

The next step is to run the bytecode on a particular processor (or compile it to native code). If the processor is an alpha, practically all Java barriers translate into some kind of fences. But if it’s an x86, all but the StoreLoad barriers are turned into no-ops. The StoreLoad barrier is either translated into an mfence or a locked instruction (lock xchg).

When executed on an x86, the barrier rules boil down to this one: Issue an mfence after each volatile store (or turn it into lock xchg).
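In code, the net effect looks something like this (a sketch with made-up names; the comments describe what the JIT would emit on an x86):

// On the x86, only the volatile store needs a fence; the barriers around
// volatile loads become no-ops, so a volatile load is an ordinary load.
class Flag
{
    volatile int flag;

    void publish()
    {
        flag = 1;        // volatile store: followed by an mfence (or lock xchg)
    }

    int poll()
    {
        return flag;     // volatile load: no fence emitted
    }
}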

Peterson lock on an x86 in Java

Figuring out the details of the Peterson lock in Java was not easy. I’m grateful to Doug Lea for patiently explaining some of the gotchas to me. I am solely responsible for any mistakes, though.

In my previous post I came to the conclusion that, for the Peterson lock to work on an x86, an mfence is needed. Here’s the relevant pseudo-code:

Thread 0:
  store(zeroWants, 1)
  mfence
  store(victim, 0)
  // Java puts another fence here
  r0 = load(oneWants)
  r1 = load(victim)

Thread 1:
  store(oneWants, 1)
  mfence
  store(victim, 1)
  // Java puts another fence here
  r0 = load(zeroWants)
  r1 = load(victim)

In Java, all three variables, zeroWants, oneWants, and victim, would be declared volatile. Using the x86 Java translation rules, that would mean putting an mfence after every store to these variables.
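For concreteness, here is a sketch of that Java version; the class and method names are mine, while the fields follow the pseudo-code.

// A sketch of the two-thread Peterson lock in Java. The volatile stores to
// zeroWants, oneWants, and victim are what make the JIT emit the fences
// discussed below.
class PetersonJava
{
    private volatile boolean zeroWants = false;
    private volatile boolean oneWants  = false;
    private volatile int victim = 0;

    void lock(int me)                 // me is the thread ID, 0 or 1
    {
        if (me == 0)
        {
            zeroWants = true;         // volatile store: fence after it
            victim = 0;               // volatile store: fence after it
            while (oneWants && victim == 0)
                continue;             // spin
        }
        else
        {
            oneWants = true;          // volatile store: fence after it
            victim = 1;               // volatile store: fence after it
            while (zeroWants && victim == 1)
                continue;             // spin
        }
    }

    void unlock(int me)
    {
        if (me == 0) zeroWants = false; else oneWants = false;
    }
}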

The fences after the writes to zeroWants and oneWants are just what the doctor ordered. From my previous post we know why they are needed on an x86.

The mfence after the write to victim is something new though. It isn’t strictly necessary on an x86, but to see that you really have to analyze the whole algorithm–something that’s beyond the capabilities of modern compilers.

My original (turns out, incorrect) reasoning was that no fence is needed between the write to victim and the subsequent read because, on an x86, reads and writes to the same location are never reordered. Well, that’s not the whole story because…

Intra-processor forwarding is allowed

Consider the following example from the x86 spec. Initially x == y == 0.

Processor 0:
  mov [_x], 1
  mov r1, [_x]
  mov r2, [_y]

Processor 1:
  mov [_y], 1
  mov r3, [_y]
  mov r4, [_x]

The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y on processor 1 hasn’t happened yet–in other words, that processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (after all, processor 0 saw it!). Guess what–sometimes it doesn’t.

The technical reason for this behavior is that each write initially sits in its own core’s store buffer: the write to x is immediately visible on processor 0 (whose later loads can be satisfied directly from that buffer), before it propagates to processor 1. Similarly, the write to y is immediately visible on processor 1 before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though each sees its own writes immediately.

This result violates sequential consistency and therefore cannot be tolerated in Java. Hence the need for the mfence between the write to victim and the subsequent read of victim.

Gosh darn it! I want to optimize it!

Having said that, I must point out that, in this particular case, the lack of sequential consistency is not a deal breaker. It’s okay for thread 0 of the Peterson lock to read a stale (local) value of victim, even if “in the meanwhile” thread 1 has overwritten it. All it means is that some unnecessary spins might be performed. This doesn’t violate the lock’s principles and doesn’t lead to a deadlock, as long as the new value of victim eventually reaches thread 0.

I’m pretty sure of this reasoning–at least at the time of this writing. However, I wouldn’t be totally surprised if somebody found a fault in it.

The bottom line is that even such an “obvious” optimization is not obviously correct. The proof of correctness of the Peterson lock is based on the assumption of sequential consistency. If we relax this assumption even slightly, we have to redo the whole proof.

Any volunteers?


Multiprocessors with relaxed memory models can be very confusing. Writes can be seen out of order, reads can be speculative and return values from the future–what a mess! In order to impose some kind of consistency you have to use memory fences, and there are several kinds of them.

The x86 seems to be an oasis in the perilous landscape of relaxed-memory multicores. The Intel x86 memory model, detailed in the Intel 64 Architecture Memory Ordering White Paper and in the AMD spec, the AMD64 Architecture Programmer’s Manual, lists a lot of memory ordering guarantees, among them:

  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • Stores are not reordered with older loads.
  • In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  • In a multiprocessor system, stores to the same location have a total order.
  • In a multiprocessor system, locked instructions have a total order.
  • Loads and stores are not reordered with locked instructions.

The x86 also has (expensive–on the order of 100 cycles) memory fence instructions, mfence, lfence, and sfence; but, considering all those guarantees, why would anyone use them? The famous double-checked locking pattern, on the x86, works just fine without any fences.
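For reference, here is the double-checked locking idiom in its standard Java form (a sketch; the volatile is what makes it correct under the Java memory model, beyond just the x86):

// Double-checked locking: the fast path reads the reference without taking
// a lock; the lock is taken only once, to construct the object.
class Singleton
{
    private static volatile Singleton instance;

    static Singleton getInstance()
    {
        Singleton local = instance;            // first, unsynchronized check
        if (local == null)
        {
            synchronized (Singleton.class)
            {
                local = instance;              // second check, under the lock
                if (local == null)
                {
                    local = new Singleton();
                    instance = local;          // publish the fully constructed object
                }
            }
        }
        return local;
    }
}

On the x86, the volatile read on the fast path compiles to an ordinary load, which is why the pattern appears to need no explicit fences there.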

Why fences would ever be needed on an x86 is a very important question, because you don’t want the compiler to over-fence your code and bring it to a crawl. On the other hand, you don’t want it to emit incorrect code. I decided to look for some answers.

There is one important non-guarantee listed in the x86 spec:

Loads may be reordered with older stores to different locations

Here’s an example from the x86 spec: x and y are in shared memory, both initialized to zero; r1 and r2 are processor registers.

Thread 0:
  mov [_x], 1
  mov r1, [_y]

Thread 1:
  mov [_y], 1
  mov r2, [_x]

When the two threads execute on different cores, the non-intuitive result r1 == 0 and r2 == 0 is allowed. Notice that such a result is consistent with each processor executing its load before its store (the two access different locations).

How important is this example?

I needed to find an algorithm that would be broken by this relaxation. I’ve seen Dekker’s algorithm mentioned in this context. There is a more modern version of it used in the Peterson lock. It’s a mutual exclusion algorithm that works for two threads only. It assumes that each thread has access to a thread-local ID, which is 0 for one thread and 1 for the other. Here’s the version adapted from The Art of Multiprocessor Programming:

class Peterson
{
private:
    // indexed by thread ID, 0 or 1
    bool _interested[2];
    // who's yielding priority?
    int _victim;
public:
    Peterson()
    {
        _victim = 0;
        _interested[0] = false;
        _interested[1] = false;
    }
    void lock()
    {
        // threadID is thread local,
        // initialized either to zero or one
        int me = threadID;
        int he = 1 - me;
        _interested[me] = true;
        _victim = me;
        while (_interested[he] && _victim == me)
            continue;
    }
    void unlock()
    {
        int me = threadID;
        _interested[me] = false;
    }
};

To explain how it works, let me impersonate one of the threads. When I (the thread) want to take a lock, I set my _interested slot to true. I am the only writer of this slot, although the other guy can read it. I also toggle the _victim switch to point to myself. Now I check if the other guy is also interested. As long as he is, and I am the victim, I spin. But once he becomes uninterested or turns himself into a victim, the lock is mine. When I’m done executing critical code, I reset my _interested slot, thus potentially releasing the other guy.

Let me simplify this a little, and partition the code between the two threads. Instead of an array _interested, there are two variables zeroWants and oneWants corresponding to the two slots.

Initially:
  zeroWants = false;
  oneWants = false;
  victim = 0;

Thread 0:
  zeroWants = true;
  victim = 0;
  while (oneWants && victim == 0)
      continue;
  // critical code
  zeroWants = false;

Thread 1:
  oneWants = true;
  victim = 1;
  while (zeroWants && victim == 1)
      continue;
  // critical code
  oneWants = false;

Finally, let me rewrite the initial part of the execution in pseudo assembly.

Thread 0:
  store(zeroWants, 1)
  store(victim, 0)
  r0 = load(oneWants)
  r1 = load(victim)

Thread 1:
  store(oneWants, 1)
  store(victim, 1)
  r0 = load(zeroWants)
  r1 = load(victim)

Now look at the loads and stores to zeroWants and oneWants. They follow the same pattern as in the x86 reordering example. The processor is free to move the read of oneWants before the write to zeroWants (and to victim). Similarly, it can move the read of zeroWants before the write to oneWants. We may end up with the following execution:

Thread 0:
  r0 = load(oneWants)
  store(zeroWants, 1)
  store(victim, 0)
  r1 = load(victim)

Thread 1:
  r0 = load(zeroWants)
  store(oneWants, 1)
  store(victim, 1)
  r1 = load(victim)

Initially, both zeroWants and oneWants are zero, so the loads of oneWants and zeroWants (r0 in each thread) may both return zero. In that case, neither spin loop is entered and both threads march into the critical section. The Peterson lock on the x86 is broken!

Of course, there is a way to fix it. An mfence will force the correct ordering. It can be put anywhere between the store of zeroWants and the load of oneWants in one thread, and between the store of oneWants and the load of zeroWants in the other thread.
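Concretely, the fixed entry protocol looks like this (same pseudo-assembly notation as before, with the mfence placed right after the store to the wants variable):

Thread 0:
  store(zeroWants, 1)
  mfence
  store(victim, 0)
  r0 = load(oneWants)
  r1 = load(victim)

Thread 1:
  store(oneWants, 1)
  mfence
  store(victim, 1)
  r0 = load(zeroWants)
  r1 = load(victim)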

So here we have a real-life example that requires an actual fence on an x86. The problem is real!
