Parallelism



Can a data race not be a bug? In the strictest sense I would say it’s always a bug. A correct program written in a high-level language should run the same way on every processor present, past, and future. But there is no prescription, or even a convention, about what a processor should (or shouldn’t) do when it encounters a race. This is usually described in higher-level language specs by the ominous phrase: “undefined behavior.” A data race could legitimately reprogram your BIOS, wipe out your disk, and stop the processor’s fan, causing a multi-core meltdown.

Data race: Multiple threads accessing the same memory location without intervening synchronization, with at least one thread performing a write.

However, if your program is only designed to run on a particular family of processors, say the x86, you might allow certain types of data races for the sake of performance. And as your program matures, i.e., goes through many cycles of testing and debugging, the proportion of buggy races to benign races keeps decreasing. This becomes a real problem if you are using a data-race detection tool that cannot distinguish between the two. You get swamped by false positives.

Microsoft Research encountered and dealt with this problem when running their race detector called DataCollider on the Windows kernel (see Bibliography). Their program found 25 actual bugs, and almost an order of magnitude more benign data races. I’ll summarize their methodology and discuss their findings about benign data races.

Data Races in the Windows Kernel

The idea of the program is very simple. Put a hardware breakpoint on a shared memory access and wait for one of the threads to stumble upon it. This is a code breakpoint, which is triggered when the particular code location is executed. The x86 also supports another kind of breakpoint, called a data breakpoint, which is triggered when the program accesses a specific memory location. So when a thread hits the code breakpoint, DataCollider installs a data breakpoint at the location the thread was just accessing. It then stalls the current thread and lets all other threads run. If any one of them hits the data breakpoint, it’s a race (as long as one of the accesses is a write). Consider this: if there had been any synchronization (say, a lock acquisition) between the two accesses, the second thread would have been blocked from accessing that location. Since it wasn’t, it’s a classic data race.

Notice that this method might not catch all data races, but it doesn’t produce false positives. Except, of course, when the race is considered benign.

There are other interesting details of the algorithm. One is the choice of code locations for installing breakpoints. DataCollider first analyzes the program’s assembly code to create a pool of memory accesses. It discards all thread-local accesses and explicitly synchronized instructions (for instance, the ones with the LOCK prefix). It then randomly picks locations for breakpoints from this pool. Notice that rarely executed paths are as likely to be sampled as the frequently executed ones. This is important because data races, like fugitive criminals, often hide in less frequented places.

Pruning Benign Races

90% of the data races caught by DataCollider in the Windows kernel were benign. For several reasons it’s hard to say how general this result is. First, the kernel had already been tested and debugged for some time, so many of the low-hanging concurrency bugs had already been picked. Second, operating system kernels are highly optimized for a particular processor and might use all kinds of tricks to improve performance. Finally, kernels often use unusual synchronization strategies. Still, it’s interesting to see what shape benign data races take.

It turns out that half of all false positives came from lossy counters. There are many places where statistics are gathered: counting certain kinds of events, either for reporting or for performance enhancements. In those situations losing a few increments is of no consequence. However, not all counters are lossy and, for instance, a data race in reference counting is a serious bug. DataCollider uses a simple heuristic to detect lossy counters: they are the ones that are always incremented. A reference counter, on the other hand, is as often incremented as decremented.
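
To make the distinction concrete, here’s a hedged illustration (the classes are hypothetical, not taken from DataCollider). The statistics counter is only ever incremented, so a lost update merely skews a report; the reference count is both incremented and decremented, and a lost update there leads to a leak or a premature delete.

struct CacheStats {
    long hits = 0;
    void onHit() { ++hits; }   // unsynchronized increment: losing a few is harmless
};

struct RefCounted {
    int refs = 1;
    void addRef() { ++refs; }                          // a racy increment here is a real bug
    void release() { if (--refs == 0) delete this; }   // may double-delete or leak
};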

Another benign race happens when one thread reads a particular bit in a bitfield while another thread updates another bit. A bit update is a read-modify-write (RMW) sequence: the thread reads the previous value of the bitfield, modifies one bit, and writes the whole bitfield back. Other bits are overwritten in the process too, but their new values are the same as the old values. A read from another thread of any of the unchanged bits does not interfere with the write, at least not on the x86. Of course, if yet another thread modified one of those bits, it would be a real bug, and it would be caught separately. The pruning of this type of race requires analysis of the surrounding code (looking for the masking of the other bits).
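
Here’s a hedged sketch of that scenario (the flag names are made up). Thread B’s read-modify-write of bit 3 rewrites bit 0 with its old value, so thread A’s concurrent read of bit 0 is unaffected, at least on the x86; a third thread writing bit 0 would turn this into a real bug.

#include <cstdint>

enum : uint32_t { INITIALIZED = 1u << 0, DIRTY = 1u << 3 };

uint32_t g_flags = INITIALIZED;   // shared bitfield, deliberately not atomic

bool isInitialized() {            // thread A: reads only bit 0
    return (g_flags & INITIALIZED) != 0;
}

void markDirty() {                // thread B: read-modify-write of the whole word
    g_flags |= DIRTY;             // bit 0 is rewritten with its old value
}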

The Windows kernel also has some special variables that are racy by design; the current time is one such example. DataCollider has these locations hard-coded and automatically prunes them away.

There are benign races that are hard to prune automatically, and those are left for manual pruning (in fact, DataCollider reports all races; it just de-emphasizes the ones it considers benign). One of them is the double-checked locking pattern (DCLP), where a thread makes a non-synchronized read that is later re-confirmed under the lock. This pattern happens to work on the x86, although it definitely isn’t portable.
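
For reference, here is a minimal sketch of DCLP (hypothetical names; this is the general pattern being described, not DataCollider’s code). The first, unsynchronized read of g_config is the data race; it is re-confirmed under the lock, and the whole thing happens to work on the x86 but isn’t portable.

#include <mutex>

struct Config { /* ... */ };

Config*    g_config = nullptr;
std::mutex g_mutex;

Config* getConfig() {
    if (g_config == nullptr) {                    // unsynchronized fast-path read
        std::lock_guard<std::mutex> lock(g_mutex);
        if (g_config == nullptr)                  // re-confirmed under the lock
            g_config = new Config();
    }
    return g_config;
}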

Finally, there is the interesting case of idempotent writes: two racing writes that happen to write the same value to the same location. Even though such scenarios are easy to prune, the implementers of DataCollider decided not to prune them, because more often than not they led to the uncovering of concurrency bugs. The table below summarizes the various cases.

Benign race                        Differentiate from                  Pruned?
Lossy counter                      Reference counting                  Yes
Read and write of different bits   Read and write of the whole word    Yes
Deliberately racy variables                                            Yes
DCLP                                                                   No
Idempotent writes                                                      No

Conclusion

In an ideal world there would be no data races. But a concurrency bug detector must take into account the existence of benign data races. In the early stages of product testing the majority of detected races are real bugs. It’s only when chasing the most elusive of concurrency bugs that it becomes important to weed out benign races. But it’s the elusive ones that bite the hardest.

Bibliography

  1. John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk, Effective Data-Race Detection for the Kernel

C++ is like an oil tanker — it takes a long time for it to change course. The turbulent reefs towards which C++ has been heading were spotted on the horizon more than ten years ago. I’m talking, of course, about the end of smooth sailing under Moore’s law and the arrival of multicore processors. It took six years to acknowledge the existence of concurrency in the C++11 Standard, but that’s only the beginning. It’s becoming more and more obvious that a major paradigm shift is needed if C++ is to remain relevant in the new era.

Why do we need a new paradigm to deal with concurrency? Can’t we use object oriented programming with small modifications? The answer to this question goes to the heart of programming: it’s about composability. We humans solve complex problems by splitting them into smaller subproblems. This is a recursive process: we split subproblems into still smaller pieces, and so on. Eventually we reach a problem size that can be easily translated into computer code. We then have to compose all these partial solutions into larger programs.

The key to composability is being able to hide complexity at each level. This is why object oriented programming has been so successful. When you’re implementing an object, you have to deal with its internals, with state transitions, intermediate states, etc. But once the object is implemented, all you see is the interface. The interface must be simpler than the implementation for object oriented programming to make sense. You compose larger objects from smaller objects based on their interfaces, not the details of their implementation. That’s how object oriented programming solves the problem of complexity.

Unfortunately, objects don’t compose in the presence of concurrency. They hide the wrong kind of things. They hide sharing and mutation. Let me quote the definition of data race: Two or more threads accessing the same piece of memory at the same time, at least one of them writing. In other words: Sharing + Mutation = Data Race. Nothing in the object’s interface informs you about the possibility of sharing and mutation inside the object’s implementation. Each object in isolation may be data-race-free but their composition may inadvertently introduce data races. And you won’t know about it unless you study the details of their implementation down to every single memory access.
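
Here’s a hedged sketch of how that can happen (the Logger class is hypothetical). Each object looks self-contained, but both share a static buffer behind the interface. Either object used from a single thread is fine; compose them into a two-threaded program, as indicated in the comments, and you have a data race that no interface warned you about.

#include <string>
#include <vector>

class Logger {
public:
    void log(std::string const & msg) { buffer().push_back(msg); }  // hidden sharing + mutation
private:
    static std::vector<std::string> & buffer() {
        static std::vector<std::string> buf;   // shared by every Logger instance
        return buf;
    }
};

// thread 1:  Logger a; a.log("start");
// thread 2:  Logger b; b.log("stop");   // races with thread 1 on buf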

In Java, an attempt had been made to mitigate this problem: every object is equipped with a mutex, which is engaged by declaring a method synchronized. This is not a scalable solution. Even Java’s clever thin-lock implementation incurs non-negligible performance overhead, so it is used only when the programmer is well aware of potential races, which requires deep insight into the implementation of all subobjects, exactly the thing we are trying to avoid.

More importantly, locking itself doesn’t compose. There’s a classic example of a locked bank account whose deposit and withdraw methods are synchronized by a lock. The problem occurs when one tries to transfer money from one account to another. Without exposing the locks, it’s impossible to avoid a transient state in which the funds have already left one account but haven’t reached the second. With locks exposed, one may try to hold both locks during the transfer, but that creates a real potential for deadlocks. (Software Transactional Memory provides a composable solution to this problem, but there are no practical implementations of STM outside of Haskell and Clojure.)
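
A hedged sketch of that example (the class is hypothetical): deposit and withdraw are individually synchronized, but a transfer built on top of them exposes a transient state; grabbing both locks instead invites deadlock when two transfers run in opposite directions.

#include <mutex>

class Account {
public:
    void deposit(int amount) {
        std::lock_guard<std::mutex> lock(_mtx);
        _balance += amount;
    }
    void withdraw(int amount) {
        std::lock_guard<std::mutex> lock(_mtx);
        _balance -= amount;
    }
private:
    std::mutex _mtx;
    int _balance = 0;
};

void transfer(Account & from, Account & to, int amount) {
    from.withdraw(amount);  // another thread can observe the money "in flight" here
    to.deposit(amount);
}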

Moreover, if we are interested in taking advantage of multicores to improve performance, the use of locks is a non-starter. Eking out parallel performance is hard enough without locks, given all the overheads of thread management and Amdahl’s law. Parallelism requires a drastically different approach.

Since the central problem of concurrency is the conflict between sharing and mutation, the solution is to control these two aspects of programming. We can do mutation to our heart’s content as long as there’s no sharing. For instance, we can mutate local variables; or we can ensure unique ownership by making deep copies, using move semantics, or by employing unique_ptrs. Unique ownership plays a very important role in message passing, allowing large amounts of data to be passed cheaply between threads.
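
Here’s a minimal sketch of that idea (illustrative only): the buffer is moved into the receiving thread through a unique_ptr, so after the handover the sender can’t touch it, there is no sharing, and the payload itself needs no synchronization.

#include <memory>
#include <thread>
#include <vector>

void consume(std::unique_ptr<std::vector<int>> data) {
    data->push_back(42);   // sole owner: free to mutate without locks
}

int main() {
    std::unique_ptr<std::vector<int>> buffer(new std::vector<int>(1000000, 0));
    std::thread worker(consume, std::move(buffer));  // ownership moves with the message
    // buffer is now null; using it here would be a bug, but not a data race
    worker.join();
}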

However, the key to multicore programming is controlling mutation. This is why functional languages have been steadily gaining ground in concurrency and parallelism. In a nutshell, functional programmers have found a way to program using what, to all intents and purposes, looks like immutable data. An imperative programmer, when faced with immutability, is as confused as a barbecue cook in a vegetarian kitchen. And the truth is that virtually all data structures from the C++ standard library are unsuitable for this kind of programming — the standard vector being the worst offender. A continuous slab of memory is perfect for random or sequential access, but the moment mutation is involved, you can’t share it between threads. Of course, you can use a mutex to lock the whole vector every time you access it, but as I explained already, you can forget about performance and composability of such a solution.

The trick with functional data structures is that they appear immutable, and therefore require no synchronization when accessed from multiple threads. Mutation is replaced by construction: you construct a new object that’s a clone of the source object but with the requested modification in place. Obviously, if you tried to do this with a vector, you’d end up with a lot of copying. But functional data structures are designed for maximum sharing of representation. So a clone of a functional object will share most of its data with the original, and only record a small delta. The sharing is totally transparent since the originals are guaranteed to be immutable.

A singly-linked list is a classic, if somewhat trivial, example of such a data structure. Adding an element to the front of a list requires only the creation of a single node to store the new value and a pointer to the original (immutable) list. There are also many tree-like data structures that are logarithmically cheap to clone-mutate (red-black trees, leftist heaps). Parallel algorithms are easy to implement with functional data structures, since the programmer doesn’t have to worry about synchronization.
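
As a reminder of how this looks in C++ (a simplified sketch along the lines of the persistent list from my earlier posts): push_front allocates exactly one node and shares the whole tail with the original list; nothing that has already been constructed is ever modified.

#include <memory>

template<class T>
class List {
    struct Node {
        Node(T v, std::shared_ptr<const Node> tail)
            : _v(v), _tail(std::move(tail)) {}
        T _v;
        std::shared_ptr<const Node> _tail;
    };
    explicit List(std::shared_ptr<const Node> node) : _head(std::move(node)) {}
    std::shared_ptr<const Node> _head;
public:
    List() {}                                  // empty list
    bool isEmpty() const { return !_head; }
    T front() const { return _head->_v; }
    List pop_front() const { return List(_head->_tail); }
    List push_front(T v) const {               // O(1): the tail is shared, not copied
        return List(std::make_shared<const Node>(v, _head));
    }
};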

Functional data structures, also known as “persistent” data structures, are naturally composable. This follows from the composability of immutable data — you can build larger immutable objects from smaller immutable objects. But there’s more to it: This new way of mutating by construction also composes well. A composite persistent object can be clone-mutated by clone-mutating only the objects on the path to the mutation; everything else can be safely shared.

Concurrency also introduces nonstandard flows of control. In general, things don’t progress sequentially. Programmers have to deal with inversion of control, jumping from handler to handler, keeping track of shared mutable state, etc. Again, in functional programming this is nothing unusual. Functions are first class citizens and they can be composed in many ways. A handler is nothing but a continuation in the continuation passing style. Continuations do compose, albeit in ways that are not familiar to imperative programmers. Functional programmers have a powerful compositional tool called a monad that, among other things, can linearize inverted flow of control. The design of libraries for concurrent programming makes much more sense once you understand that.

A paradigm shift towards functional programming is unavoidable and I’m glad to report that there’s a growing awareness of that new trend among C++ programmers. I used to be the odd guy talking about Haskell and monads at C++ meetings and conferences. This is no longer so. There was a sea change at this year’s C++Now. The cool kids were all talking about functional programming, and the presentation “Functional Data Structures in C++” earned me the most inspiring session award. I take it as a sign that the C++ community is ready for a big change.


Lazy evaluation can be a powerful tool for structuring your code. For instance, it can let you turn your code inside out, inverting the flow of control. Many a Haskell program takes advantage of laziness to express algorithms in clear, succinct terms, turning them from recipes into declarations.

The question for today’s blog post is: How can we tap the power of lazy evaluation in an inherently eager language like C++? I’ll lead you through a simple coding example and gradually introduce the building blocks of lazy programming: the suspension, the lazy stream, and a whole slew of functional algorithms that let you operate on them. In the process we’ll discover some fundamental functional patterns like functors, monads, and monoids. I have discussed them already in my post about C++ futures. It’s very edifying to see them emerge in a completely different context.

The Problem

Let’s write a program that prints the first n Pythagorean triples. A Pythagorean triple consists of three integers, x, y, and z, that satisfy the relation x² + y² = z². Let’s not be fancy and just go with the brute force approach. Here’s the program in C:

void printNTriples(int n)
{
    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("%d, %d, %d\n", x, y, z);
                    if (++i == n)
                        return;
                }
}

Here, a single C function serves three distinct purposes: It

  1. Generates Pythagorean triples,
  2. Prints them,
  3. Counts them; and when the count reaches n, breaks.

This is fine, as long as you don’t have to modify or reuse this code. But what if, for instance, instead of printing, you wanted to draw the triples as triangles? Or if you wanted to stop as soon as one of the numbers reached 100? The problem with this code is that it’s structured inside out: both the termination test and the sink for data are embedded in the innermost loop of the algorithm. A more natural and flexible approach would be to:

  1. Generate the list of Pythagorean triples,
  2. Take the first ten of them, and
  3. Print them.

And that’s exactly how you’d write this program in Haskell:

main = print (take 10 triples)

triples = [(x, y, z) | z <- [1..]
                     , x <- [1..z]
                     , y <- [x..z]
                     , x^2 + y^2 == z^2]

This program reads: take 10 triples and print them. It declares triples as a list (square brackets mean a list) of triples (x, y, z), where (the vertical bar reads “where”) z is an element of the list of integers from 1 to infinity, x is from 1 to z, y is from x to z, and the sum of squares of x and y is equal to the square of z. This notation is called “list comprehension” and is characteristic of Haskell’s terseness.

You see the difference? Haskell lets you abstract the notion of the list of Pythagorean triples so you can operate on it as one entity, whereas in C (or, for that matter, in C++) we were not able to disentangle the different, orthogonal aspects of the program.

The key advantage of Haskell in this case is its ability to deal with infinite lists. And this ability comes from Haskell’s inherent laziness. Things are never evaluated in Haskell until they are absolutely needed. In the program above, it was the call to print that forced Haskell to actually do some work: take 10 elements from the list of triples. Since the triples weren’t there yet, it had to calculate them, but only as many as were requested, and not one more.

Suspension

We’ll start with the most basic building block of laziness: a suspended function. Here’s the first naive attempt:

template<class T>
class Susp {
public:
    explicit Susp(std::function<T()> f)
        : _f(f)
    {}
    T get() { return _f(); }
private:
    std::function<T()> _f;
};

We often create suspensions using lambda functions, as in:

int x = 2;
int y = 3;
Susp<int> sum([x, y]() { return x + y; });
...
int z = sum.get();

Notice that the suspended lambda may capture variables from its environment: here x and y. A lambda, and therefore a suspension, is a closure.

The trouble with this implementation is that the function is re-executed every time we call get. There are several problems with that: If the function is not pure, we may get different values each time; if the function has side effects, these may happen multiple times; and if the function is expensive, the performance will suffer. All these problems may be addressed by memoizing the value.

Here’s the idea: The first time the client calls get we should execute the function and store the returned value in a member variable. Subsequent calls should go directly to that variable. We could implement this by setting a Boolean flag on the first call and then checking it on every subsequent call, but there’s a better implementation that uses thunks.
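
For comparison, here is a minimal sketch of that flag-based variant (illustrative only, and just as thread-unsafe as the rest of this section), before we move on to the thunk-based one:

#include <functional>

template<class T>
class SuspFlag {
public:
    explicit SuspFlag(std::function<T()> f)
        : _f(f), _evaluated(false), _memo(T())
    {}
    T const & get() {
        if (!_evaluated) {      // the flag is tested on every call
            _memo = _f();
            _evaluated = true;
        }
        return _memo;
    }
private:
    std::function<T()> _f;
    bool _evaluated;
    T _memo;
};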

A thunk is a pointer to a free function taking a suspension (the this pointer) and returning a value (by const reference). The get method simply calls this thunk, passing it the this pointer.

Initially, the thunk is set to thunkForce, which calls the method setMemo. This method evaluates the function, stores the result in _memo, switches the thunk to thunkGet, and returns the memoized value. On subsequent calls, get goes through thunkGet, which simply returns the memoized value.

template<class T>
class Susp
{
    // thunk
    static T const & thunkForce(Susp * susp) {
        return susp->setMemo();
    }
    // thunk
    static T const & thunkGet(Susp * susp) {
        return susp->getMemo();
    }
    T const & getMemo() {
        return _memo;
    }
    T const & setMemo() {
        _memo = _f();
        _thunk = &thunkGet;
        return getMemo();
    }
public:
    explicit Susp(std::function<T()> f)
        : _f(f), _thunk(&thunkForce), _memo(T())
    {}
    T const & get() {
        return _thunk(this);
    }
private:
    T const & (*_thunk)(Susp *);
    mutable T   _memo;

    std::function<T()> _f;
};

(By the way, the function pointer declaration of _thunk looks pretty scary in C++, doesn’t it?)

[Edit: I decided to remove the discussion of the thread safe implementation since it wasn’t ready for publication. The current implementation is not thread safe.]

You can find a lot more detail about the Haskell implementation of suspended functions in the paper by Tim Harris, Simon Marlow, and Simon Peyton Jones, Haskell on a Shared-Memory Multiprocessor.

Lazy Stream

The loop we used to produce Pythagorean triples in C worked on the push principle — data was pushed towards the sink. If we want to deal with infinite lists, we have to use the pull principle. It should be up to the client to control the flow of data. That’s the inversion of control I was talking about in the introduction.

We’ll use a lazy list and call it a stream. In C++ a similar idea is sometimes expressed in terms of input and forward iterators, although it is understood that an iterator itself is not the source or the owner of data — just an interface to one. So we’ll stick with the idea of a stream.

We’ll implement the stream in the functional style as a persistent data structure fashioned after persistent lists (see my series of blog posts on persistent data structures). It means that a stream, once constructed, is never modified. To “advance” the stream, we’ll have to create a new one by calling the const method pop_front.

Let’s start with the definition: A stream is either empty or it contains a suspended cell. This immediately suggests the implementation as a (possibly null) pointer to a cell. Since the whole stream is immutable, the cell will be immutable too, so it’s perfectly safe to share it between copies of the stream. We can therefore use a shared pointer:

template<class T>
class Stream
{
private:
    std::shared_ptr <Susp<Cell<T>>> _lazyCell;
};

Of course, because of reference counting and memoization, the stream is only conceptually immutable and, in the current implementation, not thread safe.

So what’s in the Cell? Remember, we want to be able to generate infinite sequences, so Stream must contain the DNA for not only producing the value of type T but also for producing the offspring — another (lazy) Stream of values. The Cell is just that: A value and a stream.

template<class T>
class Cell
{
public:
    Cell() {} // need default constructor for memoization
    Cell(T v, Stream<T> const & tail)
        : _v(v), _tail(tail)
    {}
    explicit Cell(T v) : _v(v) {}
    T val() const {
        return _v;
    }
    Stream<T> pop_front() const {
        return _tail;
    }
private:
    T _v;
    Stream<T> _tail;
};

This mutually recursive pair of data structures works together amazingly well.

template<class T>
class Stream
{
private:
    std::shared_ptr <Susp<Cell<T>>> _lazyCell;
public:
    Stream() {}
    Stream(std::function<Cell<T>()> f)
        : _lazyCell(std::make_shared<Susp<Cell<T>>>(f))
    {}
    Stream(Stream && stm)
        : _lazyCell(std::move(stm._lazyCell))
    {}
    Stream & operator=(Stream && stm)
    {
        _lazyCell = std::move(stm._lazyCell);
        return *this;
    }
    bool isEmpty() const
    {
        return !_lazyCell;
    }
    T get() const
    {
        return _lazyCell->get().val();
    }
    Stream<T> pop_front() const
    {
        return _lazyCell->get().pop_front();
    }
};

There are several things worth pointing out. The first two constructors follow our formal definition of the Stream: one constructs an empty stream, the other constructs a suspended Cell. A suspension is created from a function returning a Cell.

I also added a move constructor and a move assignment operator for efficiency. We’ll see it used in the implementation of forEach.

The magic happens when we call get for the first time. That’s when the suspended Cell comes to life. The value and the new stream are produced and memoized for later use. Or, this may happen if the first call is to pop_front. Notice that pop_front is a const method — the Stream itself is immutable. The method returns a new stream that encapsulates the rest of the sequence.

Let’s get our feet wet by constructing a stream of integers from n to infinity. The constructor of a Stream takes a function that returns a Cell. We’ll use a lambda that captures the value of n. It creates a Cell with that value and a tail, which it obtains by calling intsFrom with n+1:

Stream<int> intsFrom(int n)
{
    return Stream<int>([n]()
    {
        return Cell<int>(n, intsFrom(n + 1)); 
    });
}

It’s a recursive definition, but without the usual recursive function calls that eat up the stack. The call to the inner intsFrom is not made from the outer intsFrom. Instead it’s made the first time get is called on the emerging Stream.

Of course, we can also create finite streams, like this one, which produces integers from n to m:

Stream<int> ints(int n, int m)
{
    if (n > m)
        return Stream<int>();
    return Stream<int>([n, m]()
    {
        return Cell<int>(n, ints(n + 1, m));
    });
}

The trick is to capture the limit m as well as the recursion variable n. When the limit is reached, we simply return an empty Stream.

We’ll also need the method take, which creates a Stream containing the first n elements of the original stream:

Stream take(int n) const {
    if (n == 0 || isEmpty())
        return Stream();
    auto cell = _lazyCell;
    return Stream([cell, n]()
    {
        auto v = cell->get().val();
        auto t = cell->get().pop_front();
        return Cell<T>(v, t.take(n - 1));
    });
}

Here we capture the suspended cell and use it to lazily generate the elements of the new, truncated, Stream. Again, the key to understanding why this works is to keep in mind that Streams and Cells are conceptually immutable, and therefore can be shared by the implementation. This has some interesting side effects, which don’t influence the results, but change the performance. For instance, if the caller of take forces the evaluation of the first n elements — e.g., by passing them through the consuming forEach below — these elements will appear miraculously memoized in the original Stream.

Finally, we’ll need some way to iterate through streams. Here’s an implementation of forEach that consumes the stream while enumerating it and feeding its elements to a function.

template<class T, class F>
void forEach(Stream<T> strm, F f)
{
    while (!strm.isEmpty())
    {
        f(strm.get());
        strm = strm.pop_front();
    }
}

It’s the assignment:

strm = strm.pop_front();

which consumes the stream by decreasing the reference count of the head of the Stream. In particular, if you pass an rvalue Stream to forEach, its elements will be generated and deleted in lockstep. The algorithm will use constant memory, independent of the virtual length of the Stream. What Haskell accomplishes with garbage collection, we approximate in C++ with reference counting and shared_ptr.

Working with Streams

It’s not immediately obvious how to translate our Pythagorean triple program from nested loops to lazy streams, so we’ll have to take inspiration from the corresponding Haskell program. Let me first rewrite it using a slightly different notation:

triples = do
    z <- [1..]
    x <- [1..z]
    y <- [x..z]
    guard (x^2 + y^2 == z^2)
    return (x, y, z)

The general idea is this: Start with the stream of integers from 1 to infinity. For every such integer — call it z — create a stream from 1 to z. For each of those — call them x — create a stream from x to z. Filter out those which don’t satisfy the Pythagorean constraint. Finally, output a stream of tuples (x, y, z).

So far we’ve learned how to create a stream of integers — we have the function intsFrom. But now we’ll have to do something for each of these integers. We can’t just enumerate those integers and apply a function to each, because that would take an eternity. So we need a way to act on each element of a stream lazily.

In functional programming this is called mapping a function over a list. In general, a parameterized data structure that can be mapped over is called a functor. I’m going to show you that our Stream is a functor.

Stream as a Functor

The idea is simple: we want to apply a function to each element of a stream to get a new transformed stream (it’s very similar to the std::transform algorithm from STL). The catch is: We want to do it generically and lazily.

To make the algorithm — we’ll call it fmap — generic, we have to parameterize it over types. The algorithm starts with a Stream of elements of type T and a function from T to some other type U. The result should be a stream of U.

We don’t want to make U the template argument, because then the client would have to specify it explicitly. We want the compiler to deduce this type from the type of the function. We want, therefore, the function type F to be the parameter of our template (this will also allow us to call it uniformly with function pointers, function objects, and lambdas):

template<class T, class F>
auto fmap(Stream<T> stm, F f)

Without the use of concepts, we have no way of enforcing, or even specifying, that F be a type of a function from T to U. The best we can do is to statically assert it inside the function:

static_assert(std::is_convertible<F, std::function<U(T)>>::value,
        "fmap requires a function type U(T)");

But what is U? We can get at it using decltype:

decltype(f(stm.get()));

Notice that decltype takes, as an argument, an expression that can be statically typed. Here, the expression is a function call of f. We also need a dummy argument for this function: we use the result of stm.get(). The argument to decltype is never evaluated, but it is type-checked at compile time.

One final problem is how to specify the return type of fmap. It’s supposed to be Stream<U>, but we don’t know U until we apply decltype to the arguments of fmap. We have to use the new auto function declaration syntax of C++11. So here are all the type-related preliminaries:

template<class T, class F>
auto fmap(Stream<T> stm, F f)->Stream<decltype(f(stm.get()))>
{
    using U = decltype(f(stm.get()));
    static_assert(std::is_convertible<F, std::function<U(T)>>::value,
        "fmap requires a function type U(T)");
    ...
}

Compared to that, the actual implementation of fmap seems rather straightforward:

    if (stm.isEmpty()) return Stream<U>();
    return Stream<U>([stm, f]()
    {
        return Cell<U>(f(stm.get()), fmap(stm.pop_front(), f));
    });

In words: If the stream is empty, we’re done — return an empty stream. Otherwise, create a new stream by suspending a lambda function. That function captures the original stream (by value) and the function f, and returns a Cell. That cell contains the value of f acting on the first element of the original stream, and a tail. The tail is created with fmap acting on the rest of the original stream.

Equipped with fmap, we can now attempt to take the first step towards generating our triples: apply the function ints(1, z) to each element of the stream intsFrom(1):

fmap(intsFrom(1), [](int z)
{
    return ints(1, z);
});

The result is a Stream of Streams of integers of the shape:

1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
...

But now we are stuck. We’d like to apply ints(x, z) to each element of that sequence, but we don’t know how to get through two levels of Stream. Our fmap can only get through one layer. We need a way to flatten a Stream of Streams. That functionality is part of what functional programmers call a monad. So let me show you that Stream is indeed a monad.

Stream as a Monad

If you think of a Stream as a list, the flattening of a list of lists is just concatenation. Suppose for a moment that we know how to lazily concatenate two Streams (we’ll get to it later) and let’s implement a function mjoin that concatenates a whole Stream of Streams.

You might have noticed a pattern in the implementation of lazy functions on streams. We use some kind of recursion, which starts with “Are we done yet?” If not, we do an operation that involves one element of the stream and the result of a recursive call to the function itself.

The “Are we done yet?” question usually involves testing for an empty stream. But here we are dealing with a Stream of Streams, so we have to test two levels deep. This way we’ll ensure that the concatenation of a Stream of empty Streams immediately returns an empty Stream.

The recursive step in mjoin creates a Cell whose element is the head of the first stream, and whose tail is the concatenation of the tail of the first stream and the result of mjoin of the rest of the streams:

template<class T>
Stream<T> mjoin(Stream<Stream<T>> stm)
{
    while (!stm.isEmpty() && stm.get().isEmpty())
    {
        stm = stm.pop_front();
    }
    if (stm.isEmpty()) return Stream<T>();
    return Stream<T>([stm]()
    {
        Stream<T> hd = stm.get();
        return Cell<T>( hd.get()
                      , concat(hd.pop_front(), mjoin(stm.pop_front())));
    });
}

The combination of fmap and mjoin lets us compose functions like intsFrom or ints that return Streams. In fact, this combination is so common that it deserves its own function, which we’ll call mbind:

template<class T, class F>
auto mbind(Stream<T> stm, F f) -> decltype(f(stm.get()))
{
    return mjoin(fmap(stm, f));
}

If we use mbind in place of fmap:

mbind(intsFrom(1), [](int z)
{
    return ints(1, z);
});

we can produce a flattened list:

1 1 2 1 2 3 1 2 3 4...

But it’s not just the list: Each element of the list comes with variables that are defined in its environment — here the variable z. We can keep chaining calls to mbind and capture more variables in the process:

mbind(intsFrom(1), [](int z)
{
    return mbind(ints(1, z), [z](int x)
    {
        return mbind(ints(x, z), [x, z](int y)
        {
            ...
        });
    });
});

At this point we have captured the triples x, y, z, and are ready for the Pythagorean testing. But before we do it, let’s define two additional functions that we’ll use later.

The first one is mthen, which is a version of mbind that takes a function of no arguments. The idea is that such a function will be executed for each element of the stream, but it won’t use the value of that element. The important thing is that the function will not be executed when the input stream is empty. In that case, mthen will return an empty stream.

We implement mthen using a slightly modified version of fmap that takes a function f of no arguments:

template<class T, class F>
auto fmapv(Stream<T> stm, F f)->Stream<decltype(f())>
{
    using U = decltype(f());
    static_assert(std::is_convertible<F, std::function<U()>>::value,
        "fmapv requires a function type U()");

    if (stm.isEmpty()) return Stream<U>();
    return Stream<U>([stm, f]()
    {
        return Cell<U>(f(), fmapv(stm.pop_front(), f));
    });
}

We plug it into the definition of mthen the same way fmap was used in mbind:

template<class T, class F>
auto mthen(Stream<T> stm, F f) -> decltype(f())
{
    return mjoin(fmapv(stm, f));
}

The second useful function is mreturn, which simply turns a value of any type into a one-element Stream:

template<class T>
Stream<T> mreturn(T v)
{
    return Stream<T>([v]() {
        return Cell<T>(v);
    });
}

We’ll need mreturn to turn our triples into Streams.

It so happens that a parameterized type equipped with mbind and mreturn is called a monad (it must also satisfy some additional monadic laws, which I won’t talk about here). Our lazy Stream is indeed a monad.

Stream as a Monoid and a Monad Plus

When implementing mjoin we used the function concat to lazily concatenate two Streams. Its implementation follows the same recursive pattern we’ve seen so many times:

template<class T>
Stream<T> concat( Stream<T> lft
                , Stream<T> rgt)
{
    if (lft.isEmpty())
        return rgt;
    return Stream<T>([=]()
    {
        return Cell<T>(lft.get(), concat<T>(lft.pop_front(), rgt));
    });
}

What’s interesting is that the concatenation of streams puts them under yet another well known functional pattern: a monoid. A monoid is equipped with a binary operation, just like concat, which must be associative and possess a unit element. It’s easy to convince yourself that concatenation of Streams is indeed associative, and that the neutral element is an empty Stream. Concatenating an empty Stream, whether in front or in the back of any other Stream, doesn’t change the original Stream.

What’s even more interesting is that being a combination of a monoid and a monad makes Stream into a monad plus, and every monad plus defines a guard function — exactly what we need for the filtering of our triples. This function takes a Boolean argument and outputs a Stream. If the Boolean is false, the Stream is empty (the unit element of monad plus!), otherwise it’s a singleton Stream. We really don’t care what value sits in this Stream — we never use the result of guard for anything but the flow of control. In Haskell, there is a special “unit” value () — here I use a nullptr as its closest C++ analog.

Stream<void*> guard(bool b)
{
    if (b) return mreturn<void*>(nullptr);
    else return Stream<void*>();
}

We can now pipe the result of guard into mthen, which will ignore the content of the Stream but won’t fire when the Stream is empty. When the Stream is not empty, we will call mreturn to output a singleton Stream with the result tuple:

Stream<std::tuple<int, int, int>> triples()
{
    return mbind(intsFrom(1), [](int z)
    {
        return mbind(ints(1, z), [z](int x)
        {
            return mbind(ints(x, z), [x, z](int y)
            {
                return mthen(guard(x*x + y*y == z*z), [x, y, z]()
                {
                    return mreturn(std::make_tuple(x, y, z));
                });
            });
        });
    });
}

These singletons will then be concatenated by the three levels of mbind to create one continuous lazy Stream of Pythagorean triples.

Compare this function with its Haskell counterpart:

triples = do
    z <- [1..]
    x <- [1..z]
    y <- [x..z]
    guard (x^2 + y^2 == z^2)
    return (x, y, z)

Now, the client can take 10 of those triples from the Stream — and the triples still won’t be evaluated! It’s the consuming forEach that finally forces the evaluation:

void test()
{
    auto strm = triples().take(10);
    forEach(std::move(strm), [](std::tuple<int, int, int> const & t)
    {
        std::cout << std::get<0>(t) << ", " 
                  << std::get<1>(t) << ", " 
                  << std::get<2>(t) << std::endl;
    });
}

Conclusion

The generation of Pythagorean triples is a toy example, but it shows how lazy evaluation can be used to restructure code in order to make it more reusable. You can use the same function triples to print the values in one part of your program and draw triangles in another. You can filter the triples or impose different termination conditions. You can use the same trick to generate an infinite set of approximations to the solution of a numerical problem, and then use different strategies to truncate it. Or you can create an infinite set of animation frames, and so on.

The building blocks of laziness are also reusable. I have used them to implement the solution to the eight-queens problem and a conference scheduling program. Once they are made thread safe, the combinators that bind them are thread safe too. This is, in general, the property of persistent data structures.

You might be concerned about the performance of lazy data structures, and rightly so. They use the heap heavily, so memory allocation and deallocation can become a serious performance bottleneck. There are many situations, though, where code structure, reusability, maintenance, and correctness (especially in multithreaded code) are more important than performance. And there are some problems that might be extremely hard to express without the additional flexibility gained from laziness.

I made the sources to all code in this post available on GitHub.


A heap is a great data structure for merging and sorting data. It’s implemented as a tree with the special heap property: a parent node always compares less than or equal to its children, according to some comparison operator. In particular, the top element of the heap is always its smallest element. To guarantee quick retrieval and insertion, the tree doesn’t have to be perfectly balanced. A leftist heap, for instance, is lopsided, with left branches always at least as large as right branches.

The invariant of the leftist heap is expressed in terms of its right spines. The right spine of a tree is its rightmost path. Its length is called the rank of the tree. In a leftist heap the rank of the right child is always less than or equal to the rank of the left child — the tree is leaning left. Because of that, the rank can grow at most logarithmically with the number of elements.

[Figure: Leftist heap with ranks and spines. Ranks take into account empty leaf nodes, not shown.]

You can always merge two heaps by merging their right spines because they are just sorted linked lists. Since the right spines are at most logarithmically long, the merge can be done in logarithmic time. Moreover, it’s always possible to rotate nodes in the merged path to move heavier branches to the left and thus restore the leftist property.

With merging thus figured out, deletion from the top and insertion are trivial. After removing the top, you just merge left and right children. When inserting a new element, you create a singleton heap and merge it with the rest.

Implementation

The implementation of the functional leftist heap follows the same pattern we’ve seen before. We start with the definition:

A heap can either be empty or consist of a rank, a value, and two children: left and right heaps.

Let’s start with the definition of a non-empty heap as a private structure inside the Heap class:

template<class T>
class Heap
{
private:
    struct Tree
    {
        Tree(T v) : _rank(1), _v(v) {}
        Tree(int rank
            , T v
            , std::shared_ptr<const Tree> const & left
            , std::shared_ptr<const Tree> const & right)
        : _rank(rank), _v(v), _left(left), _right(right)
        {}

        int _rank;
        T   _v;
        std::shared_ptr<const Tree> _left;
        std::shared_ptr<const Tree> _right;
    };
    std::shared_ptr<const Tree> _tree;
    ...
};

Heap data is just a shared_ptr<const Tree>. An empty shared_ptr encodes an empty heap; otherwise it points to a non-empty Tree.

We’ll make the constructor of a non-empty heap private, because not all combinations of its arguments create a valid heap — see the two assertions:

Heap(T x, Heap const & a, Heap const & b)
{
    assert(a.isEmpty() || x <= a.front());
    assert(b.isEmpty() || x <= b.front());
    // rank is the length of the right spine
    if (a.rank() >= b.rank())
        _tree = std::make_shared<const Tree>(
                b.rank() + 1, x, a._tree, b._tree);
    else
        _tree = std::make_shared<const Tree>(
                a.rank() + 1, x, b._tree, a._tree);
}

We’ll make sure these assertions are true whenever we call this constructor from inside Heap code. This constructor guarantees that, as long as the two arguments are leftist heaps, the result is also a leftist heap. It also calculates the rank of the resulting heap by adding one to the rank of its right, shorter, branch. We’ll set the rank of an empty heap to zero (see implementation of rank).

As always with functional data structures, it’s important to point out that the construction takes constant time because the two subtrees are shared rather than copied. The sharing is thread-safe because, once constructed, the heaps are always immutable.

The clients of the heap will need an empty heap constructor:

Heap() {}

A singleton constructor might come in handy too:

explicit Heap(T x) : _tree(std::make_shared<const Tree>(x)) {}

They will need a few accessors as well:

bool isEmpty() const { return !_tree; }
int rank() const {
    if (isEmpty()) return 0;
    else return _tree->_rank;
}

The top, smallest, element is accessed using front:

T front() const { return _tree->_v; }

As I explained, the removal of the top element is implemented by merging left and right children:

Heap pop_front() const {
    return merge(left(), right()); 
}

Again, this is a functional data structure, so we don’t mutate the original heap, we just return the new heap with the top removed. Because of the sharing, this is a cheap operation.

The insertion is also done using merging. We merge the original heap with a singleton heap:

Heap insert(T x) const {
    return merge(Heap(x), *this);
}

The workhorse of the heap is the recursive merge algorithm below:

static Heap merge(Heap const & h1, Heap const & h2)
{
    if (h1.isEmpty())
        return h2;
    if (h2.isEmpty())
        return h1;
    if (h1.front() <= h2.front())
        return Heap(h1.front(), h1.left(), merge(h1.right(), h2));
    else
        return Heap(h2.front(), h2.left(), merge(h1, h2.right()));
}

If neither heap is empty, we compare the top elements. We create a new heap with the smaller element at the top. Now we have to do something with the two children of the smaller element and the other heap. First we merge the right child with the other heap. This is the step I mentioned before: the merge follows the right spines of the heaps, guaranteeing logarithmic time. The left child is then combined with the result of the merge. Notice that the Heap constructor will automatically rotate the higher-rank tree to the left, thus keeping the leftist property. The code is surprisingly simple.

You might wonder how come we are not worried about the trees degenerating — turning into (left leaning) linked lists. Consider, however, that such a linked list, because of the heap property, would always be sorted. So the retrieval of the smallest element would still be very fast and require no restructuring. Insertion of an element smaller than the existing top would just prepend it to the list — a very cheap operation. Finally, the insertion of a larger element would turn this element into a length-one right spine — the right child of the top of the linked list. The degenerate case is actually our best case.

Turning an unsorted list of elements into a heap could naively be done in O(N*log(N)) time by inserting the elements one by one. But there is a better divide-and-conquer algorithm that does it in O(N) time (the proof that it’s O(N) is non-trivial though):

template<class Iter>
static Heap heapify(Iter b, Iter e)
{
    if (b == e)
        return Heap();
    if (e - b == 1)
        return Heap(*b);
    else
    {
        Iter mid = b + (e - b) / 2;
        return merge(heapify(b, mid), heapify(mid, e));
    }
}

This function is at the core of heap sort: you heapify a list and then extract elements from the top one by one. Since the extraction takes O(log(N)) time, you end up with a sort algorithm with the worst case performance O(N*log(N)). On average, heapsort is slower than quicksort, but quicksort’s worst case performance is O(N²), which might be a problem in some scenarios.
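
Here’s a hedged sketch of that heap sort, assuming heapify, isEmpty, front, and pop_front are all public members of the Heap shown in this post:

#include <vector>

template<class T>
std::vector<T> heapSort(std::vector<T> const & input)
{
    std::vector<T> sorted;
    sorted.reserve(input.size());
    Heap<T> heap = Heap<T>::heapify(input.begin(), input.end());  // O(N)
    while (!heap.isEmpty())
    {
        sorted.push_back(heap.front());   // smallest remaining element
        heap = heap.pop_front();          // O(log N); the old heap is unchanged
    }
    return sorted;
}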


In my previous blog posts I described C++ implementations of two basic functional data structures: a persistent list and a persistent red-black tree. I made an argument that persistent data structures are good for concurrency because of their immutability. In this post I will explain in much more detail the role of immutability in concurrent programming and argue that functional data structures make immutability scalable and composable.

Concurrency in 5 Minutes

To understand the role of functional data structures in concurrent programming we first have to understand concurrent programming. Okay, so maybe one blog post is not enough, but I’ll try my best at mercilessly slashing through the complexities and intricacies of concurrency while brutally ignoring all the details and subtleties.

The archetype for all concurrency is message passing. Without some form of message passing you have no communication between processes, threads, tasks, or whatever your units of execution are. The two parts of “message passing” loosely correspond to data (message) and action (passing). So there is the fiddling with data by one thread, some kind of handover between threads, and then the fiddling with data by another thread. The handover process requires synchronization.

There are two fundamental problems with this picture: Fiddling without proper synchronization leads to data races, and too much synchronization leads to deadlocks.

Communicating Processes

Let’s start with a simpler world and assume that our concurrent participants share no memory — in that case they are called processes. And indeed it might be physically impossible to share memory between isolated units because of distances or hardware protection. In that case messages are just pieces of data that are magically transported between processes. You just put them (serialize, marshal) in a special buffer and tell the system to transmit them to someone else, who then picks them up from the system.

So the problem reduces to the proper synchronization protocols. The theory behind such systems is the good old CSP (Communicating Sequential Processes) from the 1970s. It has subsequently been extended to the Actor Model and has been very successful in Erlang. There are no data races in Erlang because of the isolation of processes, and no traditional deadlocks because there are no locks (although you can have distributed deadlocks when processes are blocked on receiving messages from each other).

The fact that Erlang’s concurrency is process-based doesn’t mean that it’s heavy-weight. The Erlang runtime is quite able to spawn thousands of light-weight user-level processes that, at the implementation level, may share the same address space. Isolation is enforced by the language rather than by the operating system. Banning direct sharing of memory is the key to Erlang’s success as the language for concurrent programming.

So why don’t we stop right there? Because shared memory is so much faster. It’s not a big deal if your messages are integers, but imagine passing a video buffer from one process to another. If you share the same address space (that is, you are passing data between threads rather than processes) why not just pass a pointer to it?

Shared Memory

Shared memory is like a canvas where threads collaborate in painting images, except that they stand on the opposite sides of the canvas and use guns rather than brushes. The only way they can avoid killing each other is if they shout “duck!” before opening fire. This is why I like to think of shared-memory concurrency as the extension of message passing. Even though the “message” is not physically moved, the right to access it must be passed between threads. The message itself can be of arbitrary complexity: it could be a single word of memory or a hash table with thousands of entries.

It’s very important to realize that this transfer of access rights is necessary at every level, starting with a simple write into a single memory location. The writing thread has to send a message “I have written” and the reading thread has to acknowledge it: “I have read.” In standard portable C++ this message exchange might look something like this:

std::atomic<bool> x(false);
// thread one
x.store(true, std::memory_order_release);
// thread two
x.load(std::memory_order_acquire);

You rarely have to deal with such low level code because it’s abstracted into higher order libraries. You would, for instance, use locks for transferring access. A thread that acquires a lock gains unique access to a data structure that’s protected by it. It can freely modify it knowing that nobody else can see it. It’s the release of the lock variable that makes all those modifications visible to other threads. This release (e.g., mutex::unlock) is then matched with the subsequent acquire (e.g., mutex::lock) by another thread. In reality, the locking protocol is more complicated, but it is at its core based on the same principle as message passing, with unlock corresponding to a message send (or, more general, a broadcast), and lock to a message receive.
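
Here’s a hedged sketch of that protocol at the level of a mutex (illustrative names): the producer’s writes to the queue become visible to the consumer because the unlock in one thread synchronizes with the subsequent lock in the other; that pair plays the role of the message send and receive.

#include <mutex>
#include <queue>

std::mutex      g_mtx;
std::queue<int> g_queue;

void producer() {
    std::lock_guard<std::mutex> lock(g_mtx);
    g_queue.push(42);                            // modify under exclusive access
}                                                // unlock: the "send"

bool tryConsume(int & out) {
    std::lock_guard<std::mutex> lock(g_mtx);     // lock: the "receive"
    if (g_queue.empty()) return false;
    out = g_queue.front();
    g_queue.pop();
    return true;
}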

The point is, there is no sharing of memory without communication.

Immutable Data

The first rule of synchronization is:

The only time you don’t need synchronization is when the shared data is immutable.

We would like to use as much immutability in implementing concurrency as possible. It’s not only because code that doesn’t require synchronization is faster, but it’s also easier to write, maintain, and reason about. The only problem is that:

No object is born immutable.

Immutable objects never change, but all data, immutable or not, must be initialized before being read. And initialization means mutation. Static global data is initialized before entering main, so we don’t have to worry about it, but everything else goes through a construction phase.

First, we have to answer the question: At what point after initialization is data considered immutable?

Here’s what needs to happen: A thread has to somehow construct the data that is destined to be immutable. Depending on the structure of that data, this could be a very simple or a very complex process. Then the state of that data has to be frozen — no more changes are allowed. But still, before the data can be read by another thread, a synchronization event has to take place. Otherwise the other thread might see partially constructed data. This problem has been extensively discussed in articles about the singleton pattern, so I won’t go into more detail here.

One such synchronization event is the creation of the receiving thread. All data that had been frozen before the new thread was created is seen as immutable by that thread. That’s why it’s okay to pass immutable data as an argument to a thread function.

Another such event is message passing. It is always safe to pass a pointer to immutable data to another thread. The handover always involves the release/acquire protocol (as illustrated in the example above).

All memory writes that happened in the first thread before it released the message become visible to the acquiring thread after it received it.

The act of message passing establishes the “happens-before” relationship for all memory writes prior to it, and all memory reads after it. Again, these low-level details are rarely visible to the programmer, since they are hidden in libraries (channels, mailboxes, message queues, etc.). I’m pointing them out only because there is no protection in the language against the user inadvertently taking affairs into their own hands and messing things up. So creating an immutable object and passing a pointer to it to another thread through whatever message passing mechanism is safe. I also like to think of thread creation as a message passing event — the payload being the arguments to the thread function.
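
Putting those observations together, here is a minimal sketch (illustrative only): the table is fully constructed and frozen behind a pointer-to-const before the reader threads are created, and from then on any number of threads may read it without any synchronization.

#include <memory>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    auto table = std::make_shared<const std::vector<int>>(
        std::vector<int>{1, 1, 2, 3, 5, 8, 13});    // construct, then freeze

    long sums[2] = {0, 0};
    auto reader = [table, &sums](int i) {           // thread creation publishes the frozen data
        sums[i] = std::accumulate(table->begin(), table->end(), 0L);
    };
    std::thread t1(reader, 0);
    std::thread t2(reader, 1);
    t1.join();
    t2.join();
    // both sums are equal; no locks or atomics were needed for the reads of *table
}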

The beauty of this protocol is that, once the handover is done, the second (and the third, and the fourth, and so on…) thread can read the whole immutable data structure over and over again without any need for synchronization. The same is not true for shared mutable data structures! For such structures every read has to be synchronized at a non-trivial performance cost.

However, it can’t be stressed enough that this is just a protocol and any deviation from it may be fatal. There is no language mechanism in C++ that may enforce this protocol.

Clusters

As I argued before, access rights to shared memory have to be tightly controlled. The problem is that shared memory is not partitioned nicely into separate areas, each with its own army, police, and border controls. Even though we understand that an object is frozen after construction and ready to be examined by other threads without synchronization, we have to ask ourselves the question: Where exactly does this object begin and end in memory? And how do we know that nobody else claims writing privileges to any of its parts? After all, in C++ it’s pointers all the way. This is one of the biggest problems faced by imperative programmers trying to harness concurrency — who’s pointing where?

For instance, what does it mean to get access to an immutable linked list? Obviously, it’s not enough that the head of the list never changes, every single element of the list must be immutable as well. In fact, any memory that can be transitively accessed from the head of the list must be immutable. Only then can you safely forgo synchronization when accessing the list, as you would in a single-threaded program. This transitive closure of memory accessible starting from a given pointer is often called a cluster. So when you’re constructing an immutable object, you have to be able to freeze the whole cluster before you can pass it to other threads.

But that’s not all! You must also guarantee that there are no mutable pointers outside of the cluster pointing to any part of it. Such pointers could be inadvertently used to modify the data other threads believe to be immutable.
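Here’s a contrived sketch of the kind of leak to watch for (the names are made up for illustration): the cluster reachable from cfg looks frozen, yet a non-const pointer that escaped during construction can still mutate it behind the backs of other threads.

#include <memory>
#include <vector>

struct Config {
    std::vector<int> limits;
};

// Builds what looks like an immutable cluster, but leaks a mutable pointer.
std::shared_ptr<Config const> makeConfig(std::vector<int> *& leak) {
    auto cfg = std::make_shared<Config>();
    cfg->limits = { 1, 2, 3 };
    leak = &cfg->limits;            // BUG: a mutable pointer escapes the cluster
    return cfg;                     // callers see a "frozen" Config const
}

int main() {
    std::vector<int> * backdoor = nullptr;
    std::shared_ptr<Config const> cfg = makeConfig(backdoor);
    // Another thread could now be reading cfg without synchronization...
    backdoor->push_back(4);         // ...while this write silently breaks immutability
}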

That means the construction of an immutable object is a very delicate operation. You not only have to make sure you don’t leak any pointers, but you have to inspect every component you use in building your object for potential leaks — you either have to trust all your subcontractors or inspect their code under the microscope. This clearly is no way to build software! We need something that is scalable and composable. Enter…

Functional Data Structures

Functional data structures let you construct new immutable objects by composing existing immutable objects.

Remember, an immutable object is a complete cluster with no pointers sticking out of it, and no mutable pointers poking into it. A sum of such objects is still an immutable cluster. As long as the constructor of a functional data structure doesn’t violate the immutability of its arguments and does not leak mutable pointers to the memory it is allocating itself, the result will be another immutable object.

Of course, it would be nice if immutability were enforced by the type system, as it is in the D language. In C++ we have to replace the type system with discipline, but still, it helps to know exactly what the terms of the immutability contract are. For instance, make sure you pass only (const) references to other immutable objects to the constructor of an immutable object.

Let’s now review the example of the persistent binary tree from my previous post to see how it follows the principles I described above. In particular, let me show you that every Tree forms an immutable cluster, as long as user data is stored in it by value (or is likewise immutable).

The proof proceeds through structural induction, but it’s easy to understand. An empty tree forms an immutable cluster trivially. A non-empty tree is created by combining two other trees. We can assume by the inductive step that both of them form immutable clusters:

Tree(Tree const & lft, T val, Tree const & rgt)

In particular, there are no external mutating pointers to lft, rgt, or to any of their nodes.

Inside the constructor we allocate a fresh node and pass it the three arguments:

Tree(Tree const & lft, T val, Tree const & rgt)
      : _root(std::make_shared<const Node>(lft._root, val, rgt._root))
{}

Here _root is a private member of the Tree:

std::shared_ptr<const Node> _root;

and Node is a private struct defined inside Tree:

struct Node
{
   Node(std::shared_ptr<const Node> const & lft
       , T val
       , std::shared_ptr<const Node> const & rgt)
   : _lft(lft), _val(val), _rgt(rgt)
   {}

   std::shared_ptr<const Node> _lft;
   T _val;
   std::shared_ptr<const Node> _rgt;
};

Notice that the only reference to the newly allocated Node is stored in _root through a const pointer and is never leaked. Moreover, there are no methods of the tree that either modify or expose any part of the tree to modification. Therefore the newly constructed Tree forms an immutable cluster. (With the usual caveat that you don’t try to bypass the C++ type system or use other dirty tricks).

As I discussed before, there is some bookkeeping related to reference counting in C++, which is however totally transparent to the user of functional data structures.

Conclusion

Immutable data structures play an important role in concurrency, but there’s more to them than meets the eye. In this post I tried to demonstrate how to use them safely and productively. In particular, functional data structures provide a scalable and composable framework for working with immutable objects.

Of course not all problems of concurrency can be solved with immutability, and not all immutable objects can be easily created from other immutable objects. The classic example is a doubly-linked list: you can’t add a new element to it without modifying pointers in it. But there is a surprising variety of composable immutable data structures that can be used in C++ without breaking the rules. I will continue describing them in my future blog posts.


Persistent trees are more interesting than persistent lists, which were the topic of my previous blog. In this installment I will concentrate on binary search trees. Such trees store values that can be compared to each other (they support total ordering). They may be used to implement sets, multisets, or associative arrays. Here I will focus on the simplest of those, the set — the others are easy extensions of the same scheme.

A set must support insertion and membership testing (I’ll leave deletion as an exercise). These operations should be doable, on average, in logarithmic time, O(log(N)). Only balanced trees, however, can guarantee logarithmic time even in the worst case. A simple tree may sometimes degenerate to a singly-linked list, with performance dropping to O(N). I will start with a simple persistent tree and then proceed with a balanced red-black tree.

Persistent Binary Search Tree

As with lists, we will start with an abstract definition:

A tree is either empty or contains a left tree, a value, and a right tree.

This definition translates into a data structure with two constructors:

template<class T>
class Tree {
public:
    Tree(); // empty tree
    Tree(Tree const & lft, T val, Tree const & rgt); // non-empty tree
};

Just as we did with persistent lists, we’ll encode the empty/non-empty tree using null/non-null (shared) pointer to a node. A Node represents a non-empty tree:

   struct Node
   {
       Node(std::shared_ptr<const Node> const & lft
          , T val
          , std::shared_ptr<const Node> const & rgt)
       : _lft(lft), _val(val), _rgt(rgt)
       {}

       std::shared_ptr<const Node> _lft;
       T _val;
       std::shared_ptr<const Node> _rgt;
   };

Here’s the complete construction/deconstruction part of the tree. Notice how similar it is to the list from my previous post. All these methods are const O(1) time, as expected. As before, the trick is to construct a new object (Tree) from big immutable chunks (lft and rgt), which can be safely put inside shared pointers without the need for deep copying.

template<class T>
class Tree
{
    struct Node;
    explicit Tree(std::shared_ptr<const Node> const & node) 
    : _root(node) {} 
public:
    Tree() {}
    Tree(Tree const & lft, T val, Tree const & rgt)
      : _root(std::make_shared<const Node>(lft._root, val, rgt._root))
    {
        assert(lft.isEmpty() || lft.root() < val);
        assert(rgt.isEmpty() || val < rgt.root());       
    }
    bool isEmpty() const { return !_root; }
    T root() const {
        assert(!isEmpty());
        return _root->_val;
    }
    Tree left() const {
        assert(!isEmpty());
        return Tree(_root->_lft);
    }
    Tree right() const {
        assert(!isEmpty());
        return Tree(_root->_rgt);
    }
private:
    std::shared_ptr<const Node> _root;
};

Insert

The persistent nature of the tree manifests itself in the implementation of insert. Instead of modifying the existing tree, insert creates a new tree with the new element inserted in the right place. The implementation is recursive, so imagine that you are at a subtree of a larger tree. This subtree might be empty. Inserting an element into an empty tree means creating a single-node tree with the value being inserted, x, and two empty children.

On the other hand, if you’re not in an empty tree, you can retrieve the root value y and compare it with x. If x is less than y, it has to be inserted into the left child. If it’s greater, it must go into the right child. In both cases we make recursive calls to insert. If x is neither less nor greater than y, we assume it’s equal (that’s why we need total order) and ignore it. Remember, we are implementing a set, which does not store duplicates.

Tree insert(T x) const {
    if (isEmpty())
        return Tree(Tree(), x, Tree());
    T y = root();
    if (x < y)
        return Tree(left().insert(x), y, right());
    else if (y < x)
        return Tree(left(), y, right().insert(x));
    else
        return *this; // no duplicates
}

Now consider how many new nodes are created during an insertion. A new node is only created in the constructor of a tree (in the code: std::make_shared<const Node>(lft._root, val, rgt._root)). The left and right children are not copied, they are stored by reference. At every level of insert, a tree constructor is called at most once. So in the worst case, when we recurse all the way to the leaves of the tree, we only create h nodes, where h is the height of the tree. If the tree is not too much out of balance its height scales like a logarithm of the number of nodes. To give you some perspective, if you store a billion values in a tree, an insertion will cost you 30 copies on average. If you need a logarithmic bound on the worst case, you’d have to use balanced trees (see later).

If you study the algorithm more closely, you’ll notice that only the nodes that are on the path from the root to the point of insertion are modified.

Testing for membership in a persistent tree is no different than in a non-persistent one. Here’s the recursive algorithm:

bool member(T x) const {
    if (isEmpty())
        return false;
    T y = root();
    if (x < y)
        return left().member(x);
    else if (y < x)
        return right().member(x);
    else
        return true;
}
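To make persistence concrete, here’s a short usage sketch (assuming insert and member are methods of the Tree defined above): each insert returns a new version, the old versions remain valid, and all versions share the nodes off the insertion path.

Tree<int> t1;                    // empty
Tree<int> t2 = t1.insert(10);
Tree<int> t3 = t2.insert(5);     // copies only the path from the root down

// All three versions coexist and can be read independently:
assert(!t1.member(10));
assert( t2.member(10) && !t2.member(5));
assert( t3.member(10) &&  t3.member(5));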

When using C++11, you might take advantage of the initializer list constructor to initialize a tree in one big swoop like this:

Tree t{ 50, 40, 30, 10, 20, 30, 100, 0, 45, 55, 25, 15 };


Here’s the implementation of such a constructor, which works in O(N*log(N)) average time (notice that it effectively sorts the elements, and O(N*log(N)) is the expected asymptotic behavior for sorting):

Tree(std::initializer_list<T> init) {
    Tree t;
    for (T v: init) {
        t = t.insert(v);
    }
    _root = t._root;
}

Persistent Red-Black Tree

If you want to keep your tree reasonably balanced — that is, guarantee that its height is on the order of log(N) — you must do some rebalancing after inserts (or deletes). Care has to be taken to make sure that rebalancing doesn’t change the logarithmic behavior of those operations. The balance is often expressed using some invariants. You can’t just require that every path from root to leaf be of equal length, because that would constrain the number of elements to be one less than a power of two. So you must give it some slack.

In the case of a red-black tree, the invariants are formulated in terms of colors. Every node in the tree is marked as either red or black. These are the two invariants that have to be preserved by every operation:

  1. Red invariant: No red node can have a red child
  2. Black invariant: Every path from root to an empty leaf node must contain the same number of black nodes — the black height of the tree.

This way, if the shortest path in a tree is all black, the longest path could only be twice as long, containing one red node between each pair of black nodes. The height of such a tree could only vary between (all black) log(N) and (maximum red) 2*log(N).

With these constraints in mind, the re-balancing can be done in log(N) time by localizing the modifications to the nearest vicinity of the path from the root to the point of insertion or deletion.

Let’s start with basic definitions. The node of the tree will now store its color:

enum Color { R, B };

Otherwise, it’s the same as before:

    struct Node
    {
        Node(Color c,
            std::shared_ptr<const Node> const & lft,
            T val,
            std::shared_ptr<const Node> const & rgt)
            : _c(c), _lft(lft), _val(val), _rgt(rgt)
        {}
        Color _c;
        std::shared_ptr<const Node> _lft;
        T _val;
        std::shared_ptr<const Node> _rgt;
    };

An empty tree will be considered black by convention.
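The scaffolding of the RBTree class isn’t repeated in the post, so here is a minimal sketch of what the code below assumes, modeled directly on the plain Tree above (the member names follow their usage in ins and balance; the complete code is in the GitHub repository):

template<class T>
class RBTree
{
    struct Node;    // as defined above, with the Color field
    explicit RBTree(std::shared_ptr<const Node> const & node)
    : _root(node) {}
public:
    RBTree() {}     // empty tree
    RBTree(Color c, RBTree const & lft, T val, RBTree const & rgt)
    : _root(std::make_shared<const Node>(c, lft._root, val, rgt._root))
    {}
    bool isEmpty() const { return !_root; }
    T root() const {
        assert(!isEmpty());
        return _root->_val;
    }
    Color rootColor() const {
        // an empty tree counts as black by convention
        return isEmpty() ? B : _root->_c;
    }
    RBTree left() const {
        assert(!isEmpty());
        return RBTree(_root->_lft);
    }
    RBTree right() const {
        assert(!isEmpty());
        return RBTree(_root->_rgt);
    }
    // insert, ins, balance, doubledLeft, doubledRight, and paint follow below
private:
    std::shared_ptr<const Node> _root;
};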

The membership test ignores colors so we don’t have to re-implement it. In fact the search performance of a persistent RB Tree is exactly the same as that of an imperative RB Tree. You pay no penalty for persistence in search.

With insertion, you pay the penalty of having to copy the path from root to the insertion point, which doesn’t change its O(log(N)) asymptotic behavior. As I explained before, what you get in exchange is immutability of every copy of your data structure.

The Balancing

Let’s have a look at the previous version of insert and figure out how to modify it so the result preserves the RB Tree invariants.

Tree insert(T x) const {
    if (isEmpty())
        return Tree(Tree(), x, Tree());
    T y = root();
    if (x < y)
        return Tree(left().insert(x), y, right());
    else if (y < x)
        return Tree(left(), y, right().insert(x));
    else
        return *this; // no duplicates
}

Let’s first consider the most difficult scenario: the insertion into a maximum capacity tree for a given black height. Such a tree has alternating levels of all black and all red nodes. The only way to increase its capacity is to increase its black height. The cheapest way to add one more black level to all paths (thus preserving the black invariant) is to do it at the root (for instance, lengthening all the paths at the leaves would require O(N) red-to-black re-paintings).

So here’s the plan: We’ll insert a new node at the leaf level and make it red. This won’t break the black invariant, but may break the red invariant (if the parent node was red). We’ll then retrace our steps back to the root, percolating any red violation up. Then, at the top level, we’ll paint the resulting root black, thus killing two birds with one stone: If we ended up with a red violation at the top, this will fix it and, at the same time, increase the black height of the whole tree.

It’s important that during percolation we never break the black invariant.

So here’s how we execute this plan: insert will call the recursive insertion/re-balancing method ins, which might return a red-topped tree. We’ll paint that root black (if it’s already black, it won’t change anything) and return it to the caller:

RBTree insert(T x) const {
    RBTree t = ins(x);
    return RBTree(B, t.left(), t.root(), t.right());
}

In the implementation of ins, the first case deals with an empty tree. This situation happens when it’s the first insertion into an empty tree or when, during the recursive process, we’ve reached the insertion point at the bottom of the tree. We create a red node and return it to the caller:

if (isEmpty())
  return RBTree(R, RBTree(), x, RBTree());

Notice that, if this new node was inserted below another red node, we are creating a red violation. If that node was the root of the whole tree, insert will repaint it immediately. If it wasn’t, then when we pop one level up from the recursion, we’ll see that violation. We can’t fix it at that point — for that we’ll have to pop one more level, up to the black parent, where we have more nodes to work with.

Here are the details of ins: We’ll follow the same logic as in the non-balanced tree, thus preserving the ordering of values; but instead of reconstructing the result tree on the spot we’ll call a function balance, which will do that for us in a semi-balanced way (that is, with a possibility of a red violation, but only at the very top).

RBTree ins(T x) const
{
    if (isEmpty())
        return RBTree(R, RBTree(), x, RBTree());
    T y = root();
    Color c = rootColor();
    if (x < y)
        return balance(c, left().ins(x), y, right());
    else if (y < x)
        return balance(c, left(), y, right().ins(x));
    else
        return *this; // no duplicates
}

Just like the constructor of the red-black tree, balance takes the following arguments: color, left subtree, value, and right subtree. Depending on the result of the comparison, the new element is inserted either into the left or the right subtree.

As I explained, balance, and consequently ins, cannot fix the red violation when they are sitting on it. All they can do is to make sure that the violation is at the very top of the tree they return. So when we call balance with the result of ins, as in:

balance(c, left().ins(x), y, right())

or:

balance(c, left(), y, right().ins(x))

the left or the right subtree, respectively, may be semi-balanced. This is fine because balance can then rotate this violation away.

So the interesting cases for balance are the ones that rebuild a black node with either the left or the right subtree having a red violation at the top.

There are four possible cases depending on the position of the violation. In each case we can rearrange the nodes in such a way that the violation disappears and the ordering is preserved. In the pictures below I have numbered the nodes and subtrees according to the order of the values stored in them. Remember that all values in the left subtree are less than the value stored in the node, which in turn is less than all the values in the right subtree.

[Figure: rotation for lft.doubledLeft()]

[Figure: rotation for lft.doubledRight()]

[Figure: rotation for rgt.doubledLeft()]

[Figure: rotation for rgt.doubledRight()]

Each rotation creates a tree that preserves both invariants. Notice, however, that the result of the rotation is always red-tipped, even though we were rebuilding a node that was originally black. So if the parent of that node was red, our caller will produce a red violation (it will call balance with red color as its argument, which will fall through to the default case). This violation will be then dealt with at the parent’s parent level.

static RBTree balance(Color c
                    , RBTree const & lft
                    , T x
                    , RBTree const & rgt)
{
   if (c == B && lft.doubledLeft())
        return RBTree(R
                    , lft.left().paint(B)
                    , lft.root()
                    , RBTree(B, lft.right(), x, rgt));
    else if (c == B && lft.doubledRight())
        return RBTree(R
                    , RBTree(B, lft.left(), lft.root(), lft.right().left())
                    , lft.right().root()
                    , RBTree(B, lft.right().right(), x, rgt));
    else if (c == B && rgt.doubledLeft())
        return RBTree(R
                    , RBTree(B, lft, x, rgt.left().left())
                    , rgt.left().root()
                    , RBTree(B, rgt.left().right(), rgt.root(), rgt.right()));
    else if (c == B && rgt.doubledRight())
        return RBTree(R
                    , RBTree(B, lft, x, rgt.left())
                    , rgt.root()
                    , rgt.right().paint(B));
    else
        return RBTree(c, lft, x, rgt);
}

For completeness, here are the auxiliary methods used in the implementation of balance:

bool doubledLeft() const {
    return !isEmpty()
        && rootColor() == R
        && !left().isEmpty()
        && left().rootColor() == R;
}
bool doubledRight() const {
    return !isEmpty()
        && rootColor() == R
        && !right().isEmpty()
        && right().rootColor() == R;
}
RBTree paint(Color c) const {
    assert(!isEmpty());
    return RBTree(c, left(), root(), right());
}

Conclusion

Our implementation of the persistent red-black tree follows Chris Okasaki’s book. As Chris asserts, this is one of the fastest implementations there is, and he offers hints to make it even faster. Of course there are many imperative implementations of red-black trees, including STL’s std::set and std::map. Persistent RB-trees match their performance perfectly when it comes to searching. Insertion and deletion, which are O(log(N)) for either implementation, are slower by a constant factor because of the need to copy the path from root to leaf. On the other hand, the persistent implementation is thread-safe and synchronization-free (except for reference counting in shared_ptr — see the discussion in my previous blog).

Complete code is available at GitHub.

Acknowledgment

I’d like to thank Eric Niebler for reading the draft and telling me which of my explanations were more abstruse than usual.

Haskell Code

For comparison, here’s the original Haskell code. You can see that the C++ implementation preserves its structure pretty well. With proper optimization tricks (unboxing and eager evaluation) the Haskell code should perform as well as its C++ translation.

Regular (unbalanced) binary search tree:

data Tree a = Empty | Node (Tree a) a (Tree a)

member x Empty = False
member x (Node lft y rgt) =
    if x < y then member x lft
    else if y < x then member x rgt
    else True

insert x Empty = Node Empty x Empty
insert x t@(Node lft y rgt) =
    if x < y then Node (insert x lft) y rgt
    else if y < x then Node lft y (insert x rgt)
    else t

Balanced Red-Black tree:

data Color = R | B

data Tree a = Empty | Node Color (Tree a) a (Tree a)

member x Empty = False
member x (Node _ lft y rgt) =
    if x < y then member x lft
    else if y < x then member x rgt
    else True

insert x tree = Node B left val right
  where
      ins Empty = Node R Empty x Empty
      ins t@(Node c lft y rgt) =
          if (x < y) then balance c (ins lft) y rgt
          else if (y < x) then balance c lft y (ins rgt)
          else t
      Node _ left val right = ins tree -- pattern match result of ins


balance B (Node R (Node R a x b) y c) z d = 
    Node R (Node B a x b) y (Node B c z d)
balance B (Node R a x (Node R b y c)) z d = 
    Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R (Node R b y c) z d) = 
    Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R b y (Node R c z d)) = 
    Node R (Node B a x b) y (Node B c z d)
balance color a x b = Node color a x b

I’ve been looking for a good analogy of what programming in C++ feels like and I remembered this 1990 Tim Burton movie, Edward Scissorhands.

It’s a darker version of Pinocchio, shot in suburban settings. In this poster, the scary guy (Johnny Depp) is trying to gently hug Winona Ryder but his clumsy scissor-hands are making it very dangerous for both of them. His face is already covered with deep scars.

Having scissors for hands is not all that bad. Edward has many talents: he can, for instance, create stunning dog hairdos.

I often have these kinds of thoughts after attending C++ conferences: this time it was Going Native 2013. The previous year, the excitement was all about the shiny new C++11 Standard. This year it was more of a reality check. Don’t get me wrong — there were many stunning dog hairdos on display (I mean C++ code that was elegant and simple) but the bulk of the conference was about how to avoid mutilation and how to deliver first aid in case of accidental amputation.

Little shop of horrors

There was so much talk about how not to use C++ that it occurred to me that maybe this wasn’t the problem of incompetent programmers, but that straightforward C++ is plain wrong. So if you just learn the primitives of the language and try to use them, you’re doomed.

C++ has an excuse for that: backward compatibility — in particular compatibility with C. You might think of the C subset of C++ as a bona fide assembly language which you shouldn’t use in day-to-day programming, except that it’s right there on the surface. If you reach blindly into your C++ toolbox, you’re likely to come up with naked pointers, for loops, and all this ugly stuff.

A well known example of what not to do is to use malloc to dynamically allocate memory, and free to deallocate it. malloc takes a count of bytes and returns a void pointer, which you have to cast to something more usable — it would be hard to come up with a worse API for memory management. Here’s an example of really bad (but almost correct, if it weren’t for the possibility of null pointer dereference) code:

struct Pod {
    int count;
    int * counters;
};

int n = 10;
Pod * pod = (Pod *) malloc (sizeof(Pod));
pod->count = n;
pod->counters = (int *) malloc (n * sizeof(int));
...
free (pod->counters);
free (pod);

Hopefully, nobody writes code like this in C++, although I’m sure there are a lot of legacy apps with such constructs, so don’t laugh.

C++ “solved” the problem of redundant casting and error-prone size calculations by replacing malloc and free with new and delete. The corrected C++ version of the code above would be:

struct Pod {
    int count;
    int * counters;
};

int n = 10;
Pod * pod = new Pod;
pod->count = n;
pod->counters = new int [n];
...
delete [] pod->counters;
delete pod;

BTW, the null pointer dereference problem is solved too, because new will throw an exception when the system runs out of memory. There is still a slight chance of a memory leak if the second new fails (but how often does that happen? Hint: how big can n get?). So here’s the really correct version of the code:

class Snd { // Sophisticated New Data
public:
    Snd (int n) : _count(n), _counters(new int [n]) {}
    ~Snd () { delete [] _counters; }
private:
    int _count;
    int * _counters;
};

Snd * snd = new Snd (10);
...
delete snd;

Are we done yet? Of course not! The code is not exception safe.

The C++ lore is that you should avoid naked pointers, avoid arrays, avoid delete. So the remedy for the lameness of malloc is operator new, which is also broken because it returns a dangerous pointer and pointers are bad.

We all know (and have scars on our faces to prove it) that you should use the Standard Library containers and smart pointers whenever possible. Oh, and use value semantics for passing things around. No wait! Value semantics comes with a performance penalty because of excessive copying. So what about shared_ptr and vectors of shared_ptr? But that adds the overhead of reference counting! No, here’s a new idea: move semantics and rvalue references.
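For completeness, here’s a sketch of what the “no naked new, no naked pointers” version of the earlier example might look like (my code, not from any of the talks); std::make_unique only arrives in C++14, so the C++11 spelling is shown:

#include <memory>
#include <vector>

class Snd {    // same data, but the vector owns the array and knows its size
public:
    explicit Snd(int n) : _counters(n) {}
    // no destructor needed, no delete to forget
private:
    std::vector<int> _counters;
};

int main() {
    std::unique_ptr<Snd> snd(new Snd(10));   // or simply: Snd snd(10); on the stack
    // ... use snd ...
}   // everything is released automatically, even if an exception is thrown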

I can go on and on like this (and I often do!). Do you see the pattern? Every remedy breeds another remedy. It’s no longer just the C subset that should be avoided. Every new language feature or library addition comes with a new series of gotchas. And you know a new feature is badly designed if Scott Meyers has a talk about it. (His latest was about the pitfalls of, you guessed it, move semantics.)

The Philosophy of C++

Bjarne Stroustrup keeps stressing how important backward compatibility is for C++. It’s one of the pillars of the C++ philosophy. Considering how much legacy code there is, it makes perfect sense. Compatibility, though, takes a very heavy toll on the evolution of the language. If nature were as serious about backward compatibility as C++ is, humans would still have tails, gills, flippers, antennae, and external skeletons on top of internal ones — they all made sense at some point in the evolution.

C++ has become an extremely complex language. There are countless ways of doing the same thing — almost all of them either plain wrong, dangerous, unmaintainable, or all of the above. The problem is that most code compiles and even runs. The mistakes and shortcomings are discovered much later, often after the product has been released.

You might say: Well, that’s the nature of programming. If you think so, you should seriously look at Haskell. Your first reaction will be: I don’t know how to implement the first thing (other than the factorial and Fibonacci numbers) in this extremely restrictive language. This is totally different from the C++ experience, where you can start hacking from day one. What you don’t realize is that it will take you 10 years, if you’re lucky, to discover the “right way” of programming in C++ (if there even is such a thing). And guess what, the better a C++ programmer you are, the more functional your programs look. Ask any C++ guru and they will tell you: avoid mutation, avoid side effects, don’t use loops, avoid class hierarchies and inheritance. But you will need strict discipline and total control over your collaborators to pull that off, because C++ is so permissive.

Haskell is not permissive, it won’t let you — or your coworkers — write unsafe code. Yes, initially you’ll be scratching your head trying to implement something in Haskell that you could hack in C++ in 10 minutes. If you’re lucky, and you work for Sean Parent or another exceptional programmer, he will code review your hacks and show you how not to program in C++. Otherwise, you might be kept in the dark for decades, accumulating self-inflicted wounds and dreaming of dog hairdos.

Resource Management

I started this post with examples of resource management (strictly speaking, memory management), because this is one of my personal favorites. I’ve been advocating and writing about it since the nineties (see the bibliography at the end). Obviously I have failed, because 20 years later resource management techniques are still not universally known. Bjarne Stroustrup felt obliged to spend half of his opening talk explaining resource management to the crowd of advanced C++ programmers. Again, one could blame incompetent programmers for not accepting resource management as the foundation of C++ programming. The problem though is that there is nothing in the language that would tell a programmer that something is amiss in the code I listed at the beginning of this post. In fact, learning the correct techniques often feels like learning a new language.

Why is it so hard? Because in C++ the bulk of resource management is memory management. In fact it has to be stressed repeatedly that garbage collection would not solve the problem of managing resources: There will always be file handles, window handles, open databases and transactions, etc. These are important resources, but their management is overshadowed by the tedium of memory management. The reason C++ doesn’t have garbage collection is not because it can’t be done in an efficient way, but because C++ itself is hostile to GC. The compiler and the runtime have to always assume the worst — not only that any pointer can alias any other pointer but that a memory address can be stored as an integer or its lower bits could be used as bitfields (that’s why only conservative garbage collectors are considered for C++).

It’s a common but false belief that reference counting (using shared pointers in particular) is better than garbage collection. There is actual research showing that the two approaches are just two sides of the same coin. You should realize that deleting a shared pointer may lead to an arbitrarily long pause in program execution, with performance characteristics similar to those of a garbage sweep. It’s not only because every serious reference counting algorithm must be able to deal with cycles, but also because every time a reference count goes to zero on a piece of data, a whole graph of pointers reachable from that object has to be traversed. A data structure built with shared pointers might take a long time to delete and, except for simple cases, you’ll never know which shared pointer will go out of scope last and trigger it.
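A sketch of that effect (the list type is made up for illustration): the innocuous-looking reset below runs a million destructors in one go, and naive recursive destruction of such a chain can even blow the stack.

#include <memory>

struct Link {
    int value;
    std::shared_ptr<Link> next;
};

int main() {
    // Build a long chain held together by reference-counted pointers.
    std::shared_ptr<Link> head;
    for (int i = 0; i < 1000000; ++i)
        head = std::make_shared<Link>(Link{ i, head });

    head.reset();   // the last owner disappears: the whole chain is traversed
                    // and deleted right here, an arbitrarily long pause
}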

Careful resource management and sparing use of shared_ptr might still be defensible for single-threaded programs, but the moment you start using concurrency, you’re in big trouble. Every increment or decrement of the counter requires locking! This locking is usually implemented with atomic variables, but so are mutexes! Don’t be fooled: accessing atomic variables is expensive. Which brings me to the central problem with C++.

Concurrency and Parallelism

It’s been 8 years since Herb Sutter famously exclaimed: The Free Lunch is Over! Ever since then the big C++ oil tanker has been slowly changing its course. It’s not like concurrency was invented in 2005. Posix threads were defined in 1995. Microsoft introduced threads in Windows 95 and multiprocessor support in Windows NT. Still, concurrency was only acknowledged in the C++ Standard in 2011.

C++11 had to start from scratch. It had to define the memory model: when and in what order memory writes from multiple threads become visible to other threads. For all practical purposes, the C++ memory model was copied from Java (minus some controversial guarantees that Java made about behavior under data races). In a nutshell, C++ programs are sequentially consistent if there are no data races. However, since C++ had to compete with the assembly language, the full memory model includes so called weak atomics, which I would describe as portable data races, and recommend staying away from.
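A tiny illustration of that guarantee (a sketch, not taken from the Standard): with a plain bool flag the program below would have a data race and therefore undefined behavior; making the flag a default (sequentially consistent) std::atomic removes the race, and the data-race-free program behaves sequentially consistently.

#include <atomic>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};   // with a plain bool this would be a data race

void producer() {
    payload = 42;                 // ordinary write, published by...
    ready = true;                 // ...a sequentially consistent atomic store
}

void consumer() {
    while (!ready)                // atomic load; no race, so the program
        ;                         // stays sequentially consistent
    // payload is guaranteed to be 42 here
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}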

C++11 also defined primitives for thread creation and management, and basic synchronization primitives as defined by Dijkstra and Hoare back in the 1960s, such as mutexes and condition variables. One could argue whether these are indeed the right building blocks for synchronization, but maybe that doesn’t really matter because they are known not to be composable anyway. The composable abstraction for synchronization is STM (Software Transactional Memory), which is hard to implement correctly and efficiently in an imperative language. There is an STM study group in the Standards Committee, so there is a chance it might one day become part of the Standard. But because C++ offers no control over effects, it will be very hard to use properly.

There was also a misguided and confusing attempt at providing support for task-based parallelism with async tasks and non-composable futures (both seriously considered for deprecation in C++14). Thread-local variables were also standardized, making the task-based approach that much harder. Locks and condition variables are also tied to threads, not tasks. So that was pretty much a disaster. The Standards Committee has its work cut out for many years ahead. That includes task-based composable parallelism, communication channels to replace futures (one would hope), task cancellation and, probably longer term, data-driven parallelism, including GPU support. A derivative of Microsoft PPL and Intel TBB should become part of the Standard (hopefully not Microsoft AMP).

Let’s take a great leap of faith and assume that all these things will be standardized and implemented by, say, 2015. Even if that happens, I still don’t think people will be able to use C++ for mainstream parallel programming. C++ has been designed for single thread programming, and parallel programming requires a revolutionary rather than evolutionary change. Two words: data races. Imperative languages offer no protection against data races — maybe with the exception of D.

In C++, data is shared between threads by default, is mutable by default, and functions have side effects almost by default. All those pointers and references create fertile grounds for data races, and the vulnerability of data structures and functions to races is in no way reflected in the type system. In C++, even if you have a const reference to an object, there is no guarantee that another thread won’t modify it. Still worse, any references inside a const object are mutable by default.
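A small illustration of that shallowness (my example, with made-up names):

struct Node {
    int value;
    Node * next;
};

void reader(Node const & n) {
    // n.value = 1;            // error: you can't write through the const reference...
    n.next->value = 42;        // ...but anything n points to is still freely mutable
}

void writer(Node & n) {
    n.value = 7;               // and another thread holding a plain reference can
}                              // modify n while reader() is looking at it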

D at least has the notion of deep constness and immutability (no thread can change an immutable data structure). Another nod towards concurrency from D is the ability to define pure functions. Also, in D, mutable objects are not shared between threads by default. It is a step in the right direction, even though it imposes runtime cost for shared objects. Most importantly though, threads are not a good abstraction for parallel programming, so this approach won’t work with lightweight tasks and work-stealing queues, where tasks are passed between threads.

But C++ doesn’t support any of this and it doesn’t look like it ever will.

Of course, you might recognize all these pro-concurrency and parallelism features as functional programming — immutability and pure functions in particular. At the risk of sounding repetitive: Haskell is way ahead of the curve with respect to parallelism, including GPU programming. That was the reason I so easily converted to Haskell after years of evangelizing good programming practices in C++. Every programmer who’s serious about concurrency and parallelism should learn enough Haskell to understand how it deals with it. There is an excellent book by Simon Marlow, Parallel and Concurrent Programming in Haskell. After you read it, you will either start using functional techniques in your C++ programming, or realize what an impedance mismatch there is between parallel programming and an imperative language, and you will switch to Haskell.

Conclusions

I believe that the C++ language and its philosophy are in direct conflict with the requirements of parallel programming. This conflict is responsible for the very slow uptake of parallel programming in mainstream software development. The power of multicore processors, vector units, and GPUs is being squandered by the industry because of an obsolete programming paradigm.

Bibliography

Here I put together some of my publications about resource management:

  1. Bartosz Milewski, “Resource Management in C++,” Journal of Object Oriented Programming, March/April 1997, Vol. 10, No 1. p. 14-22. This is still pre-unique_ptr, so I’m using auto_ptr for what it’s worth. Since you can’t have vectors of auto_ptr I implemented an auto_vector.
  2. C++ Report in September 1998 and February 1999 (still using auto_ptr).
  3. C++ in Action (still auto_ptr), Addison Wesley 2001. See an excerpt from this book that talks about resource management.
  4. Walking Down Memory Lane, with Andrei Alexandrescu, CUJ October 2005 (using unique_ptr)
  5. unique_ptr–How Unique is it?, WordPress, 2009

Here are some of my blogs criticizing the C++11 approach to concurrency:

  1. Async Tasks in C++11: Not Quite There Yet
  2. Broken promises–C++0x futures

It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. The C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion, so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, but some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among other things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to smaller units of execution, the tasks? If multiple tasks are reusing the same thread, should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Pattern Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion. The pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel, no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start, and those N block waiting for responses from the rest, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with launch_policy::async must always create, or at least immediately steal, a thread.

The situation is different with std::async called with the default launch policy (the bitwise OR of launch_policy::async and launch_policy::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.
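For reference, here’s a sketch of the two cases in standard C++11 terms (nothing implementation-specific):

#include <future>
#include <iostream>

int work() { return 42; }

int main() {
    // Pure async: must behave as if run on a fresh thread of execution.
    std::future<int> f1 = std::async(std::launch::async, work);

    // Default policy (async | deferred): the runtime may defer the task and
    // run it in the thread that eventually calls get(); no parallelism is
    // guaranteed, and thread_local variables may be shared with that caller.
    std::future<int> f2 =
        std::async(std::launch::async | std::launch::deferred, work);

    std::cout << f1.get() + f2.get() << '\n';
}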

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost’s implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.
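For clarity, a quick sketch of the distinction at stake (my example, not from the discussion):

#include <string>

thread_local int counter = 0;         // POD thread-local: fine under the proposal

thread_local std::string lastError;   // non-POD thread-local: its constructor and
                                      // destructor must run once per thread, which
                                      // is exactly what clashes with thread reuse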

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yaskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost::ASIO). This is a hot topic at Microsoft because the new Windows 8 runtime is based on asynchronous API — any call that might take more than 50ms is implemented as an asynchronous API. Of course you can program to asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.
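Transposed into C++, the style under discussion would look roughly like this; it is illustrative pseudocode for the proposed combinators (then, when_all, when_any), none of which exist on std::future in C++11, and async_download and scale are placeholder functions:

// Hypothetical, modeled on the futures proposals and on C#'s
// ContinueWith / WhenAll / WhenAny.
future<Image> img = async_download(urlA);

future<Image> thumb =
    img.then([](future<Image> f) {     // doesn't block; returns a new future
        return scale(f.get(), 0.1);
    });

auto both  = when_all(async_download(urlB), async_download(urlC));  // aggregate join
auto first = when_any(async_download(urlD), async_download(urlE));  // "select"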

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing to write a channels proposal). By the way, futures can also be viewed as channels, so this generalization might go a long way.

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.


My new blog post is at the FP Complete web site where I work now. It explains the major unsolved problem of imperative programming and why I turned to functional programming. There is also an animated discussion on reddit.


How would you like a job in the supercomputing industry? Programming those powerful Ks, Jaguars, Roadrunners, Blue Genes, or gigantic clusters of computers? How inspiring would that be?

Not much, according to the luminaries of the field. I went to a panel about the future of supercomputing at SC11, and learned that the future is… Fortran, MPI, OpenMP and CUDA. I have no reason to doubt the experts; after all some of them were with the industry when it was all ferrite core memory and punch cards. But it makes me wonder if there is a future at all for supercomputing, if things keep going in this direction.

Let me explain: Programming in Fortran, MPI (Message Passing Interface), OpenMP (a system of annotations for C or Fortran to help the compiler parallelize the program), and CUDA (Compute Unified Device Architecture for programming GPGPUs) is tedious, uninspiring, and boring.

I talked to a CS student who was demonstrating his summer work at the booth belonging to one of the large national labs. It was a project to improve Monte Carlo simulations of some physical processes. It was done, unsurprisingly, using MPI and OpenMP. I asked him what the exciting part of the job was. It was the learning of the Monte Carlo method. The rest was the tedium of combining barely compatible clunky programming paradigms into a workable program.

Why does it matter? Because a thriving industry or a company must attract talent. And talent can’t be bought, at least not easily. There was once a study which showed that, above a certain compensation level, talented people don’t care so much about salaries as they do about novelty, excitement, and freedom. Google knows it very well: They create exciting work environments (I call them day-care centers for programmers), and encourage their employees to spend 20% of their time pursuing their own projects. No wonder there is an underground pipeline from Microsoft to Google through which the talent keeps leaking out.

By the way, I worked for Microsoft back when it was exciting. Our salaries were rather mediocre, but we felt the urge to work long hours and weekends because we felt that our contributions mattered. Unlike today, sales and marketing were not driving the company, developers were.

To confuse matters even more for the executives, programmers are relatively cheap. The cooling bill for a data center dwarfs the cost of software development. Let’s face it, from a distance, a programmer might look just like another commodity, like a computer rack, air conditioner, or a router. This is even more pronounced in supercomputing, where a single rack might go for a million dollars–an equivalent of 10-20 programmer/years.

If you drain all the excitement from work, your company, or the whole industry, is bound to stagnate. Bored people don’t innovate. And we know from experience that, in high tech industries, if you don’t innovate, you die. Old programming paradigms might have worked for years, but new unmet challenges are piling up. A lot of work that required supercomputers in the past is now done on clusters of off-the-shelf components. Google owns one of the largest supercomputers in the world, and it’s all built from cheap commodity boxes. But Google lets its people innovate.

But not everything is bleak in the land of supercomputers. I have met two teams that were brimming with ideas and enthusiasm: one was Brad Chamberlain’s Cray Chapel team and the other was Hartmut Kaiser’s Louisiana State University Ste||ar team. I’m sure there were many others, but those were the ones I had the pleasure of meeting outside of the exhibition hall.

You can tell that a team is dedicated to a task if they can’t stop talking about their work even after a few beers. Young creative people are attracted like moths to interesting and challenging projects. I don’t think writing simulations using OpenMP and MPI, even if they run on a Cray X-MP, can generate this kind of enthusiasm.
