Persistent trees are more interesting than persistent lists, which were the topic of my previous blog. In this installment I will concentrate on binary search trees. Such trees store values that can be compared to each other (they support total ordering). Such trees may be used to implement sets, multisets, or associated arrays. Here I will focus on the simplest of those, the set — the others are an easy extensions of the same scheme.

A set must support insertion, and membership test (I’ll leave deletion as an exercise). These operations should be doable, on average, in logarithmic time, O(log(N)). Only balanced trees, however, can guarantee logarithmic time even in the worst case. A simple tree may sometimes degenerate to a singly-linked list, with performance dropping to O(N). I will start with a simple persistent tree and then proceed with a balanced red-black tree.

Persistent Binary Search Tree

As with lists, we will start with an abstract definition:

A tree is either empty or contains a left tree, a value, and a right tree.

This definition translates into a data structure with two constructors:

template<class T>
class Tree {
    Tree(); // empty tree
    Tree(Tree const & lft, T val, Tree const & rgt)

Just as we did with persistent lists, we’ll encode the empty/non-empty tree using null/non-null (shared) pointer to a node. A Node represents a non-empty tree:

   struct Node
       Node(std::shared_ptr<const Node> const & lft
          , T val
          , std::shared_ptr<const Node> const & rgt)
       : _lft(lft), _val(val), _rgt(rgt)

       std::shared_ptr<const Node> _lft;
       T _val;
       std::shared_ptr<const Node> _rgt;

Here’s the complete construction/deconstruction part of the tree. Notice how similar it is to the list from my previous post. All these methods are const O(1) time, as expected. As before, the trick is to construct a new object (Tree) from big immutable chunks (lft and rgt), which can be safely put inside shared pointers without the need for deep copying.

template<class T>
class Tree
    struct Node;
    explicit Tree(std::shared_ptr<const Node> const & node) 
    : _root(node) {} 
    Tree() {}
    Tree(Tree const & lft, T val, Tree const & rgt)
      : _root(std::make_shared<const Node>(lft._root, val, rgt._root))
        assert(lft.isEmpty() || lft.root() < val);
        assert(rgt.isEmpty() || val < rgt.root());       
    bool isEmpty() const { return !_root; }
    T root() const {
        return _root->_val;
    Tree left() const {
        return Tree(_root->_lft);
    Tree right() const {
        return Tree(_root->_rgt);
    std::shared_ptr<const Node> _root;


The persistent nature of the tree manifests itself in the implementation of insert. Instead of modifying the existing tree, insert creates a new tree with the new element inserted in the right place. The implementation is recursive, so imagine that you are at a subtree of a larger tree. This subtree might be empty. Inserting an element into an empty tree means creating a single-node tree with the value being inserted, x, and two empty children.

On the other hand, if you’re not in an empty tree, you can retrieve the root value y and compare it with x. If x is less then y, it has to be inserted into the left child. If it’s greater, it must go into the right child. In both cases we make recursive calls to insert. If x is neither less nor greater than y, we assume it’s equal (that’s why we need total order) and ignore it. Remember, we are implementing a set, which does not store duplicates.

Tree insert(T x) const {
    if (isEmpty())
        return Tree(Tree(), x, Tree());
    T y = root();
    if (x < y)
        return Tree(left().insert(x), y, right());
    else if (y < x)
        return Tree(left(), y, right().insert(x));
        return *this; // no duplicates

Now consider how many new nodes are created during an insertion. A new node is only created in the constructor of a tree (in the code: std::make_shared<const Node>(lft._root, val, rgt._root)). The left and right children are not copied, they are stored by reference. At every level of insert, a tree constructor is called at most once. So in the worst case, when we recurse all the way to the leaves of the tree, we only create h nodes, where h is the height of the tree. If the tree is not too much out of balance its height scales like a logarithm of the number of nodes. To give you some perspective, if you store a billion values in a tree, an insertion will cost you 30 copies on average. If you need a logarithmic bound on the worst case, you’d have to use balanced trees (see later).

If you study the algorithm more closely, you’ll notice that only the nodes that are on the path from the root to the point of insertion are modified.

Testing for membership in a persistent tree is no different than in a non-persistent one. Here’s the recursive algorithm:

bool member(T x) const {
    if (isEmpty())
        return false;
    T y = root();
    if (x < y)
        return left().member(x);
    else if (y < x)
        return right().member(x);
        return true;

When using C++11, you might take advantage of the initializer list constructor to initialize a tree in one big swoop like this:

Tree t{ 50, 40, 30, 10, 20, 30, 100, 0, 45, 55, 25, 15 };


Here’s the implementation of such constructor, which works in O(N*log(N)) average time (notice that it effectively sorts the elements, and O(N*log(N)) is the expected asymptotic behavior for sort):

Tree(std::initializer_list<T> init) {
    Tree t;
    for (T v: init) {
        t = t.insert(v);
    _root = t._root;

Persistent Red-Black Tree

If you want to keep your tree reasonably balanced — that is guarantee that its height is on the order of log(N) — you must do some rebalancing after inserts (or deletes). Care has to be taken to make sure that rebalancing doesn’t change the logarithmic behavior of those operations. The balance is often expressed using some invariants. You can’t just require that every path from root to leaf be of equal length, because that would constrain the number of elements to be always a power of two. So you must give it some slack.

In the case of a red-black tree, the invariants are formulated in terms of colors. Every node in the tree is marked as either red or black. These are the two invariants that have to be preserved by every operation:

  1. Red invariant: No red node can have a red child
  2. Black invariant: Every path from root to an empty leaf node must contain the same number of black nodes — the black height of the tree.

This way, if the shortest path in a tree is all black, the longest path could only be twice as long, containing one red node between each pair of black nodes. The height of such a tree could only vary between (all black) log(N) and (maximum red) 2*log(N).

With these constraints in mind, the re-balancing can be done in log(N) time by localizing the modifications to the nearest vicinity of the path from the root to the point of insertion or deletion.

Let’s start with basic definitions. The node of the tree will now store its color:

enum Color { R, B };

Otherwise, it’s the same as before:

    struct Node
        Node(Color c, 
            std::shared_ptr const & lft, 
            T val, 
            std::shared_ptr const & rgt)
            : _c(c), _lft(lft), _val(val), _rgt(rgt)
        Color _c;
        std::shared_ptr _lft;
        T _val;
        std::shared_ptr _rgt;

An empty tree will be considered black by convention.

The membership test ignores colors so we don’t have to re-implement it. In fact the search performance of a persistent RB Tree is exactly the same as that of an imperative RB Tree. You pay no penalty for persistence in search.

With insertion, you pay the penalty of having to copy the path from root to the insertion point, which doesn’t change its O(log(N)) asymptotic behavior. As I explained before, what you get in exchange is immutability of every copy of your data structure.

The Balancing

Let’s have a look at the previous version of insert and figure out how to modify it so the result preserves the RB Tree invariants.

Tree insert(T x) const {
    if (isEmpty())
        return Tree(Tree(), x, Tree());
    T y = root();
    if (x < y)
        return Tree(left().insert(x), y, right());
    else if (y < x)
        return Tree(left(), y, right().insert(x));
        return *this; // no duplicates

Let’s first consider the most difficult scenario: the insertion into a maximum capacity tree for a given black height. Such a tree has alternating levels of all black and all red nodes. The only way to increase its capacity is to increase its black height. The cheapest way to add one more black level to all paths (thus preserving the black invariant) is to do it at the root (for instance, lengthening all the path at the leaves would require O(N) red-to-black re-paintings).

So here’s the plan: We’ll insert a new node at the leaf level and make it red. This won’t break the black invariant, but may break the red invariant (if the parent node was red). We’ll then retrace our steps back to the root, percolating any red violation up. Then, at the top level, we’ll paint the resulting root black, thus killing two birds with one stone: If we ended up with a red violation at the top, this will fix it and, at the same time, increase the black height of the whole tree.

It’s important that during percolation we never break the black invariant.

So here’s how we execute this plan: insert will call the recursive insertion/re-balancing method ins, which might return a red-topped tree. We’ll paint that root black (if it’s already black, it won’t change anything) and return it to the caller:

RBTree insert(T x) const {
    RBTree t = ins(x);
    return RBTree(B, t.left(), t.root(), t.right());

In the implementation of ins, the first case deals with an empty tree. This situation happens when it’s the first insertion into an empty tree or when, during the recursive process, we’ve reached the insertion point at the bottom of the tree. We create a red node and return it to the caller:

if (isEmpty())
  return RBTree(R, RBTree(), x, RBTree());

Notice that, if this new node was inserted below another red node, we are creating a red violation. If that node was the root of the whole tree, insert will repaint it immediately. If it weren’t, and we pop one level up from recursion, we’ll see that violation. We can’t fix it at that point — for that we’ll have to pop one more level, up to the black parent, where we have more nodes to work with.

Here are the details of ins: We’ll follow the same logic as in the non-balanced tree, thus preserving the ordering of values; but instead of reconstructing the result tree on the spot we’ll call a function balance, which will do that for us in a semi-balanced way (that is, with a possibility of a red violation, but only at the very top).

RBTree ins(T x) const
    if (isEmpty())
        return RBTree(R, RBTree(), x, RBTree());
    T y = root();
    Color c = rootColor();
    if (x < y)
        return balance(c, left().ins(x), y, right());
    else if (y < x)
        return balance(c, left(), y, right().ins(x));
        return *this; // no duplicates

Just like the constructor of the red-black tree, balance takes the following arguments: color, left subtree, value, and right subtree. Depending on the result of the comparison, the new element is inserted either into the left or the right subtree.

As I explained, balance, and consequently ins, cannot fix the red violation when they are sitting on it. All they can do is to make sure that the violation is at the very top of the tree they return. So when we call balance with the result of ins, as in:

balance(c, left().ins(x), y, right())


balance(c, left(), y, right().ins(x))

the left or the right subtree, respectively, may be semi-balanced. This is fine because balance can then rotate this violation away.

So the interesting cases for balance are the ones that rebuild a black node with either the left or the right subtree having a red violation at the top.

There are four possible cases depending on the position of the violation. In each case we can rearrange the nodes in such a way that the violation disappears and the ordering is preserved. In the pictures below I have numbered the nodes and subtrees according to the order of the values stored in them. Remember that all values in the left subtree are less than the value stored in the node, which in turn is less than all the values in the right subtree.

Fig 1

Rotating lft.doubledLeft()

Fig 1

Rotating lft.doubledRight()()

Fig 1

Rotating rgt.doubledLeft()

Fig 1

Rotating rgt.doubledRight()()

Each rotation creates a tree that preserves both invariants. Notice, however, that the result of the rotation is always red-tipped, even though we were rebuilding a node that was originally black. So if the parent of that node was red, our caller will produce a red violation (it will call balance with red color as its argument, which will fall through to the default case). This violation will be then dealt with at the parent’s parent level.

static RBTree balance(Color c
                    , RBTree const & lft
                    , T x
                    , RBTree const & rgt)
   if (c == B && lft.doubledLeft())
        return RBTree(R
                    , lft.left().paint(B)
                    , lft.root()
                    , RBTree(B, lft.right(), x, rgt));
    else if (c == B && lft.doubledRight())
        return RBTree(R
                    , RBTree(B, lft.left(), lft.root(), lft.right().left())
                    , lft.right().root()
                    , RBTree(B, lft.right().right(), x, rgt));
    else if (c == B && rgt.doubledLeft())
        return RBTree(R
                    , RBTree(B, lft, x, rgt.left().left())
                    , rgt.left().root()
                    , RBTree(B, rgt.left().right(), rgt.root(), rgt.right()));
    else if (c == B && rgt.doubledRight())
        return RBTree(R
                    , RBTree(B, lft, x, rgt.left())
                    , rgt.root()
                    , rgt.right().paint(B));
        return RBTree(c, lft, x, rgt);

For completeness, here are the auxiliary methods used in the implementation of balance:

bool doubledLeft() const {
    return !isEmpty()
        && rootColor() == R
        && !left().isEmpty()
        && left().rootColor() == R;
bool doubledRight() const {
    return !isEmpty()
        && rootColor() == R
        && !right().isEmpty()
        && right().rootColor() == R;
RBTree paint(Color c) const {
    return RBTree(c, left(), root(), right());


Our implementation of the persistent red-black tree follows the Chris Okasaki’s book. As Chris asserts, this is one of the fastest implementations there is, and he offers hints to make it even faster. Of course there are many imperative implementations of red-black trees, including STL’s std::set and std::map. Persistent RB-trees match their performance perfectly when it comes to searching. Insertion and deletion, which are O(log(N)) for either implementation, are slower by a constant factor because of the need to copy the path from root to leaf. On the other hand, the persistent implementation is thread-safe and synchronization-free (except for reference counting in shared_ptr — see discussion in my previous blog).

Complete code is available at GitHub.


I’d like to thank Eric Niebler for reading the draft and telling me which of my explanations were more abstruse than usual.

Haskell Code

For comparison, here’s the original Haskell code. You can see that the C++ implementation preserves its structure pretty well. With proper optimization tricks (unboxing and eager evaluation) the Haskell code should perform as well as its C++ translation.

Regular (unbalanced) binary search tree:

data Tree a = Empty | Node (Tree a) a (Tree a)

member x Empty = False
member x (Node lft y rgt) =
    if x < y then member x lft
    else if y < x then member x rgt
    else True

insert x Empty = Node Empty x Empty
insert x t@(Node lft y rgt) =
    if x < y then Node (insert x lft) y rgt
    else if y < x then Node lft y (insert x rgt)
    else t

Balanced Red-Black tree:

data Color = R | B

data Tree a = Empty | Node Color (Tree a) a (Tree a)

member x Empty = False
member x (Node _ lft y rgt) =
    if x < y then member x lft
    else if y < x then member x rgt
    else True

insert x tree = Node B left val right
      ins Empty = Node R Empty x Empty
      ins t@(Node c lft y rgt) =
          if (x < y) then balance c (ins lft) y rgt
          else if (y < x) then balance c lft y (ins rgt)
          else t
      Node _ left val right = ins tree -- pattern match result of ins

balance B (Node R (Node R a x b) y c) z d = 
    Node R (Node B a x b) y (Node B c z d)
balance B (Node R a x (Node R b y c)) z d = 
    Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R (Node R b y c) z d) = 
    Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R b y (Node R c z d)) = 
    Node R (Node B a x b) y (Node B c z d)
balance color a x b = Node color a x b

“Data structures in functional languages are immutable.”

What?! How can you write programs if you can’t mutate data? To an imperative programmer this sounds like anathema. “Are you telling me that I can’t change a value stored in a vector, delete a node in a tree, or push an element on a stack?” Well, yes and no. It’s all a matter of interpretation. When you give me a list and I give you back the same list with one more element, have I modified it or have I constructed a brand new list with the same elements plus one more?

Why would you care? Actually, you might care if you are still holding on to your original list. Has that list changed? In a functional language, the original data structure will remain unmodified! The version from before the modification persists — hence such data structures are called persistent (it has nothing to do with being storable on disk).

In this post I will discuss the following aspects of persistent data structures:

  • They are immutable, therefore
    • They are easier to reason about and maintain
    • They are thread-safe
  • They can be implemented efficiently
  • They require some type of resource management to “garbage collect” them.

I will illustrate these aspects on a simple example of a persistent linked list implemented in C++.


There is a wealth of persistent data structures in functional languages, a lot of them based on the seminal book by Chris Okasaki, Purely Functional Data Structures (based on his thesis, which is available online). Unfortunately, persistent data structures haven’t found their way into imperative programming yet. In this series of blog posts I’ll try to provide the motivation for using functional data structures in imperative languages and start translating some of them into C++. I believe that persistent data structures should be part of every C++ programmer’s toolbox, especially one interested in concurrency and parallelism.

Persistence and Immutability

What’s the advantage of persistent data structures? For one, they behave as if they were immutable, yet you can modify them. The trick is that the modifications never spread to the aliases of the data structure — nobody else can observe them other that the mutator itself. This way you avoid any implicit long-distance couplings. This is important for program maintenance — you know that your bug fixes and feature tweaks will remain localized, and you don’t have to worry about breaking remote parts of the program that happen to have access to the same data structure.

There is another crucial advantage of immutable data structures — they are thread safe! You can’t have a data race on a structure that is read-only. Since copies of a persistent data structure are immutable, you don’t need to synchronize access to them. (This is, by the way, why concurrent programming is much easier in functional languages.)

Persistence and Multithreading

So if you want just one reason to use persistent data structures — it’s multithreading. A lot of conventional wisdom about performance is void in the face of multithreading. Concurrent and parallel programming introduces new performance criteria. It forces you to balance the cost of accessing data structures against the cost of synchronization.

Synchronization is hard to figure out correctly in the first place and is fraught with such dangers as data races, deadlocks, livelocks, inversions of control, etc. But even if your synchronization is correct, it may easily kill your performance. Locks in the wrong places can be expensive. You might be tempted to use a traditional mutable data structure like a vector or a tree under a single lock, but that creates a bottleneck for other threads. Or you might be tempted to handcraft your own fine granularity locking schemes; which is tantamount to designing your own data structures, whose correctness and performance characteristics are very hard to estimate. Even the lock-free data structures of proven correctness can incur substantial synchronization penalty by spinning on atomic variables (more about it later).

The fastest synchronization is no synchronization at all. That’s why the holy grail of parallelism is either not to share, or share only immutable data structures. Persistent data structures offer a special kind of mutability without the need for synchronization. You are free to share these data structures without synchronization because they never change under you. Mutation is accomplished by constructing a new object.

Persistence and Performance

So what’s the catch? You guessed it — performance! A naive implementation of a persistent data structure would require a lot of copying — the smallest modification would produce a new copy of the whole data structure. Fortunately, this is not necessary, and most implementations try to minimize copying by essentially storing only the “deltas.” Half of this blog will be about performance analysis and showing that you can have your cake and eat it too.

Every data structure has unique performance characteristics. If you judge a C++ vector by the performance of indexed access, its performance is excellent: it’s O(1) — constant time. But if you judge it by the performance of insert, which is O(N) — linear time — it’s pretty bad. Similarly, persistent data structures have their good sides and bad sides. Appending to the end of a persistent singly-linked list, for instance, is O(N), but push and pop from the front are a comfortable O(1).

Most importantly, major work has been done designing efficient persistent data structures. In many cases they closely match the performance of mutable data structures, or are within a logarithm from them.

Persistent List: First Cut

Before I get to more advanced data structures in the future installments of this blog, I’d like to start with the simplest of all: singly-linked list. Because it’s so simple, it will be easy to demonstrate the craft behind efficient implementation of persistency.

We all know how to implement a singly linked list in C++. Let’s take a slightly different approach here and define it abstractly first. Here’s the definition:

A list of T is either empty or consists of an element of type T followed by a list of T.

This definition translates directly into a generic data structure with two constructors, one for the empty list and another taking the value/list (head/tail) pair:

template<class T>
class List {
    List(T val, List tail);

Here’s the trick: Since we are going to make all Lists immutable, we can guarantee that the second argument to the second constructor (marked in red) is forever frozen. Therefore we don’t have to deep-copy it, we can just store a reference to it inside the list. This way we can implement this constructor to be O(1), both in time and space. It also means that those List modifications that involve only the head of the list will have constant cost — because they can all share the same tail. This is very important, because the naive copying implementation would require O(N) time. You’ll see the same pattern in other persistent data structures: they are constructed from big immutable chunks that can be safely aliased rather than copied. Of course this brings the problem of being able to collect the no-longer-referenced tails — I’ll talk about it later.

Another important consequence of immutability is that there are two kinds of Lists: empty and non-empty. A List that was created empty will always remain empty. So the most important question about a list will be whether it’s empty or not. Something in the implementation of the list must store this information. We could, for instance, have a Boolean data member, _isEmpty, but that’s not what we do in C++. For better or worse we use this “clever” trick called the null pointer. So a List is really a pointer that can either be null or point to the first element of the list. That’s why there is no overhead in passing a list by value — we’re just copying a single pointer.

Is it a shallow or a deep copy? Technically it’s shallow, but because the List is (deeply) immutable, there is no observable difference.

Here’s the code that reflects the discussion so far:

template<class T>
class List
    struct Item;
    List() : _head(nullptr) {}
    List(T v, List tail) : _head(new Item(v, tail._head)) {}
    bool isEmpty() const { return !_head; }
    // may be null
    Item const * _head;

The Item contains the value, _val and a pointer _next to the (constant) Item (which may be shared between many lists):

    struct Item
        Item(T v, Item const * tail) : _val(v), _next(tail) {}
        T _val;
        Item const * _next;

The fact that items are (deeply) immutable can’t be expressed in the C++ type system, so we use (shallow) constness and recursive reasoning to enforce it.

In a functional language, once you’ve defined your constructors, you can automatically use them for pattern matching in order to “deconstruct” objects. For instance, an empty list would match the empty list pattern/constructor and a non-empty list would match the (head, tail) pattern/constructor (often called “cons”, after Lisp). Instead, in C++ we have to define accessors like front, which returns the head of the list, and pop_front, which returns the tail:

    T front() const
        return _head->_val;
    List pop_front() const
        return List(_head->_next);

In the implementation of pop_front I used an additional private constructor:

    explicit List (Item const * items) : _head(items) {}

Notice the assertions: You are not supposed to call front or pop_front on an empty list. Make sure you always check isEmpty before you call them. Admittedly, this kind of interface exposes the programmer to potential bugs (forgetting to check for an empty list) — something that the pattern-matching approach of functional languages mitigates to a large extent. You could make these two methods safer in C++ by using boost optional, but not without some awkwardness and performance overhead.

We have defined five primitives: two constructors plus isEmpty, front, and pop_front that completely describe a persistent list and are all of order O(1). Everything else can be implemented using those five. For instance, we may add a helper method push_front:

    List push_front(T v) const {
       return List(v, *this);

Notice that push_front does not modify the list — it returns a new list with the new element at its head. Because of the implicit sharing of the tail, push_front is executed in O(1) time and takes O(1) space.

The list is essentially a LIFO stack and its asymptotic behavior is the same as that of the std::vector implementation of stack (without the random access, and with front and back inverted). There is an additional constant cost of allocating Items (and deallocating them, as we’ll see soon) both in terms of time and space. In return we are gaining multithreaded performance by avoiding the the need to lock our immutable data structures.

I’m not going to argue whether this tradeoff is always positive in the case of simple lists. Remember, I used lists to demonstrate the principles behind persistent data structures in the simplest setting. In the next installment though, I’m going to make a stronger case for tree-based data structures.

Reference Counting

If we could use garbage collection in C++ (and there are plans to add it to the Standard) we’d be done with the List, at least in the case of no hard resources (the ones that require finalization). As it is, we better come up with the scheme for releasing both memory and hard resources owned by lists. Since persistent data structures use sharing in their implementation, the simplest thing to do is to replace naked pointers with shared pointers.

Let’s start with the pointer to the head of the list:

std::shared_ptr<const Item> _head;

We no longer need to initialize _head to nullptr in the empty list constructor because shared_ptr does it for us. We need, however, to construct a new shared_ptr when creating a list form a head and a tail:

List() {}
List(T v, List const & tail) 
    : _head(std::make_shared<Item>(v, tail._head)) {}

Item itself needs a shared_ptr as _next:

struct Item
    Item(T v, std::shared_ptr<const Item> const & tail) 
        : _val(v), _next(tail) {}
    T _val;
    std::shared_ptr<const Item> _next;

Surprisingly, these are the only changes we have to make. Everything else just works. Every time a shared_ptr is copied, as in the constructors of List and Item, a reference count is automatically increased. Every time a List goes out of scope, the destructor of _head decrements the reference count of the first Item of the list. If that item is not shared, it is deleted, which decreases the reference count of the next Item, and so on.

Let’s talk about performance again, because now we have to deal with memory management. Reference counting doesn’t come for free. First, a standard implementation of shared_ptr consists of two pointers — one pointing to data, the other to the reference counter (this is why I’m now passing List by const reference rather than by value — although it’s not clear if this makes much difference).

Notice that I was careful to always do Item allocations using make_shared, rather than allocating data using new and then turning it into a shared_ptr. This way the counter is allocated in the same memory block as the Item. This not only avoids the overhead of a separate call to new for the (shared) counter, but also helps locality of reference.

Then there is the issue of accessing the counter. Notice that the counter is only accessed when an Item is constructed or destroyed, and not, for instance, when the list is traversed. So that’s good. What’s not good is that, in a multithreaded environment, counter access requires synchronization. Granted, this is usually the lock-free kind of synchronization provided by shared_ptr, but it’s still there.

So my original claim that persistent data structures didn’t require synchronization was not exactly correct in a non-garbage-collected environment. The problem is somewhat mitigated by the fact that this synchronization happens only during construction and destruction, which are already heavy duty allocator operations with their own synchronization needs.

The cost of synchronization varies depending on how much contention there is. If there are only a few threads modifying a shared list, collisions are rare and the cost of a counter update is just one CAS (Compare And Swap) or an equivalent atomic operation. The overhead is different on different processor, but the important observation is that it’s the same overhead as in an efficient implementation of a mutex in the absence of contention (the so called thin locks or futexes require just one CAS to enter the critical section — see my blog about thin locks).

At high contention, when there are a lot of collisions, the reference count synchronization degenerates to a spin lock. (A mutex, on the other hand, would fall back on the operating system, since it must enqueue blocked threads). This high contention regime, however, is unlikely in the normal usage of persistent data structures.

A little digression about memory management is in order. Allocating Items from a garbage-collected heap would likely be more efficient, because then persistent objects would really require zero synchronization, especially if we had separate per-processor heaps. It’s been known for some time that the tradeoff between automated garbage collection (GC) and reference counting (RC) is far from obvious. David Bacon et. al. showed that, rather than there being one most efficient approach, there is a whole spectrum of solutions between GC and RC, each with their own performance tradeoffs.

There is a popular belief that GC always leads to long unexpected pauses in the execution of the program. This used to be true in the old times, but now we have incremental concurrent garbage collectors that either never “stop the world” or stop it for short bounded periods of time (just do the internet search for “parallel incremental garbage collection”). On the other hand, manual memory management a la C++ has latency problems of its own. Data structures that use bulk allocation, like vectors, have to occasionally double their size and copy all elements. In a multithreaded environment, this not only blocks the current thread from making progress but, if the vector is shared, may block other threads as well.

The use of shared_ptr in the implementation of containers may also result in arbitrarily long and quite unpredictable slowdowns. A destruction of a single shared_ptr might occasionally lead to a cascade of dereferences that deallocate large portions of a data structure, which may in turn trigger a bout of free list compactions within the heap (this is more evident in tree-like, branching, data structures). It’s important to keep these facts in mind when talking about performance tradeoffs, and use actual timings in choosing implementations.

List Functions

Since, as I said, a persistent list is immutable, we obviously cannot perform destructive operations on it. If we want to increment each element of a list of integers, for instance, we have to create a new list (which, by the way, doesn’t change the asymptotic behavior of such an operation). In functional languages such bulk operations are normally implemented using recursion.

You don’t see much recursion in C++ because of one problem: C++ doesn’t implement tail recursion optimization. In any functional language worth its salt, a recursive function that calls itself in its final step is automatically replaced by a loop. In C++, recursion consumes stack and may lead to stack overflow. So it’s the lack of guaranteed tail recursion optimization that is at the root of C++ programmers’ aversion to recursion. Of course, there are also algorithms that cannot be made tail recursive, like tree traversals, which are nevertheless implemented using recursion even in C++. One can make an argument that (balanced) tree algorithms will only use O(log(N)) amounts of stack, thus mitigating the danger of stack overflow.

List algorithms may be implemented either using recursion or loops and iterators. I’ll leave the implementation of iterators for a persistent list to the reader — notice that only a const forward iterator or an output iterator make sense in this case. Instead I’ll show you a few examples of recursive algorithms. They can all be rewritten using loops and iterators, but it’s interesting to see them in the purest form.

The example of incrementing each element of a list is a special case of a more general algorithm of applying a function to all elements of a list. This algorithm is usually called fmap and can be generalized to a large variety of data structures. Those parameterized data structures that support fmap are called functors (not to be confused with the common C++ misnomer for a function object). Here’s fmap for our persistent list:

template<class U, class T, class F>
List<U> fmap(F f, List<T> lst)
    static_assert(std::is_convertible<F, std::function<U(T)>>::value, 
                 "fmap requires a function type U(T)");
    if (lst.isEmpty()) 
        return List<U>();
        return List<U>(f(lst.front()), fmap<U>(f, lst.pop_front()));

An astute reader will notice a similarity between fmap and the standard C++ algorithm transform in both semantics and interface. The power of the Standard Template Library can be traced back to its roots in functional programming.

The static_assert verifies that the the template argument F is convertible to a function type that takes T and returns U. This way fmap may be instantiated for a function pointer, function object (a class with an overloaded operator()), or a lambda, as long as its output type is convertible to U. Ultimately, these kind of constraints should be expressible as concepts.

The compiler is usually able to infer type arguments for a template function by analyzing the instantiation context. Unfortunately, inferring the return type of a functional argument like F in fmap is beyond its abilities, so you are forced to specify the type of U at the call site, as in this example (also, toupper is defined to return an int rather than char):

auto charLst2 = fmap<char>(toupper, charLst);

There is a common structure to recursive functions operating on functional data structures. They usually branch on, essentially, different constructors. In the implementation of fmap, we first check for an empty list — the result of the empty constructor — otherwise we deconstruct the (head, tail) constructor. We apply the function f to the head and then recurse into the tail.

Notice that fmap produces a new list of the same shape (number and arrangement of elements) as the original list. There are also algorithms that either change the shape of the list, or produce some kind of a “total” from a list. An example of the former is filter:

template<class T, class P>
List<T> filter(P p, List<T> lst)
    static_assert(std::is_convertible<P, std::function<bool(T)>>::value, 
                 "filter requires a function type bool(T)");
    if (lst.isEmpty())
        return List<T>();
    if (p(lst.front()))
        return List<T>(lst.front(), filter(p, lst.pop_front()));
        return filter(p, lst.pop_front());

Totaling a list requires some kind of running accumulator and a function to process an element of a list and “accumulate” it in the accumulator, whatever that means. We also need to define an “empty” accumulator to start with. For instance, if we want to sum up the elements of a list of integers, we’d use an integer as an accumulator, set it initially to zero, and define a function that adds an element of a list to the accumulator.

In general such accumulation may produce different results when applied left to right or right to left (although not in the case of summation). Therefore we need two such algorithms, foldl (fold left) and foldr (fold right).

The right fold first recurses into the tail of the list to produce a partial accumulator then applies the function f to the head of the list and that accumulator:

template<class T, class U, class F>
U foldr(F f, U acc, List<T> lst)
    static_assert(std::is_convertible<F, std::function<U(T, U)>>::value, 
                 "foldr requires a function type U(T, U)");
    if (lst.isEmpty())
        return acc;
        return f(lst.front(), foldr(f, acc, lst.pop_front()));

Conversely, the left fold first applies f to the head of the list and the accumulator that was passed in, and then calls itself recursively with the new accumulator and the tail of the list. Notice that, unlike foldr, foldl is tail recursive.

template<class T, class U, class F>
U foldl(F f, U acc, List<T> lst)
    static_assert(std::is_convertible<F, std::function<U(U, T)>>::value, 
                 "foldl requires a function type U(U, T)");
    if (lst.isEmpty())
        return acc;
        return foldl(f, f(acc, lst.front()), lst.pop_front());

Again, the STL implements a folding algorithm as well, called accumulate. I’ll leave it to the reader to figure out which fold it implements, left or right, and why.

In C++ we can have procedures that instead of (or along with) producing a return value produce side effects. We can capture this pattern with forEach:

template<class T, class F>
void forEach(List<T> lst, F f) 
    static_assert(std::is_convertible<F, std::function<void(T)>>::value, 
                 "forEach requires a function type void(T)");
    if (!lst.isEmpty()) {
        forEach(lst.pop_front(), f);

We can, for instance, use forEach to implement print:

template<class T>
void print(List<T> lst)
    forEach(lst, [](T v) 
        std::cout << "(" << v << ") "; 
    std::cout << std::endl;

Singly-linked list concatenation is not a cheap operation. It takes O(N) time (there are however persistent data structures that can do this in O(1) time). Here’s the recursive implementation of it:

List concat(List const & a, List const & b)
    if (a.isEmpty())
        return b;
    return List(a.front(), concat(a.pop_front(), b));

We can reverse a list using foldl in O(N) time. The trick is to use a new list as the accumulator:

template<class T>
List<T> reverse(List<T> const & lst)
    return foldl([](List<T> const & acc, T v)
        return List<T>(v, acc);
    }, List<T>(), lst);

Again, all these algorithms can be easily implemented using iteration rather than recursion. In fact, once you define (input/output) iterators for a List, you can just use the STL algorithms.


A singly linked list is not the most efficient data structure in the world but it can easily be made persistent. What’s important is that a persistent list supports all the operation of a FIFO stack in constant time and is automatically thread safe. You can safely and efficiently pass such lists to and from threads without the need to synchronize (except for the internal synchronization built into shared pointers).

Complete code for this post is available on GitHub. It uses some advanced features of C++11. I compiled it with Visual Studio 2013.

Appendix: Haskell Implementation

Functional languages support persistent lists natively. But it’s not hard to implement lists explicitly. The following one liner is one such implementation in Haskell:

data List t = Empty | Cons t (List t)

This list is parameterized by the type parameter t (just like our C++ template was parameterized by T). It has two constructors, one called Empty and the other called Cons. The latter takes two arguments: a value of type t and a List of t — the tail. These constructors can be used for both the creation of new lists and for pattern-matching. For instance, here’s the implementation of cat (the function concat is already defined in the Haskell default library so I had to use a different name):

cat Empty lst = lst
cat (Cons x tail) lst = Cons x (cat tail lst)

The selection between the empty and non-empty case is made through pattern matching on the first argument. The first line matches the pattern (constructor) Empty. The second line matches the Cons pattern and, at the same time, extracts the head and the tail of the list (extraction using pattern matching is thus much safer than calling head or tail because an empty list won’t match this pattern). It then constructs a new list with x as the head and the (recursive) concatenation of the first list’s tail with the second list. (I should mention that this recursive call is lazily evaluated in Haskell, so the cost of cat is amortized across many accesses to the list — more about lazy evaluation in the future.)

I’ve been looking for a good analogy of what programming in C++ feels like and I remembered this 1990 Tim Burton movie, Edward Scissorhands.

It’s a darker version of Pinocchio, shot in suburban settings. In this poster, the scary guy (Johnny Depp) is trying to gently hug Winona Ryder but his clumsy scissor-hands are making it very dangerous for both of them. His face is already covered with deep scars.

Having scissors for hands in not all that bad. Edward has many talents: he can, for instance, create stunning dog hairdos.

I often have these kinds of thoughts after attending C++ conferences: this time it was Going Native 2013. The previous year, the excitement was all about the shiny new C++11 Standard. This year it was more of a reality check. Don’t get me wrong — there were many stunning dog hairdos on display (I mean C++ code that was elegant and simple) but the bulk of the conference was about how to avoid mutilation and how to deliver first aid in case of accidental amputation.

Little shop of horrors

There was so much talk about how not to use C++ that it occurred to me that maybe this wasn’t the problem of incompetent programmers, but that straightforward C++ is plain wrong. So if you just learn the primitives of the language and try to use them, you’re doomed.

C++ has an excuse for that: backward compatibility — in particular compatibility with C. You might think of the C subset of C++ as bona fide assembly language which you shouldn’t use it in day-to-day programming, except that it’s right there on the surface. If you reach blindly into your C++ toolbox, you’re likely to come up with naked pointers, for loops, and all this ugly stuff.

A well known example of what not to do is to use malloc to dynamically allocate memory, and free to deallocate it. malloc takes a count of bytes and returns a void pointer, which you have to cast to something more usable — it would be hard to come up with worse API for memory management. Here’s an example of really bad (but almost correct, if it weren’t for the possibility of null pointer dereference) code:

struct Pod {
    int count;
    int * counters;

int n = 10;
Pod * pod = (Pod *) malloc (sizeof Pod);
pod->count = n
pod->counters = (int *) malloc (n * sizeof(int));
free (pod->counters);
free (pod);

Hopefully, nobody writes code like this in C++, although I’m sure there are a lot of legacy apps with such constructs, so don’t laugh.

C++ “solved” the problem of redundant casting and error-prone size calculations by replacing malloc and free with new and delete. The corrected C++ version of the code above would be:

struct Pod {
    int count;
    int * counters;

int n = 10;
Pod * pod = new Pod;
pod->count = n;
pod->counters = new int [n];
delete [] pod->counters;
delete pod;

BTW, the null pointer dereference problem is solved too, because new will throw an exception when the system runs out of memory. There is still a slight chance of a memory leak if the second new fails (But how often does that happen? Hint: how big can n get?) So here’s the really correct version of the code:

class Snd { // Sophisticated New Data
    Snd (int n) : _count(n), _counters(new int [n]) {}
    ~Snd () { delete [] _counters; }
    int _count;
    int * _counters;

Snd * snd = new Snd (10);
delete snd;

Are we done yet? Of course not! The code is not exception safe.

The C++ lore is that you should avoid naked pointers, avoid arrays, avoid delete. So the remedy for the lameness of malloc is operator new, which is also broken because it returns a dangerous pointer and pointers are bad.

We all know (and have scars on our faces to prove it) that you should use the Standard Library containers and smart pointers whenever possible. Oh, and use value semantics for passing things around. No wait! Value semantics comes with a performance penalty because of excessive copying. So what about shared_ptr and vectors of shared_ptr? But that adds the overhead of reference counting! No, here’s a new idea: move semantics and rvalue references.

I can go on and on like this (and I often do!). Do you see the pattern? Every remedy breeds another remedy. It’s no longer just the C subset that should be avoided. Every new language feature or library addition comes with a new series of gotchas. And you know a new feature is badly designed if Scott Meyers has a talk about it. (His latest was about the pitfalls of, you guessed it, move semantics.)

The Philosophy of C++

Bjarne Stroustrup keeps stressing how important backward compatibility is for C++. It’s one of the pillars of the C++ philosophy. Considering how much legacy code there is, it makes perfect sense. Compatibility, though, takes a very heavy toll on the evolution of the language. If nature were as serious about backward compatibility as C++ is, humans would still have tails, gills, flippers, antennae, and external skeletons on top of internal ones — they all made sense at some point in the evolution.

C++ has become an extremely complex language. There are countless ways of doing the same thing — almost all of them either plain wrong, dangerous, unmaintainable, or all of the above. The problem is that most code compiles and even runs. The mistakes and shortcomings are discovered much later, often after the product has been released.

You might say: Well, that’s the nature of programming. If you think so, you should seriously look at Haskell. Your first reaction will be: I don’t know how to implement the first thing (other than the factorial and Fibonacci numbers) in this extremely restrictive language. This is totally different from the C++ experience, where you can start hacking from day one. What you don’t realize is that it will take you 10 years, if you’re lucky, to discover the “right way” of programming in C++ (if there even is such thing). And guess what, the better a C++ programmer you are, the more functional your programs look like. Ask any C++ guru and they will tell you: avoid mutation, avoid side effects, don’t use loops, avoid class hierarchies and inheritance. But you will need strict discipline and total control over your collaborators to pull that off because C++ is so permissive.

Haskell is not permissive, it won’t let you — or your coworkers — write unsafe code. Yes, initially you’ll be scratching your head trying to implement something in Haskell that you could hack in C++ in 10 minutes. If you’re lucky, and you work for Sean Parent or other exceptional programmer, he will code review your hacks and show you how not to program in C++. Otherwise, you might be kept in the dark for decades accumulating self-inflicted wounds and dreaming of dog hairdos.

Resource Management

I started this post with examples of resource management (strictly speaking, memory management), because this is one of my personal favorites. I’ve been advocating and writing about it since the nineties (see bibliography at the end). Obviously I have failed because 20 years later resource management techniques are still not universally known. Bjarne Stroustrup felt obliged to spend half of his opening talk explaining resource management to the crowd of advanced C++ programmers. Again, one could blame incompetent programmers for not accepting resource management as the foundation of C++ programming. The problem though is that there is nothing in the language that would tell a programmer that something is amiss in the code I listed in the beginning of this post. In fact it often feels like learning the correct techniques is like learning a new language.

Why is it so hard? Because in C++ the bulk of resource management is memory management. In fact it has to be stressed repeatedly that garbage collection would not solve the problem of managing resources: There will always be file handles, window handles, open databases and transactions, etc. These are important resources, but their management is overshadowed by the tedium of memory management. The reason C++ doesn’t have garbage collection is not because it can’t be done in an efficient way, but because C++ itself is hostile to GC. The compiler and the runtime have to always assume the worst — not only that any pointer can alias any other pointer but that a memory address can be stored as an integer or its lower bits could be used as bitfields (that’s why only conservative garbage collectors are considered for C++).

It’s a common but false belief that reference counting (using shared pointers in particular) is better than garbage collection. There is actual research showing that the two approaches are just two sides of the same coin. You should realize that deleting a shared pointer may lead to an arbitrary long pause in program execution, with similar performance characteristics as a garbage sweep. It’s not only because every serious reference counting algorithm must be able to deal with cycles, but also because every time a reference count goes to zero on a piece of data a whole graph of pointers reachable from that object has to be traversed. A data structure built with shared pointers might take a long time to delete and, except for simple cases, you’ll never know which shared pointer will go out of scope last and trigger it.

Careful resource management and spare use of shared_ptr might still be defendable for single-threaded programs, but the moment you start using concurrency, you’re in big trouble. Every increment or decrement of the counter requires locking! This locking is usually implemented with atomic variables, but so are mutexes! Don’t be fooled: accessing atomic variables is expensive. Which brings me to the central problem with C++.

Concurrency and Parallelism

It’s been 8 years since Herb Sutter famously exclaimed: The Free Lunch is Over! Ever since then the big C++ oil tanker has been slowly changing its course. It’s not like concurrency was invented in 2005. Posix threads have been defined in 1995. Microsoft introduced threads in Windows 95 and multiprocessor support in Windows NT. Still, concurrency has only been acknowledged in the C++ Standard in 2011.

C++11 had to start from scratch. It had to define the memory model: when and in what order memory writes from multiple threads become visible to other threads. For all practical purposes, the C++ memory model was copied from Java (minus some controversial guarantees that Java made about behavior under data races). In a nutshell, C++ programs are sequentially consistent if there are no data races. However, since C++ had to compete with the assembly language, the full memory model includes so called weak atomics, which I would describe as portable data races, and recommend staying away from.

C++11 also defined primitives for thread creation and management, and basic synchronization primitives as defined by Dijkstra and Hoare back in the 1960s, such as mutexes and condition variables. One could argue whether these are indeed the right building blocks for synchronization, but maybe that doesn’t really matter because they are known not to be composable anyway. The composable abstraction for synchronization is STM (Software Transactional Memory), which is hard to implement correctly and efficiently in an imperative language. There is an STM study group in the Standards Committee, so there is a chance it might one day become part of the Standard. But because C++ offers no control over effects, it will be very hard to use properly.

There was also a misguided and confusing attempt at providing support for task-based parallelism with async tasks and non-composable futures (both seriously considered for deprecation in C++14). Thread-local variables were also standardized making task-based approach that much harder. Locks and condition variables are also tied to threads, not tasks. So that was pretty much a disaster. The Standards Committee has the work cut out for them for many years ahead. That includes task-based composable parallelism, communication channels to replace futures (one would hope), task cancellation and, probably longer term, data-driven parallelism, including GPU support. A derivative of Microsoft PPL and Intel TBB should become part of the Standard (hopefully not Microsoft AMP).

Let’s take a great leap of faith and assume that all these things will be standardized and implemented by, say, 2015. Even if that happens, I still don’t think people will be able to use C++ for mainstream parallel programming. C++ has been designed for single thread programming, and parallel programming requires a revolutionary rather than evolutionary change. Two words: data races. Imperative languages offer no protection against data races — maybe with the exception of D.

In C++, data is shared between threads by default, is mutable by default, and functions have side effects almost by default. All those pointers and references create fertile grounds for data races, and the vulnerability of data structures and functions to races is in no way reflected in the type system. In C++, even if you have a const reference to an object, there is no guarantee that another thread won’t modify it. Still worse, any references inside a const object are mutable by default.

D at least has the notion of deep constness and immutability (no thread can change an immutable data structure). Another nod towards concurrency from D is the ability to define pure functions. Also, in D, mutable objects are not shared between threads by default. It is a step in the right direction, even though it imposes runtime cost for shared objects. Most importantly though, threads are not a good abstraction for parallel programming, so this approach won’t work with lightweight tasks and work-stealing queues, where tasks are passed between threads.

But C++ doesn’t support any of this and it doesn’t look like it ever will.

Of course, you might recognize all these pro-concurrency and parallelism features as functional programming — immutability and pure functions in particular. At the risk of sounding repetitive: Haskell is way ahead of the curve with respect to parallelism, including GPU programming. That was the reason I so easily converted to Haskell after years of evangelizing good programming practices in C++. Every programmer who’s serious about concurrency and parallelism should learn enough Haskell to understand how it deals with it. There is an excellent book by Simon Marlow, Parallel and Concurrent Programming in Haskell. After you read it, you will either start using functional techniques in your C++ programming, or realize what an impedance mismatch there is between parallel programming and an imperative language, and you will switch to Haskell.


I believe that the C++ language and its philosophy are in direct conflict with the requirements of parallel programming. This conflict is responsible for the very slow uptake of parallel programming in mainstream software development. The power of multicore processors, vector units, and GPUs is being squandered by the industry because of an obsolete programming paradigm.


Here I put together some of my publications about resource management:

  1. Bartosz Milewski, “Resource Management in C++,” Journal of Object Oriented Programming, March/April 1997, Vol. 10, No 1. p. 14-22. This is still pre-unique_ptr, so I’m using auto_ptr for what it’s worth. Since you can’t have vectors of auto_ptr I implemented an auto_vector.
  2. C++ Report in September 1998 and February 1999 (still using auto_ptr).
  3. C++ in Action (still auto_ptr), Addison Wesley 2001. See an excerpt from this book that talks about resource management.
  4. Walking Down Memory Lane, with Andrei Alexandrescu, CUJ October 2005 (using unique_ptr)
  5. unique_ptr–How Unique is it?, WordPress, 2009

Here are some of my blogs criticizing the C++11 approach to concurrency:

  1. Async Tasks in C++11: Not Quite There Yet
  2. Broken promises–C++0x futures

I’ve been dealing with concurrency for many years in the context of C++ and D (see my presentation about Software Transactional Memory in D). I worked for a startup, Corensic, that made an ingenious tool called Jinx for detecting data races using a lightweight hypervisor. I recorded a series of screencasts teaching concurrency to C++ programmers. If you follow these screencasts you might realize that I was strongly favoring functional approach to concurrency, promoting immutability and pure functions. I even showed how non-functional looking code leads to data races that could be detected by (now defunct) Jinx. Essentially I was teaching C++ programmers how to imitate Haskell.

Except that C++ could only support a small subset of Haskell functionality and provide no guarantees against even the most common concurrency errors.

It’s unfortunate that most programmers haven’t seen what Haskell can do with concurrency. There is a natural barrier to learning a new language, especially one that has the reputation of being mind boggling. A lot of people are turned off simply by unfamiliar syntax. They can’t get over the fact that in Haskell a function call uses no parentheses or commas. What in C++ looks like

f(x, y)

in Haskell is written as:

f x y

Other people dread the word “monad,” which in Haskell is essentially a tool for embedding imperative code inside functional code.

Why is Haskell syntax the way it is? Couldn’t it have been more C- or Java- like? Well, look no farther than Scala. Because of Java-like syntax the most basic functional trick, currying a function, requires a Scala programmer to anticipate this possibility in the function definition (using special syntax: multiple pairs of parentheses). If you have no access to the source code for a function, you can’t curry it (at least not without even more syntactic gymnastics).

If you managed to wade through this post so far, you are probably not faint of heart and you will accept the challenge of reading an excellent article by Simon Peyton Jones, Beautiful Concurrency that does a great job of explaining to the uninitiated Haskell’s beautiful approach to concurrency. It’s not a new article, but it has been adapted to the new format of the School of Haskell by Yours Truly, which means you’ll be able to run code examples embedded in it. If you feel adventurous, you might even edit them and see how the results change.

It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought the things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among others things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to a smaller units of execution, the tasks?. If multiple tasks are reusing the same thread should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Pattern Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and it also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion. Pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start and possibly block waiting for responses from the rest of them, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with launch_policy::async must always create, or at least immediately steal, a thread.

The situation is different with std::async called with the default launch policy (the bitwise OR of launch_policy::async and launch_policy::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: Disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yaskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost::ASIO). This is a hot topic at Microsoft because the new Windows 8 runtime is based on asynchronous API — any call that might take more than 50ms is implemented as an asynchronous API. Of course you can program to asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing who will write a channels proposal). By the way, futures can be also viewed as channels, so this generalization might go a long way.

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.

My new blog post is at the FP Commplete web site where I work now. It explains the major unsolved problem of imperative programming and why I turned to functional programming. There is also an animated discussion on reddit.


How would you like a job in the supercomputing industry? Programming those powerful Ks, Jaguars, Roadrunners, Blue Genes, or gigantic clusters of computers? How inspiring would that be?

Not much, according to the luminaries of the field. I went to a panel about the future of supercomputing at SC11, and learned that the future is… Fortran, MPI, OpenMP and CUDA. I have no reason to doubt the experts; after all some of them were with the industry when it was all ferrite core memory and punch cards. But it makes me wonder if there is a future at all for supercomputing, if things keep going in this direction.

Let me explain: Programming in Fortran, MPI (Message Passing Interface), OpenMP (a system of annotations for C or Fortran to help the compiler parallelize the program), and CUDA (Compute Unified Device Architecture for programming GPGPUs) is tedious, uninspiring, and boring.

I talked to a CS student who was demonstrating his summer work at the booth belonging to one of the large national labs. It was a project to improve Monte Carlo simulations of some physical processes. It was done, unsurprisingly, using MPI and OpenMP. I asked him what the exciting part of the job was. It was the learning of the Monte Carlo method. The rest was the tedium of combining barely compatible clunky programming paradigms into a workable program.

Why does it matter? Because a thriving industry or a company must attract talent. And talent can’t be bought, at least not easily. There was once a study, which showed that, above a certain compensation level, talented people don’t care so much about salaries as they do about the novelty, excitement, and freedom. Google knows it very well: They create an exciting work environments (I call them day-care centers for programmers), and encourage their employees to spend 20% of their time pursuing their own projects. No wonder there is an underground pipeline from Microsoft to Google through which the talent keeps leaking out.

By the way, I worked for Microsoft back when it was exciting. Our salaries were rather mediocre, but we felt the urge to work long hours and weekends because we felt that our contributions mattered. Unlike today, sales and marketing were not driving the company, developers were.

To confuse matters even more for the executives, programmers are relatively cheap. The cooling bill for a data center dwarfs the cost of software development. Let’s face it, from a distance, a programmer might look just like another commodity, like a computer rack, air conditioner, or a router. This is even more pronounced in supercomputing, where a single rack might go for a million dollars–an equivalent of 10-20 programmer/years.

If you drain all the excitement from work, your company, or the whole industry, is bound to stagnate. Bored people don’t innovate. And we know from experience that, in high tech industries, if you don’t innovate, you die. Old programming paradigms might have worked for years, but new unmet challenges are piling up. A lot of work that required supercomputers in the past is now done on clusters of off-the-shelf components. Google owns one of the largest supercomputers in the world, and it’s all built from cheap commodity boxes. But Google lets its people innovate.

But not everything is bleak in the land of supercomputers. I have met two teams that were brimming with ideas and enthusiasm: one was Brad Chamberlain’s Cray Chapel team and the other was Hartmut Kaiser’s Louisiana State University Ste||ar team. I’m sure there were many others, but those were the ones I had the pleasure of meeting outside of the exhibition hall.

You can tell that a team is dedicated to a task if they can’t stop talking about their work even after a few beers. Young creative people are attracted like moths to interesting and challenging projects. I don’t think writing simulations using OpenMP and MPI, even if they run on Cray X-MP, can generate this kind of enthusiasm.

« Previous PageNext Page »