Multithreading



Ferrari museum in Maranello

Ferrari museum in Maranello

I was recently visiting the Ferrari museum in Maranello, Italy, where I saw this display of telemetry data from racing cars.

Telemetry data from a racing car. The contour of the racing track is shown in the upper left corner and various data channels are displayed below.

Telemetry data from a racing car. The racing track is displayed in the upper left corner and various data channels are displayed below.

The processing and the display of telemetry data is an interesting programming challenge. It has application in space exploration (as in, when you land a probe on a surface of a comet), medicine, and the military. The same techniques are used in financial systems where streams carry information about stock prices, commodity prices, and currency exchange rates.

It’s also a problem that lends itself particularly well to functional programming. If you are one of these shops working with telemetry, and you have to maintain legacy code written in imperative style, you might be interested in an alternative approach, especially if you are facing constant pressure to provide more sophisticated analysis tools and introduce concurrency to make the system faster and more responsive.

What all these applications have in common is that they deal with multiple channels generating streams of data. The data has to be either displayed in real time or stored for later analysis and processing. It’s pretty obvious to a functional programmer that channels are functors, and that they should be composed using combinators. In fact this observation can drive the whole architecture. The clincher is the issue of concurrency: retrofitting non-functional code to run in parallel is a lost battle — it’s almost easier to start from scratch. But treating channels as immutable entities makes concurrency almost an after-thought.

Everything is a Number

The most basic (and totally wrong) approach is to look at telemetry as streams of numbers. This is the assembly language of data processing. When everything is a number and you can apply your math any way you wish. The problem is that you are throwing away a lot of useful information. You want to use types as soon as possible to encode additional information and to prevent nonsensical operations like adding temperature to velocity.

In an engineering application, the least you can do is to keep track of units of measurement. You also want to distinguish between channels that produce floating-point numbers and ones that produce integers, or Booleans, or strings. This immediately tells you that a channel should be a polymorphic data structure. You should be able to stream any type of data, be it bytes, complex numbers, or vectors.

Everything is an Object

To an object-oriented mind it looks very much like a channel should be an object that is parameterized by the type of data it carries. And as an object it should have some methods. We need the get method to access the current value, and the next method to increment the position in the stream. As an imperative programmer you might also be tempted to provide a mutator, set. If you ever want your program to be concurrent, don’t even think about it!

If you’re a C++ programmer, you may overload some operators, and use * and ++ instead. That would make a channel look more like a forward iterator. But whatever you call it, a functional programmer will recognize it as a list, with the head and tail functionality.

Everything is a List

Let’s talk about lists, because there is a lot of misunderstanding around them. When people think of lists in imperative languages they think about storage. A list is probably the worst data type for storing data. Imperative programmers naturally assume that functional programmers, who use lists a lot, must be crazy. They are not! A Haskell list is almost never used for storing bulk data. A list is either an interface to data that is stored elsewhere, or a generator of data. Haskell is a lazy functional language, so what looks like a data structure is really a bunch of functions that provide data on demand.

That’s why I wouldn’t hesitate to implement channels as lists in Haskell. As an added bonus, lists can provide a pull interface to data that is being pushed. Reactive programs that process streams of data may be written as if all the data were already there — the event handler logic can be hidden inside the objects that generate the data. And this is just what’s needed for live telemetry data.

Obviously, functional programming is easier in Haskell than in C++, C#, or Java. But given how much legacy software there is, it could be a lost cause to ask management to (a) throw away existing code and start from scratch, (b) retrain the team to learn a new language, and (c) deal with completely new performance characteristics, e.g., lazy evaluation and garbage collection. So, realistically, the best we can do is to keep introducing functional methods into imperative languages, at least for the time being. It doesn’t mean that Haskell should’t play an important role in it. Over and over again I find myself prototyping solutions in Haskell before translating them into C++. The added effort pays back handsomely through faster prototyping, better code quality, and fewer bugs to chase. So I would highly recommend to every imperative programmer to spend, say, an hour a day learning and playing with Haskell. You’d be amazed how it helps in developing your programming skills.

Everything is a Functor

So, if you’re an object oriented programmer, you’ll probably implement a channel as something like this:

template <class T> Channel {
    virtual T get();
    virtual bool next();
};

and then get stuck. With this kind of interface, the rest of your program is bound to degenerate into a complex system of loops that extract data from streams and process them, possibly stuffing it back into other streams.

Instead, I propose to try the functional way. I will show you some prototype code in Haskell, but mostly explain how things work, so a non-Haskell programmer can gain some insight.

Here’s the definition of a polymorphic channel type, Chan:

data Chan a = Chan [a]

where a plays the role of a type variable, analogous to T in the C++ code above. The right hand side of the equal sign defines the constructor Chan that takes a list as an argument. Constructors are used both for constructing and for pattern matching. The notation [a] means a list of a.

The details don’t really matter, as long as you understand that the channel is implemented as a list. Also, I’m making things a little bit more explicit for didactic purposes. A Haskell programmer would implement the channel as a type alias, type, rather than a separate type.

Rule number one of dealing with lists is: try not to access their elements in a loop (or, using the functional equivalent of a loop — recursively). Operate on lists holistically. For instance, one of the most common operations on lists is to apply a function to every element. That means we want our Chan to be a functor.

A functor is a polymorphic data type that supports operating on its contents with a function. In the case of Chan that’s easy, since a list itself is a functor. I’ll be explicit here, again for didactic reasons. This is how you make Chan an instance of the Functor class by defining how to fmap a function f over it:

instance Functor Chan where
    fmap f (Chan xs) = Chan (map f xs)

Here, map is a library function that applies f to every element of the list. This is very much like applying C++ std::transform to a container, except that in Haskell everything is evaluated lazily, so you can apply fmap to an infinite list, or to a list that is not there yet because, for instance, it’s being generated in real time from incoming telemetry.

Everything is a Combinator

Let’s see how far we can get with this channel idea. The next step is to be able to combine multiple channels to generate streams of derived data. For instance, suppose that you have one channel from a pressure gauge, and another providing volume data, and you want to calculate instantaneous temperature using the ideal gas equation.

Let’s start with defining some types. We want separate types for quantities that are measured using different units. Once more, I’m being didactic here, because there are ready-made Haskell libraries that use so called phantom types to encode measurement units. Here I’ll do it naively:

data Pressure = Pascal Float
data Volume   = Meter3 Float
data Temp     = Kelvin Float

I’ll also define the ideal gas constant:

constR = 8.314472 -- J/(mol·K)

Here’s the function that calculates the temperature of ideal gas:

getT :: Float -> Pressure -> Volume -> Temp
getT n (Pascal p) (Meter3 v) = Kelvin (p * v / (n * constR))

The question is, how can we apply this function to the pressure and volume channels to get the temperature channel? We know how to apply a function to a single channel using fmap, but here we have to work with two channels. Fortunately, a channel is not just a functor — it’s an applicative functor. It defines the action of multi-argument functions on multiple channels. I’ll give you a Haskell implementation, but you should be able to do the same in C++ by overloading fmap or transform.

instance Applicative Chan where
    pure x = Chan (repeat x)
    (Chan fs) <*> (Chan xs) = Chan (zipWith ($) fs xs)

The Applicative class defines two functions. One is called pure, and it creates a constant channel from a value by repeatedly returning the same value. The other is a binary operator <*> that applies a channel of functions (yes, you can treat functions the same way you treat any other data) to a channel of values. The function zipWith applies, pairwise, functions to arguments using the function application operator ($).

Again, the details are not essential. The bottom line is that this allows us to apply our function getT to two channels (actually, three channels, since we also need to provide the amount of gas in moles — here I’m assuming 0.1 moles).

chT :: Chan Pressure -> Chan Volume -> Chan Temp
chT chP chV = getT <$> pure 0.1 <*> chP <*> chV

Such functions that combine channels into new channels are called combinators, and an applicative functor makes the creation of new combinators very easy.

The combinators are not limited to producing physical quantities. They may as well produce channels of alerts, channels of pixels for display, or channels of visual widgets. You can construct the whole architecture around channels. And since we’ve been only considering functional data structures, the resulting architecture can be easily subject to parallelization.

Moving Average

But don’t some computations require mutable state? For instance, don’t you need some kind of accumulators in order to calculate, let’s say, moving averages? Let’s see how this can be done functionally.

The idea is to keep a running sum of list elements within a fixed window of size n. When advancing through the list, we will add the new incoming element to the running sum and subtract the old outgoing element. The average is just this sum divided by n.

We can use the old trick of delaying the list by n positions. We’ll pad the beginning of the delayed list with n zeros. Here’s the Haskell code:

delay :: Num a => Int -> [a] -> [a]
delay n lst = replicate n 0 ++ lst

The first line is the (optional, but very useful) type signature. The second line defines the function delay that takes the delay counter n and the list. The function returns a list that is obtained by concatenating (operator ++) the zero-filled list (replicate n 0) in front of the original list. For instance, if you start with the list [1, 2, 3, 4] and delay it by 2, you’ll get [0, 0, 1, 2, 3, 4].

The next step is to create a stream of deltas — the differences between elements separated by n positions. We do it by zipping two lists: the original and the delayed one.

zip lst (delay n lst)

The function zip pairs elements from the first list with the elements from the second list.

Continuing with our example, the zipping will produce the pairs [(1, 0), (2, 0), (3, 1), (4, 2)]. Notice that the left number in each pair is the incoming element that is to be added to the running sum, while the right number is the outgoing one, to be subtracted from the running sum.

Now if we subtract the two numbers in each pair we’ll get exactly the delta that has to be added to the running sum at each step. We do the subtraction by mapping the operator (-) over the list. (To make the subtraction operator (-) operate on pairs we have to uncurry it. (If you don’t know what currying is, don’t worry.)

deltas :: Num a => Int -> [a] -> [a]
deltas n lst = map (uncurry (-)) (zip lst (delay n lst))

Continuing with the example, we will get [1, 2, 2, 2]. These are the amounts by which the running sum should change at every step. (Incidentally, for n equal to one, the deltas are proportional to the derivative of the sampled stream.)

Finally, we have to start accumulating the deltas. There is a library function scanl1 that can be used to produce a list of partial sums when called with the summation operator (+).

slidingSums :: Num a => Int -> [a] -> [a]
slidingSums n lst =  scanl1 (+) (deltas n lst)

At each step, scanl1 will add the delta to the previous running sum. The “1” in its name means that it will start with the first element of the list as the accumulator. The result, in our little example, is [1, 3, 5, 7]. What remains is to divide each sum by n and we’re done:

movingAverage :: Fractional a => Int -> [a] -> [a]
movingAverage n list = map (/ (fromIntegral n)) (slidingSums n list)

Since n is an integer, it has to be explicitly converted to a fractional number before being passed to the division operator. This is done using fromIntegral. The slightly cryptic notation (/ (fromIntegral n)) is called operator section. It just means “divide by n.”

As expected, the final result for the two-element running average of [1, 2, 3, 4] is [0.5, 1.5, 2.5, 3.5]. Notice that we haven’t used any mutable state to achieve this result, which makes this code automatically thread safe. Also, because the calculation is lazy, we can calculate the moving average of an infinite list as long as we only extract a finite number of data points. Here, we are printing the first 10 points of the 5-element moving average of the list of integers from 1 to infinity.

print (take 10 (movingAverage 5 [1..]))

The result is:

[0.2, 0.6, 1.2, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

Conclusion

The functional approach is applicable to designing software not only in the small but, more importantly, in the large. It captures the patterns of interaction between components and the ways they compose. The patterns I mentioned in this post, the functor and the applicative functor, are probably the most common, but functional programmers have at their disposal a large variety of patterns borrowed from various branches of mathematics. These patterns can be used by imperative programmers as well, resulting in cleaner and more maintainable software that is, by construction, multithread-ready.


Data races lead to undefined behavior; but how bad can they really be? In my previous post I talked about benign data races and I gave several examples taken from the Windows kernel. Those examples worked because the kernel was compiled with a specific compiler for a specific processor. But in general, if you want your code to be portable, you can’t have data races, period.

You just cannot reason about something that is specifically defined to be “undefined.” So, obviously, you cannot prove the correctness of a program that has data races. However, very few people engage in proofs of correctness. In most cases the argument goes, “I can’t see how this can fail, therefore it must be correct” (maybe not in so many words). I call this “proof by lack of imagination.” If you want to become a concurrency expert, you have to constantly stretch your imagination. So let’s do a few stretches.

One of the readers of my previous post, Duarte Nunes, posted a very interesting example of a benign data race. Here’s the code:

int owner = 0;
int count = 0;
std::mutex mtx;

bool TryEnter() {
    if (owner == std::this_thread::get_id()) {
        count += 1;
        return true;
    }

    if (mtx.try_lock()) {
        owner = std::this_thread::get_id();
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    mtx.unlock();
}

I highlighted in blue the parts that are executed under the lock (in a correct program, Exit will always be called after the lock has been acquired). Notice that the variable count is always accessed under the lock, so no data races there. However, the variable owner may be read outside of the lock– I highlighted that part of code in red. That’s the data race we are talking about.

Try to analyze this code and imagine what could go wrong. Notice that the compiler or the processor can’t be too malicious. The code still has to work correctly if the data race is removed, for instance if the racy read is put under the lock.

Here’s an attempt at the “proof” of correctness. First, Duarte observed that “The valid values for the owner variable are zero and the id of any thread in the process.” That sort of makes sense, doesn’t it? Now, the only way the racy read can have any effect is if the value of owner is equal to the current thread’s ID. But that’s exactly the value that could have been written only by the current thread– and under the lock.

There are two possibilities when reading owner: either we are still under that lock, in which case the read is not at all racy; or we have already released the lock. But the release of the lock happens only after the current thread zeroes owner.

Notice that this is a single-thread analysis and, within a single thread, events must be ordered (no need to discuss memory fences or acquire/release semantics at this point). A read following a write in the same thread cannot see the value that was there before the write. That would break regular single-threaded programs. Of course, other threads may have overwritten this zero with their own thread IDs, but never with the current thread ID. Or so the story goes…

Brace yourself: Here’s an example how a compiler or the processor may “optimize” the program:

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 42;
    owner = 0;
    mtx.unlock();
}

You might argue that this is insane and no compiler in its right mind would do a thing like this; but the truth is: It’s a legal program transformation. The effect of this modification is definitely not observable in the current thread. Neither is it observable by other threads in the absence of data races. Now, the unfortunate thread whose ID just happens to be 42 might observe this value and take the wrong turn. But it can only observe it through a racy read. For instance, it would never see this value if it read owner under the lock. Moreover, it would never see it if the variable owner were defined as ‘atomic’:

std::atomic<int> owner = 0

Stores and loads of atomic variables are, by default, sequentially consistent. Unfortunately sequential consistency, even on an x86, requires memory fences, which can be quite costly. It would definitely be an overkill in our example. So here’s the trick: Tell the compiler to forgo sequential consistency on a per read/write basis. For instance, here’s how you read an atomic variable without imposing any ordering constraints:

owner.load(memory_order_relaxed)

Such ‘relaxed’ operations will not introduce any memory fences– at least not on any processor I know about.

Here’s the version of the code that is exactly equivalent to the original, except that it’s correct and portable:

std::atomic<int> owner = 0;
int count = 0;
std::mutex mtx;

bool TryEnter() {
    if (owner.load(memory_order_relaxed) == std::this_thread::get_id()) {
        count += 1;
        return true;
    }

    if (mtx.try_lock()) {
        owner.store(std::this_thread::get_id(), memory_order_relaxed);
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner.store(0, memory_order_relaxed);
    mtx.unlock();
}

So what has changed? Can’t the compiler still do the same dirty trick, and momentarily store 42 in the owner variable? No, it can’t! Since the variable is declared ‘atomic,’ the compiler can no longer assume that the write can’t be observed by other threads.

The new version has no data races: The Standard specifically states that ‘atomic’ variables don’t contribute to data races, even in their most relaxed form.

C++ Standard, (1.10.5):
[…] “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

With those changes, I believe that our “proof” of correctness may actually be turned into a more rigorous proof using the axioms of the C++ memory model (although I’m not prepared to present one). We can have our cake (correct, portable code) and eat it too (no loss of performance). And, by the way, the same trick may be used in the case of lossy counters from my previous post.

Warning: I do not recommend this style of coding, or the use of weak atomics, to anybody who is not implementing operating system kernels or concurrency libraries.

Acknowledgments

I’d like to thank Luis Ceze, Hans Boehm, and Anthony Williams for useful remarks and for verifying my assumptions about the C++ memory model.

Bibliography

  1. C++ Draft Standard

Can a data race not be a bug? In the strictest sense I would say it’s always a bug. A correct program written in a high-level language should run the same way on every processor present, past, and future. But there is no proscription, or even a convention, about what a processor should (or shouldn’t) do when it encounters a race. This is usually described in higher-level language specs by the ominous phrase: “undefined behavior.” A data race could legitimately reprogram your BIOS, wipe out your disk, and stop the processor’s fan causing a multi-core meltdown.

Data race: Multiple threads accessing the same memory location without intervening synchronization, with at least one thread performing a write.

However, if your program is only designed to run on a particular family of processors, say the x86, you might allow certain types of data races for the sake of performance. And as your program matures, i.e., goes through many cycles of testing and debugging, the proportion of buggy races to benign races keeps decreasing. This becomes a real problem if you are using a data-race detection tool that cannot distinguish between the two. You get swamped by false positives.

Microsoft Research encountered and dealt with this problem when running their race detector called DataCollider on the Windows kernel (see Bibliography). Their program found 25 actual bugs, and almost an order of magnitude more benign data races. I’ll summarize their methodology and discuss their findings about benign data races.

Data Races in the Windows Kernel

The idea of the program is very simple. Put a hardware breakpoint on a shared memory access and wait for one of the threads to stumble upon it. This is a code breakpoint, which is triggered when the particular code location is executed. The x86 also supports another kind of a breakpoint, which is called a data breakpoint, triggered when the program accesses a specific memory location. So when a thread hits the code breakpoint, DataCollider installs a data breakpoint at the location the thread was just accessing. It then stalls the current thread and lets all other threads run. If any one of them hits the data breakpoint, it’s a race (as long as one of the accesses is a write). Consider this: If there was any synchronization (say, a lock acquisition was attempted) between the two accesses, the second thread would have been blocked from accessing that location. Since it wasn’t, it’s a classic data race.

Notice that this method might not catch all data races, but it doesn’t produce false positives. Except, of course, when the race is considered benign.

There are other interesting details of the algorithm. One is the choice of code locations for installing breakpoints. DataCollider first analyzes the program’s assembly code to create a pool of memory accesses. It discards all thread-local accesses and explicitly synchronized instructions (for instance, the ones with the LOCK prefix). It then randomly picks locations for breakpoints from this pool. Notice that rarely executed paths are as likely to be sampled as the frequently executed ones. This is important because data races, like fugitive criminals, often hide in less frequented places.

Pruning Benign Races

90% of data races caught by DataCollider in the Windows kernel were benign. For several reasons it’s hard to say how general this result is. First, the kernel had already been tested and debugged for some time, so many low-hanging concurrency bugs have been picked. Second, operating system kernels are highly optimized for a particular processor and might use all kinds of tricks to improve performance. Finally, kernels often use unusual synchronization strategies. Still, it’s interesting to see what shape benign data races take.

It turns out that half of all false positives came from lossy counters. There are many places where statistics are gathered: counting certain kinds of events, either for reporting or for performance enhancements. In those situations losing a few increments is of no relevance. However not all counters are lossy and, for instance, a data race in reference counting is a serious bug. DataCollider uses simple heuristic to detect lossy counters–they are the ones that are always incremented. A reference counter, on the other hand, is as often incremented as decremented.

Another benign race happens when one thread reads a particular bit in a bitfield while another thread updates another bit. A bit update is a read-modify-write (RMW) sequence: The thread reads the previous value of the bitfield, modifies one bit, and writes the whole bitfield back. Other bits are overwritten in the process too, but their new values are the same as the old values. A read from another thread of any of the the non-changed bits does not interfere with the write, at least not on the x86. Of course if yet another thread modified one of those bits, it would be a real bug, and it would be caught separately. The pruning of this type of race requires analysis of surrounding code (looking for the masking of other bits).

Windows kernel also has some special variables that are racy by design–current time is one such example. DataCollider has these locations hard-coded and automatically prunes them away.

There are benign races that are hard to prune automatically, and those are left for manual pruning (in fact, DataCollider reports all races, it just de-emphasizes the ones it considers benign). One of them is the double-checked locking pattern (DCLP), where a thread makes a non-synchronized read to be later re-confirmed under the lock. This pattern happens to work on the x86, although it definitely isn’t portable.

Finally, there is the interesting case of idempotent writes— two racing writes that happen to write the same value to the same location. Even though such scenarios are easy to prune, the implementers of DataCollider decided not to prune them because more often than not they led to the uncovering of concurrency bugs. Below is a table that summarizes various cases.

Benign race Differentiate from Pruned?
Lossy counter Reference counting Yes
Read and write of different bits Read and write of the whole word Yes
Deliberately racy variables Yes
DCLP No
Idempotent writes No

Conclusion

In the ideal world there would be no data races. But a concurrency bug detector must take into account the existence of benign data races. In the early stages of product testing the majority of detected races are real bugs. It’s only when chasing the most elusive of concurrency bugs that it becomes important to weed out benign races. But it’s the elusive ones that bite the hardest.

Bibliography

  1. John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk, Effective Data-Race Detection for the Kernel

C++ is like an oil tanker — it takes a long time for it to change course. The turbulent reefs towards which C++ has been heading were spotted on the horizon more than ten years ago. I’m talking, of course, about the end of smooth sailing under the Moore’s law and the arrival of the Multicore. It took six years to acknowledge the existence of concurrency in the C++11 Standard, but that’s only the beginning. It’s becoming more and more obvious that a major paradigm shift is needed if C++ is to remain relevant in the new era.

Why do we need a new paradigm to deal with concurrency? Can’t we use object oriented programming with small modifications? The answer to this question goes to the heart of programming: it’s about composability. We humans solve complex problems by splitting them into smaller subproblems. This is a recursive process, we split subproblems into still smaller pieces, and so on. Eventually we reach the size of the problem which can be easily translated into computer code. We then have to compose all these partial solutions into larger programs.

The key to composability is being able to hide complexity at each level. This is why object oriented programming has been so successful. When you’re implementing an object, you have to deal with its internals, with state transitions, intermediate states, etc. But once the object is implemented, all you see is the interface. The interface must be simpler than the implementation for object oriented programming to make sense. You compose larger objects from smaller objects based on their interfaces, not the details of their implementation. That’s how object oriented programming solves the problem of complexity.

Unfortunately, objects don’t compose in the presence of concurrency. They hide the wrong kind of things. They hide sharing and mutation. Let me quote the definition of data race: Two or more threads accessing the same piece of memory at the same time, at least one of them writing. In other words: Sharing + Mutation = Data Race. Nothing in the object’s interface informs you about the possibility of sharing and mutation inside the object’s implementation. Each object in isolation may be data-race-free but their composition may inadvertently introduce data races. And you won’t know about it unless you study the details of their implementation down to every single memory access.

In Java, an attempt had been made to mollify this problem: Every object is equipped with a mutex that can be invoked by declaring the method synchronized. This is not a scalable solution. Even Java’s clever thin lock implementation incurs non-negligible performance overhead, so it is used only when the programmer is well aware of potential races, which requires deep insight into the implementation of all subobjects, exactly the thing we are trying to avoid.

More importantly, locking itself doesn’t compose. There’s a classic example of a locked bank account whose deposit and withdraw methods are synchronized by a lock. The problem occurs when one tries to transfer money from one account to another. Without exposing the locks, it’s impossible to avoid a transient state in which the funds have already left one account but haven’t reached the second. With locks exposed, one may try to hold both locks during the transfer, but that creates a real potential for deadlocks. (Software Transactional Memory provides a composable solution to this problem, but there are no practical implementations of STM outside of Haskell and Clojure.)

Moreover, if we are interested in taking advantage of multicores to improve performance, the use of locks is a non-starter. Eking out parallel performance is hard enough without locks, given all the overheads of thread management and the Amdahl’s law. Parallelism requires a drastically different approach.

Since the central problem of concurrency is the conflict between sharing and mutation, the solution is to control these two aspects of programming. We can do mutation to our heart’s content as long as there’s no sharing. For instance, we can mutate local variables; or we can ensure unique ownership by making deep copies, using move semantics, or by employing unique_ptrs. Unique ownership plays very important role in message passing, allowing large amounts of data to be passed cheaply between threads.

However, the key to multicore programming is controlling mutation. This is why functional languages have been steadily gaining ground in concurrency and parallelism. In a nutshell, functional programmers have found a way to program using what, to all intents and purposes, looks like immutable data. An imperative programmer, when faced with immutability, is as confused as a barbecue cook in a vegetarian kitchen. And the truth is that virtually all data structures from the C++ standard library are unsuitable for this kind of programming — the standard vector being the worst offender. A continuous slab of memory is perfect for random or sequential access, but the moment mutation is involved, you can’t share it between threads. Of course, you can use a mutex to lock the whole vector every time you access it, but as I explained already, you can forget about performance and composability of such a solution.

The trick with functional data structures is that they appear immutable, and therefore require no synchronization when accessed from multiple threads. Mutation is replaced by construction: you construct a new object that’s a clone of the source object but with the requested modification in place. Obviously, if you tried to do this with a vector, you’d end up with a lot of copying. But functional data structures are designed for maximum sharing of representation. So a clone of a functional object will share most of its data with the original, and only record a small delta. The sharing is totally transparent since the originals are guaranteed to be immutable.

A singly-linked list is a classical, if not somewhat trivial, example of such a data structure. Adding an element to the front of a list requires only the creation of a single node to store the new value and a pointer to the original (immutable) list. There are also many tree-like data structures that are logarithmically cheap to clone-mutate (red-black trees, leftist heaps). Parallel algorithms are easy to implement with functional data structures, since the programmer doesn’t have to worry about synchronization.

Functional data structures, also known as “persistent” data structures, are naturally composable. This follows from the composability of immutable data — you can build larger immutable objects from smaller immutable objects. But there’s more to it: This new way of mutating by construction also composes well. A composite persistent object can be clone-mutated by clone-mutating only the objects on the path to the mutation; everything else can be safely shared.

Concurrency also introduces nonstandard flows of control. In general, things don’t progress sequentially. Programmers have to deal with inversion of control, jumping from handler to handler, keeping track of shared mutable state, etc. Again, in functional programming this is nothing unusual. Functions are first class citizens and they can be composed in many ways. A handler is nothing but a continuation in the continuation passing style. Continuations do compose, albeit in ways that are not familiar to imperative programmers. Functional programmers have a powerful compositional tool called a monad that, among other things, can linearize inverted flow of control. The design of libraries for concurrent programming makes much more sense once you understand that.

A paradigm shift towards functional programming is unavoidable and I’m glad to report that there’s a growing awareness of that new trend among C++ programmers. I used to be the odd guy talking about Haskell and monads at C++ meetings and conferences. This is no longer so. There was a sea change at this year’s C++Now. The cool kids were all talking about functional programming, and the presentation “Functional Data Structures in C++” earned me the most inspiring session award. I take it as a sign that the C++ community is ready for a big change.


Lazy evaluation can be a powerful tool for structuring your code. For instance, it can let you turn your code inside out, inverting the flow of control. Many a Haskell program take advantage of laziness to express algorithms in clear succinct terms, turning them from recipes to declarations.

The question for today’s blog post is: How can we tap the power of lazy evaluation in an inherently eager language like C++? I’ll lead you through a simple coding example and gradually introduce the building blocks of lazy programming: the suspension, the lazy stream, and a whole slew of functional algorithms that let you operate on them. In the process we’ll discover some fundamental functional patterns like functors, monads, and monoids. I have discussed them already in my post about C++ futures. It’s very edifying to see them emerge in a completely different context.

The Problem

Let’s write a program that prints the first n Pythagorean triples. A Pythagorean triple consists of three integers, x, y, and z, that satisfy the relation x2 + y2 = z2. Let’s not be fancy and just go with the brute force approach. Here’s the program in C:

void printNTriples(int n)
{
    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("%d, %d, %d\n", x, y, z);
                    if (++i == n)
                        return;
                }
}

Here, a single C function serves three distinct purposes: It

  1. Generates Pythagorean triples,
  2. Prints them,
  3. Counts them; and when the count reaches n, breaks.

This is fine, as long as you don’t have to modify or reuse this code. But what if, for instance, instead of printing, you wanted to draw the triples as triangles? Or if you wanted to stop as soon as one of the numbers reached 100? The problem with this code is that it’s structured inside out: both the test and the sink for data are embedded in the innermost loop of the algorithm. A more natural and flexible approach would be to:

  1. Generate the list of Pythagorean triples,
  2. Take the first ten of them, and
  3. Print them.

And that’s exactly how you’d write this program in Haskell:

main = print (take 10 triples)

triples = [(x, y, z) | z <- [1..]
                     , x <- [1..z]
                     , y <- [x..z]
                     , x^2 + y^2 == z^2]

This program reads: take 10 triples and print them. It declares triples as a list (square brackets mean a list) of triples (x, y, z), where (the vertical bar reads “where”) z is an element of the list of integers from 1 to infinity, x is from 1 to z, y is from x to z, and the sum of squares of x and y is equal to the square of z. This notation is called “list comprehension” and is characteristic of Haskell terseness.

You see the difference? Haskell let’s you abstract the notion of the list of Pythagorean triples so you can operate on it as one entity, whereas in C (or, for that matter, in C++) we were not able to disentangle the different, orthogonal, aspects of the program.

The key advantage of Haskell in this case is its ability to deal with infinite lists. And this ability comes from Haskell’s inherent laziness. Things are never evaluated in Haskell until they are absolutely needed. In the program above, it was the call to print that forced Haskell to actually do some work: take 10 elements from the list of triples. Since the triples weren’t there yet, it had to calculate them, but only as many as were requested and not a number more.

Suspension

We’ll start with the most basic building block of laziness: a suspended function. Here’s the first naive attempt:

template<class T>
class Susp {
public:
    explicit Susp(std::function<T()> f)
        : _f(f)
    {}
    T get() { return _f(); }
private:
    std::function<T()> _f;
};

We often create suspensions using lambda functions, as in:

int x = 2;
int y = 3;
Susp<int> sum([x, y]() { return x + y; });
...
int z = sum.get();

Notice that the suspended lambda may capture variables from its environment: here x and y. A lambda, and therefore a suspension, is a closure.

The trouble with this implementation is that the function is re-executed every time we call get. There are several problems with that: If the function is not pure, we may get different values each time; if the function has side effects, these may happen multiple times; and if the function is expensive, the performance will suffer. All these problems may be addressed by memoizing the value.

Here’s the idea: The first time the client calls get we should execute the function and store the returned value in a member variable. Subsequent calls should go directly to that variable. We could implement this by setting a Boolean flag on the first call and then checking it on every subsequent call, but there’s a better implementation that uses thunks.

A thunk is a pointer to a free function taking a suspension (the this pointer) and returning a value (by const reference). The get method simply calls this thunk, passing it the this pointer.

Initially, the thunk is set to thunkForce, which calls the method setMemo. This method evaluates the function, stores the result in _memo, switches the thunk to thunkGet, and returns the memoized value. On subsequent calls get goes through the getMemo thunk which simply returns the memoized value.

template<class T>
class Susp
{
    // thunk
    static T const & thunkForce(Susp * susp) {
        return susp->setMemo();
    }
    // thunk
    static T const & thunkGet(Susp * susp) {
        return susp->getMemo();
    }
    T const & getMemo() {
        return _memo;
    }
    T const & setMemo() {
        _memo = _f();
        _thunk = &thunkGet;
        return getMemo();
    }
public:
    explicit Susp(std::function<T()> f)
        : _f(f), _thunk(&thunkForce), _memo(T())
    {}
    T const & get() {
        return _thunk(this);
    }
private:
    T const & (*_thunk)(Susp *);
    mutable T   _memo;

    std::function<T()> _f;
};

(By the way, the function pointer declaration of _thunk looks pretty scary in C++, doesn’t it?)

[Edit: I decided to remove the discussion of the thread safe implementation since it wasn’t ready for publication. The current implementation is not thread safe.]

You can find a lot more detail about the Haskell implementation of suspended functions in the paper by Tim Harris, Simon Marlow, and Simon Peyton Jones, Haskell on a Shared-Memory Multiprocessor.

Lazy Stream

The loop we used to produce Pythagorean triples in C worked on the push principle — data was pushed towards the sink. If we want to deal with infinite lists, we have to use the pull principle. It should be up to the client to control the flow of data. That’s the inversion of control I was talking about in the introduction.

We’ll use a lazy list and call it a stream. In C++ a similar idea is sometimes expressed in terms of input and forward iterators, although it is understood that an iterator itself is not the source or the owner of data — just an interface to one. So we’ll stick with the idea of a stream.

We’ll implement the stream in the functional style as a persistent data structure fashioned after persistent lists (see my series of blog post on persistent data structures). It means that a stream, once constructed, is never modified. To “advance” the stream, we’ll have to create a new one by calling the const method pop_front.

Let’s start with the definition: A stream is either empty or it contains a suspended cell. This immediately suggests the implementation as a (possibly null) pointer to a cell. Since the whole stream is immutable, the cell will be immutable too, so it’s perfectly safe to share it between copies of the stream. We can therefore use a shared pointer:

template<class T>
class Stream
{
private:
    std::shared_ptr <Susp<Cell<T>>> _lazyCell;
};

Of course, because of reference counting and memoization, the stream is only conceptually immutable and, in the current implementation, not thread safe.

So what’s in the Cell? Remember, we want to be able to generate infinite sequences, so Stream must contain the DNA for not only producing the value of type T but also for producing the offspring — another (lazy) Stream of values. The Cell is just that: A value and a stream.

template<class T>
class Cell
{
public:
    Cell() {} // need default constructor for memoization
    Cell(T v, Stream<T> const & tail)
        : _v(v), _tail(tail)
    {}
    explicit Cell(T v) : _v(v) {}
    T val() const {
        return _v;
    }
    Stream<T> pop_front() const {
        return _tail;
    }
private:
    T _v;
    Stream<T> _tail;
};

This mutually recursive pair of data structures works together amazingly well.

template<class T>
class Stream
{
private:
    std::shared_ptr <Susp<Cell<T>>> _lazyCell;
public:
    Stream() {}
    Stream(std::function<Cell<T>()> f)
        : _lazyCell(std::make_shared<Susp<Cell<T>>>(f))
    {}
    Stream(Stream && stm)
        : _lazyCell(std::move(stm._lazyCell))
    {}
    Stream & operator=(Stream && stm)
    {
        _lazyCell = std::move(stm._lazyCell);
        return *this;
    }
    bool isEmpty() const
    {
        return !_lazyCell;
    }
    T get() const
    {
        return _lazyCell->get().val();
    }
    Stream<T> pop_front() const
    {
        return _lazyCell->get().pop_front();
    }
};

There are several things worth pointing out. The two constructors follow our formal definition of the Stream: one constructs an empty stream, the other constructs a suspended Cell. A suspension is created from a function returning Cell.

I also added a move constructor and a move assignment operator for efficiency. We’ll see it used in the implementation of forEach.

The magic happens when we call get for the first time. That’s when the suspended Cell comes to life. The value and the new stream are produced and memoized for later use. Or, this may happen if the first call is to pop_front. Notice that pop_front is a const method — the Stream itself is immutable. The method returns a new stream that encapsulates the rest of the sequence.

Let’s get our feet wet by constructing a stream of integers from n to infinity. The constructor of a Stream takes a function that returns a Cell. We’ll use a lambda that captures the value of n. It creates a Cell with that value and a tail, which it obtains by calling intsFrom with n+1:

Stream<int> intsFrom(int n)
{
    return Stream<int>([n]()
    {
        return Cell<int>(n, intsFrom(n + 1)); 
    });
}

It’s a recursive definition, but without the usual recursive function calls that eat up the stack. The call to the inner intsFrom is not made from the outer intsFrom. Instead it’s made the first time get is called on the emerging Stream.

Of course, we can also create finite streams, like this one, which produces integers from n to m:

Stream<int> ints(int n, int m)
{
    if (n > m)
        return Stream<int>();
    return Stream<int>([n, m]()
    {
        return Cell<int>(n, ints(n + 1, m));
    });
}

The trick is to capture the limit m as well as the recursion variable n. When the limit is reached, we simply return an empty Stream.

We’ll also need the method take, which creates a Stream containing the first n elements of the original stream:

Stream take(int n) const {
    if (n == 0 || isEmpty())
        return Stream();
    auto cell = _lazyCell;
    return Stream([cell, n]()
    {
        auto v = cell->get().val();
        auto t = cell->get().pop_front();
        return Cell<T>(v, t.take(n - 1));
    });
}

Here we are capturing the suspended cell and use it to lazily generate the elements of the new, truncated, Stream. Again, the key to understanding why this works is to keep in mind that Streams and Cells are conceptually immutable, and therefore can be shared by the implementation. This has some interesting side effects, which don’t influence the results, but change the performance. For instance, if the caller of take forces the evaluation of the first n elements — e.g., by passing them through the consuming forEach below — these elements will appear miraculously memoized in the original Stream.

Finally, we’ll need some way to iterate through streams. Here’s an implementation of forEach that consumes the stream while enumerating it and feeding its elements to a function.

template<class T, class F>
void forEach(Stream<T> strm, F f)
{
    while (!strm.isEmpty())
    {
        f(strm.get());
        strm = strm.pop_front();
    }
}

It’s the assignment:

strm = strm.pop_front();

which consumes the stream by decreasing the reference count of the head of the Stream. In particular, if you pass an rvalue Stream to forEach, its elements will be generated and deleted in lockstep. The algorithm will use constant memory, independent of the virtual length of the Stream. What Haskell accomplishes with garbage collection, we approximate in C++ with reference counting and shared_ptr.

Working with Streams

It’s not immediately obvious how to translate our Pythagorean triple program from nested loops to lazy streams, so we’ll have to take inspiration from the corresponding Haskell program. Let me first rewrite it using a slightly different notation:

triples = do
    z <- [1..]
    x <- [1..z]
    y <- [x..z]
    guard (x^2 + y^2 == z^2)
    return (x, y, z)

The general idea is this: Start with the stream of integers from 1 to infinity. For every such integer — call it z — create a stream from 1 to z. For each of those — call them x — create a stream from x to z. Filter out those which don’t satisfy the Pythagorean constraint. Finally, output a stream of tuples (x, y, z).

So far we’ve learned how to create a stream of integers — we have the function intsFrom. But now we’ll have to do something for each of these integers. We can’t just enumerate those integers and apply a function to each, because that would take us eternity. So we need a way to act on each element of a stream lazily.

In functional programming this is called mapping a function over a list. In general, a parameterized data structure that can be mapped over is called a functor. I’m going to show you that our Stream is a functor.

Stream as a Functor

The idea is simple: we want to apply a function to each element of a stream to get a new transformed stream (it’s very similar to the std::transform algorithm from STL). The catch is: We want to do it generically and lazily.

To make the algorithm — we’ll call it fmap — generic, we have to parameterize it over types. The algorithm starts with a Stream of elements of type T and a function from T to some other type U. The result should be a stream of U.

We don’t want to make U the template argument, because then the client would have to specify it explicitly. We want the compiler to deduce this type from the type of the function. We want, therefore, the function type F to be the parameter of our template (this will also allow us to call it uniformly with function pointers, function objects, and lambdas):

template<class T, class F>
auto fmap(Stream<T> stm, F f)

Without the use of concepts, we have no way of enforcing, or even specifying, that F be a type of a function from T to U. The best we can do is to statically assert it inside the function:

static_assert(std::is_convertible<F, std::function<U(T)>>::value,
        "fmap requires a function type U(T)");

But what is U? We can get at it using decltype:

decltype(f(stm.get()));

Notice that decltype takes, as an argument, an expression that can be statically typed. Here, the expression is a function call of f. We also need a dummy argument for this function: we use the result of stm.get(). The argument to decltype is never evaluated, but it is type-checked at compile time.

One final problem is how to specify the return type of fmap. It’s supposed to be Stream<U>, but we don’t know U until we apply decltype to the arguments of fmap. We have to use the new auto function declaration syntax of C++11. So here are all the type-related preliminaries:

template<class T, class F>
auto fmap(Stream<T> stm, F f)->Stream<decltype(f(stm.get()))>
{
    using U = decltype(f(stm.get()));
    static_assert(std::is_convertible<F, std::function<U(T)>>::value,
        "fmap requires a function type U(T)");
    ...
}

Compared to that, the actual implementation of fmap seems rather straightforward:

    if (stm.isEmpty()) return Stream<U>();
    return Stream<U>([stm, f]()
    {
        return Cell<U>(f(stm.get()), fmap(stm.pop_front(), f));
    });

In words: If the stream is empty, we’re done — return an empty stream. Otherwise, create a new stream by suspending a lambda function. That function captures the original stream (by value) and the function f, and returns a Cell. That cell contains the value of f acting on the first element of the original stream, and a tail. The tail is created with fmap acting on the rest of the original stream.

Equipped with fmap, we can now attempt to take the first step towards generating our triples: apply the function ints(1, z) to each element of the stream intsFrom(1):

fmap(intsFrom(1), [](int z)
{
    return ints(1, z);
});

The result is a Stream of Streams of integers of the shape:

1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
...

But now we are stuck. We’d like to apply ints(x, z) to each element of that sequence, but we don’t know how to get through two levels of Stream. Our fmap can only get through one layer. We need a way to flatten a Stream of Streams. That functionality is part of what functional programmers call a monad. So let me show you that Stream is indeed a monad.

Stream as a Monad

If you think of a Stream as a list, the flattening of a list of lists is just concatenation. Suppose for a moment that we know how to lazily concatenate two Streams (we’ll get to it later) and let’s implement a function mjoin that concatenates a whole Stream of Streams.

You might have noticed a pattern in the implementation of lazy functions on streams. We use some kind of recursion, which starts with “Are we done yet?” If not, we do an operation that involves one element of the stream and the result of a recursive call to the function itself.

The “Are we done yet?” question usually involves testing for an empty stream. But here we are dealing with a Stream of Streams, so we have to test two levels deep. This way we’ll ensure that the concatenation of a Stream of empty Streams immediately returns an empty Stream.

The recursive step in mjoin creates a Cell whose element is the head of the first stream, and whose tail is the concatenation of the tail of the first stream and the result of mjoin of the rest of the streams:

template<class T>
Stream<T> mjoin(Stream<Stream<T>> stm)
{
    while (!stm.isEmpty() && stm.get().isEmpty())
    {
        stm = stm.pop_front();
    }
    if (stm.isEmpty()) return Stream<T>();
    return Stream<T>([stm]()
    {
        Stream<T> hd = stm.get();
        return Cell<T>( hd.get()
                      , concat(hd.pop_front(), mjoin(stm.pop_front())));
    });
}

The combination of fmap and mjoin lets us compose function like intsFrom or ints that return Streams. In fact, this combination is so common that it deserves its own function, which we’ll call mbind:

template<class T, class F>
auto mbind(Stream<T> stm, F f) -> decltype(f(stm.get()))
{
    return mjoin(fmap(stm, f));
}

If we use mbind in place of fmap:

mbind(intsFrom(1), [](int z)
{
    return ints(1, z);
});

we can produce a flattened list:

1 1 2 1 2 3 1 2 3 4...

But it’s not just the list: Each element of the list comes with variables that are defined in its environment — here the variable z. We can keep chaining calls to mbind and capture more variables in the process:

mbind(intsFrom(1), [](int z)
{
    return mbind(ints(1, z), [z](int x)
    {
        return mbind(ints(x, z), [x, z](int y)
        {
            ...
        }
    }
}

At this point we have captured the triples x, y, z, and are ready for the Pythagorean testing. But before we do it, let’s define two additional functions that we’ll use later.

The first one is mthen which is a version of mbind that takes a function of no arguments. The idea is that such a function will be executed for each element of the stream, but it won’t use the value of that element. The important thing is that the function will not be executed when the input stream is empty. In that case, mthen will return an empty stream.

We implement mthen using a slightly modified version of fmap that takes a function f of no arguments:

template<class T, class F>
auto fmapv(Stream<T> stm, F f)->Stream<decltype(f())>
{
    using U = decltype(f());
    static_assert(std::is_convertible<F, std::function<U()>>::value,
        "fmapv requires a function type U()");

    if (stm.isEmpty()) return Stream<U>();
    return Stream<U>([stm, f]()
    {
        return Cell<U>(f(), fmapv(stm.pop_front(), f));
    });
}

We plug it into the definition of mthen the same way fmap was used in mbind:

template<class T, class F>
auto mthen(Stream<T> stm, F f) -> decltype(f())
{
    return mjoin(fmapv(stm, f));
}

The second useful function is mreturn, which simply turns a value of any type into a one-element Stream:

template<class T>
Stream<T> mreturn(T v)
{
    return Stream<T>([v]() {
        return Cell<T>(v);
    });
}

We’ll need mreturn to turn our triples into Streams.

It so happens that a parameterized type equipped with mbind and mreturn is called a monad (it must also satisfy some additional monadic laws, which I won’t talk about here). Our lazy Stream is indeed a monad.

Stream as a Monoid and a Monad Plus

When implementing mjoin we used the function concat to lazily concatenate two Streams. Its implementation follows the same recursive pattern we’ve seen so many times:

template<class T>
Stream<T> concat( Stream<T> lft
                , Stream<T> rgt)
{
    if (lft.isEmpty())
        return rgt;
    return Stream<T>([=]()
    {
        return Cell<T>(lft.get(), concat<T>(lft.pop_front(), rgt));
    });
}

What’s interesting is that the concatenation of streams puts them under yet another well known functional pattern: a monoid. A monoid is equipped with a binary operation, just like concat, which must be associative and possess a unit element. It’s easy to convince yourself that concatenation of Streams is indeed associative, and that the neutral element is an empty Stream. Concatenating an empty Stream, whether in front or in the back of any other Stream, doesn’t change the original Stream.

What’s even more interesting is that being a combination of a monoid and a monad makes Stream into a monad plus, and every monad plus defines a guard function — exactly what we need for the filtering of our triples. This function takes a Boolean argument and outputs a Stream. If the Boolean is false, the Stream is empty (the unit element of monad plus!), otherwise it’s a singleton Stream. We really don’t care what value sits in this Stream — we never use the result of guard for anything but the flow of control. In Haskell, there is a special “unit” value () — here I use a nullptr as its closest C++ analog.

Stream<void*> guard(bool b)
{
    if (b) return Stream<void*>(nullptr);
    else return Stream<void*>();
}

We can now pipe the result of guard into mthen, which will ignore the content of the Stream but won’t fire when the Stream is empty. When the Stream is not empty, we will call mreturn to output a singleton Stream with the result tuple:

Stream<std::tuple<int, int, int>> triples()
{
    return mbind(intsFrom(1), [](int z)
    {
        return mbind(ints(1, z), [z](int x)
        {
            return mbind(ints(x, z), [x, z](int y)
            {
                return mthen(guard(x*x + y*y == z*z), [x, y, z]()
                {
                    return mreturn(std::make_tuple(x, y, z));
                });
            });
        });
    });
}

These singletons will then be concatenated by the three levels of mbind to create one continuous lazy Stream of Pythagorean triples.

Compare this function with its Haskell counterpart:

triples = do
    z <- [1..]
    x <- [1..z]
    y <- [x..z]
    guard (x^2 + y^2 == z^2)
    return (x, y, z)

Now, the client can take 10 of those triples from the Stream — and the triples still won’t be evaluated!. It’s the consuming forEach that finally forces the evaluation:

void test()
{
    auto strm = triples().take(10);
    forEach(std::move(strm), [](std::tuple<int, int, int> const & t)
    {
        std::cout << std::get<0>(t) << ", " 
                  << std::get<1>(t) << ", " 
                  << std::get<2>(t) << std::endl;
    });
}

Conclusion

The generation of Pythagorean triples is a toy example, but it shows how lazy evaluation can be used to restructure code in order to make it more reusable. You can use the same function triples to print the values in one part of your program and draw triangles in another. You can filter the triples or impose different termination conditions. You can use the same trick to generate an infinite set of approximation to the solution of a numerical problem, and then use different strategies to truncate it. Or you can create an infinite set of animation frames, and so on.

The building blocks of laziness are also reusable. I have used them to implement the solution to the eight-queen problem and a conference scheduling program. Once they made thread safe, the combinators that bind them are thread safe too. This is, in general, the property of persistent data structures.

You might be concerned about the performance of lazy data structures, and rightly so. They use the heap heavily, so memory allocation and deallocation is a serious performance bottleneck. There are many situation, though, where code structure, reusability, maintenance, and correctness (especially in multithreaded code) are more important than performance. And there are some problems that might be extremely hard to express without the additional flexibility gained from laziness.

I made the sources to all code in this post available on GitHub.


[If you prefer, you may watch the video of my talk on this topic (here are the slides).]

If you thought you were safe from functional programming in your cozy C++ niche, think again! First the lambdas and function objects and now the monad camouflaged as std::future. But do not despair, it’s all just patterns. You won’t find them in the Gang of Four book, but once you see them, they will become obvious.

Let me give you some background: I was very disappointed with the design of C++11 std::future. I described my misgivings in: Broken Promises — C++0x futures. I also made a few suggestions as how to fix it: Futures Done Right. Five years went by and, lo and behold, a proposal to improve std::future and related API, N3721, was presented to the Standards Committee for discussion. I thought it would be a no brainer, since the proposal was fixing obvious holes in the original design. A week ago I attended the meetings of the C++ Standards Committee in Issaquah — since it was within driving distance from me — and was I in for a surprise! Apparently some design patterns that form the foundation of functional programming are not obvious to everybody. So now I find myself on the other side of the discussion and will try to explain why the improved design of std::future is right.

Design arguments are not easy. You can’t mathematically prove that one design is better than another, or a certain set of abstractions is better than another — unless you discover some obvious design flaws in one of them. You might have a gut feeling that a particular solution is elegant, but how do you argue about elegance?

Thankfully, when designing a library, there are some well known and accepted criteria. The most important ones, in my mind, are orthogonality, a.k.a., separation of concerns, and composability. It also helps if the solution has been previously implemented and tested, especially in more than one language. I will argue that this is indeed the case with the extended std::future design. In the process, I will describe some programming patterns that might be new to C++ programmers but have been tried and tested in functional languages. They tend to pop up more and more in imperative languages, especially in connection with concurrency and parallelism.

The Problem

In a nutshell, the problem that std::future is trying to solve is that of returning the result of a computation that’s being performed in parallel, or returning the result of an asynchronous call. For instance, you start a computation in a separate thread (or a more general execution agent) and you want to, at some point in time, get back the result of that computation. This is one of the simplest models of concurrency: delegating the execution of a function (a closure) to another thread.

To return a value from one thread to another you need some kind of a communication channel. One thread puts a value in the channel, another picks it up. Instead of providing one channel abstraction, as does ML or Haskell, C++11 splits it into two separate abstractions: the promise and the future. The promise is the push end of the channel, the future is the pull end. (In Rust there are similar objects called Chan and Port.)

The general pattern is for the client to construct a promise, get the future from it using get_future, and start a thread, passing it the promise. When the thread is done, it puts the result in the promise using set_value. In the meanwhile, the calling thread may do some other work and eventually decide to retrieve the result from the future by calling its method get. If the promise has been fulfilled, get returns immediately with the value, otherwise it blocks until the value is available.

This pattern involves some boilerplate code dealing with the promise side of things, so the Standard introduced a shortcut called std::async to simplify it. You call std::async with a plain function (closure) and its result is automatically put into a hidden promise. All the client sees is the future side of the channel. (I am simplifying things by ignoring exception handling and various modes of starting async.)

The Functor Pattern

Here’s the first abstraction: A future is an object that encapsulates a value. By itself, this would be a pretty useless abstraction unless the encapsulation came with some other functionality or restriction. For instance, std::unique_ptr encapsulates a value, but also manages the lifetime of the memory it occupies. A future encapsulates a value, but you might have to block to get it. Functional languages have a very useful pattern for just this kind of situation: the functor pattern (not to be confused with the C++ misnomer for a function object). A functor encapsulates a value of an arbitrary type, plus it lets you act on it with a function.

Notice that the functor doesn’t necessarily give you access to the value — instead it lets you modify it. The beauty of it is that, in the case of a future, a functor gives you the means to modify the value that potentially isn’t there yet — and it lets you do it without blocking. Of course, behind the scenes, the function (closure) that you provide is stored in the future and only applied when the value is ready and is being accessed using get.

The first part of the fix that was proposed to the Committee was to turn std::future into a functor. Technically, this is done by adding a new method, then:

template<typename F>
auto future::then(F&& func) -> future<decltype(func(*this))>;

This method takes a function object func to be applied to the future in question. The result is a new future of the type that is returned by the function object, decltype(func(*this)).

Things are slightly muddled by the fact that a future not only encapsulates the value to be calculated but also the possibility of an exception. This is why the function passed to then takes the whole future, from which it can extract the value using get, which at that point is guaranteed not to block, but may rethrow an exception. There is an additional proposal N3865 to introduce another method, next, that would deal only with the value, not the exception. The advantage of next is that it could be called with a regular function unaware of the existence of futures, with no additional boilerplate. For simplicity, I’ll be using next in what follows.

The functor pattern makes perfect sense for composing a regular function on top of an asynchronous function (one returning a future), but it’s more general than that. Any time you have an object that is parameterized by an arbitrary type, you might be dealing with a functor. In C++, that would be a template class that doesn’t impose any restrictions on its template argument. Most containers have this property. In order for a generic class to be a functor it must also support a means to operate on its contents. Most containers in STL provide this functionality through the algorithm std::transform. For an imperative programmer it might come as a surprise that such disparate things as futures and containers fall under the same functional pattern — a functor.

Unlike in functional languages, in C++ there is no natural reusable expression for the functor pattern, so it’s more of the pattern in the head of the programmer. For instance, because of memory management considerations, std::transform operates on iterators rather than containers — the storage for the target container must be either pre-allocated or allocated on demand through iterator adapters. One could try to provide iterator adapters for futures, so they could be operated on by std::transform, but ultimately the transformation has to act on the internals of the future (i.e., store the function object in it) so it either has to be a method or a friend of the future.

The Monad Pattern

The functor pattern is not enough to provide full composability for futures. The likely scenario is that the user creates a library of future-returning functions, each performing a specific task. He or she then needs the means to combine such functions into more complex tasks. This is, for instance, the case when combining asynchronous operations, such as opening a file and then reading from it. Suppose we have the async_open function that returns a file handle future:

future<HANDLE> async_open(string &);

and the async_read function that takes a file handle and returns a future with the buffer filled with data:

future<Buffer> async_read(HANDLE fh);

If you combine the two using next, the result will be a future of a future:

future<future<Buffer>> ffBuf = async_open("foo").next(&async_read);

In order to continue chaining such calls without blocking — for instance to asynchronously process the buffer — you need a way to collapse the double future to a single future and then call next on it.

The collapsing method, unwrap, is another part of the extended future proposal. When called on a future<future<T>> it returns future<T>. It lets you chain asynchronous functions using next followed by unwrap.

async_open("foo").next(&async_read).unwrap().next(&async_process);

In functional programming such a collapsing function is called join. The combination next followed by unwrap (or, in Haskell, fmap followed by join) is so common that it has its own name, bind (in Haskell it’s the operator >>=). It might make sense to make bind another method of future (possibly under a different name). [Edit: In fact, the proposal (n3721) is to overload then to automatically perform unwrap whenever the result is a future of a future. This way then would also work as bind.]

There’s one more important usage pattern: a function that may execute asynchronously, but sometimes returns the result immediately. This often happens in recursive algorithms, when the recursion bottoms up. For instance, a parallel tree traversal function may spawn asynchronous tasks for traversing the children of a node, but when it reaches a leaf, it might want to return the result synchronously. Instead of writing complicated conditional code at each level, it’s easier to provide a “fake” future whose contents is immediately available — whose get method never blocks. Such fake future and the function that creates it called make_ready_future are also part of the proposal.

Together, the methods next (or then) and unwrap, and the function make_ready_future are easily recognizable by a functional programmer as forming the monad pattern (in Haskell, they would be called, respectively, fmap, join, and return). It’s a very general pattern for composing functions that return encapsulated values. Using a monad you may work with such functions directly, rather than unwrapping their results at every step. In the case of futures, this is an important issue, since the “unwrapping” means making a potentially blocking call to get and losing precious opportunities for parallelism. You want to set up as much computation up front and let the system schedule the most advantageous execution.

Combining functions using next, unwrap (or, equivalently, bind), and make_ready_future is equivalent to specifying data dependencies between computations and letting the runtime explore opportunities for parallelism between independent computations.

The Applicative Pattern

The combinators then and next are designed for linear composition: the output of one computation serves as the input for another. A more general pattern requires the combining of multiple asynchronous sources of data. In functional programming the problem would be described as applying a function to multiple arguments, hence the name “applicative” pattern. A functional programmer would take a multi-argument function and “lift” it to accept futures instead of immediate values.

As expected, in imperative programming things are a little messier. You have to create a barrier for all the input futures, retrieve the values, and then pass them to the multi-argument function or algorithm. The proposal contains a function called when_all that implements the first part of the process — the barrier. It takes either a pair of iterators to a container of futures or a variable number of futures, and returns a future that fires when all the arguments are ready. Conceptually, it performs a logical AND of all input futures.

The iterator version of when_all returns a future of a vector of futures, while the variadic version returns a future of a tuple of futures. It’s up to the client to get the resulting vector or tuple and iterate over it. Because of that, it’s not possible to directly chain the results of when_all the way then or next does it.

If you’re wondering how this kind of chaining is done in a functional language, you have to understand what partial application is. A function of many arguments doesn’t have to be applied to all of the arguments at once. You can imagine that applying it to the first argument doesn’t yield a value but rather a function on n-1 arguments. In C++11, this can be accomplished by calling std::bind, which takes a multi-parameter function and a value of the first argument, and returns a function object (a closure) that takes the remaining n-1 arguments (actually, you may pass it more than one argument at a time).

In this spirit, you could bind a multi-parameter function to a single future and get a future of a function of n-1 arguments. Then you are left with the problem of applying a future of a function to a future of an argument, and that’s exactly what the applicative pattern is all about. In Haskell, the Applicative class defines the operator <*> that applies an encapsulated function to an encapsulated value.

The Monoid Pattern

A very common pattern is to start several computations in parallel and pick the one that finishes first. This is the basis of speculative computation, where you pitch several algorithms against each other. Or you might be waiting for any of a number of asynchronous events, and attend to them as soon as they happen.

At a minimum you would expect a combinator that acts like a logical OR of two futures. A functional programmer would be immediately on the lookout for the monoid pattern. A monoid is equipped with a binary operation and a unit element. If the binary operation on futures picks the one that finishes first, what should the unit future be? A unit combined with any element must give back that same element. Therefore we need a future that would lose the race with any other future. We could call this special future “never.” Calling get on such a future would block forever.

In practice, one could slightly relax the definition of the “never” future. It would never return a result, but it could still throw an exception. A future like this could be used to implement a timeout. Pitching it against another future would either let the other future complete, or result in a timeout exception.

This is not the way the future extension proposal went, though. The proposed combinator is called when_any and it takes either a pair of iterators to a container of futures or a variable number of futures. It returns a future of either a vector or a tuple of futures. It’s up to the client to iterate over those futures and find the one (or the ones) that fired by calling is_ready on each of them.

The advantage of this approach is that the client may still write code to wait for the remaining futures to finish. The disadvantage is that the client is responsible for writing a lot of boilerplate code, which will obscure the program logic.

Performance and Programming Considerations

An objection to using futures as the main vehicle for asynchronous programming was raised in N3896: Library Foundations for Asynchronous Operations. The point it that it’s possible for an asynchronous API to have a result ready before the client had the opportunity to provide the continuation by calling then (or next). This results in unnecessary synchronization, which may negatively impact performance.

The alternative approach is to pass the continuation (the handler) directly to the asynchronous API. This is how a lot of asynchronous APIs are implemented at the lowest level anyway. The two approaches don’t exclude each other, but supporting both at the same time, as proposed in N3896, adds a lot of complexity to the programming model.

From the programmer’s perspective, the continuation passing model of N3896 is probably the hardest to use. The programming model is that of a state machine, with the client responsible for writing handlers for every transition.

Futures provide a useful abstraction by reifying the anticipated values. The programmer can write code as if the values were there. Futures also provide a common language between concurrent, parallel, and asynchronous worlds. It doesn’t matter if a value is to be evaluated by spawning a thread, creating a lightweight execution agent, or by calling an asynchronous API, as long as it’s encapsulated in a future. The compositional and expressional power of futures is well founded in major patterns of functional programming: the functor, the monad, the applicative, and the monoid.

There is another, even more attractive programming model that’s been proposed for C++, Resumable Functions, which makes asynchronous code look more like sequential code. This is based on a trick that’s well known to Haskell programmers in the form of the “do” notation. In C++, a resumable function would be chopped by the compiler into a series of continuations separated by await keywords. Instead of creating a future and calling then with a lambda function, the programmer would insert await and continue writing code as if the value were available synchronously.

Acknowledgment

I’d like to thank Artur Laksberg for reading the draft of this blog and providing useful feedback.


A heap is a great data structure for merging and sorting data. It’s implemented as a tree with the special heap property: A parent node is always less or equal than its children nodes, according to some comparison operator. In particular, the top element of the heap is always its smallest element. To guarantee quick retrieval and insertion, the tree doesn’t necessarily have to be well balanced. A leftist heap, for instance, is lopsided, with left branches always larger or equal to right branches.

The invariant of the leftist heap is expressed in terms of its right spines. The right spine of a tree is its rightmost path. Its length is called the rank of the tree. In a leftist heap the rank of the right child is always less or equal to the rank of the left child — the tree is leaning left. Because of that, the rank can grow at most logarithmically with the number of elements.

Leftist heap with ranks and spines. Ranks take into account empty leaf nodes, not shown.

Leftist heap with ranks and spines. Ranks take into account empty leaf nodes, not shown.

You can always merge two heaps by merging their right spines because they are just sorted linked lists. Since the right spines are at most logarithmically long, the merge can be done in logarithmic time. Moreover, it’s always possible to rotate nodes in the merged path to move heavier branches to the left and thus restore the leftist property.

With merging thus figured out, deletion from the top and insertion are trivial. After removing the top, you just merge left and right children. When inserting a new element, you create a singleton heap and merge it with the rest.

Implementation

The implementation of the functional leftist heap follows the same pattern we’ve seen before. We start with the definition:

A heap can either be empty or consist of a rank, a value, and two children: left and right heaps.

Let’s start with the definition of a non-empty heap as a private structure inside the Heap class:

template<class T>
class Heap
{
private:
    struct Tree
    {
        Tree(T v) : _rank(1), _v(v) {}
        Tree(int rank
            , T v
            , std::shared_ptr<const Tree> const & left
            , std::shared_ptr<const Tree> const & right)
        : _rank(rank), _v(v), _left(left), _right(right)
        {}

        int _rank;
        T   _v;
        std::shared_ptr<const Tree> _left;
        std::shared_ptr<const Tree> _right;
    };
    std::shared_ptr<Tree> _tree;
    ...
};

Heap data is just the shared_ptr<Tree>. An empty shared_ptr encodes an empty heap, otherwise it points to a non-empty Tree.

We’ll make the constructor of a non-empty heap private, because not all combinations of its arguments create a valid heap — see the two assertions:

Heap(T x, Heap const & a, Heap const & b)
{
    assert(a.isEmpty() || x <= a.front());
    assert(b.isEmpty() || x <= b.front());
    // rank is the length of the right spine
    if (a.rank() >= b.rank())
        _tree = std::make_shared<const Tree>(
                b.rank() + 1, x, a._tree, b._tree);
    else
        _tree = std::make_shared<const Tree>(
                a.rank() + 1, x, b._tree, a._tree);
}

We’ll make sure these assertions are true whenever we call this constructor from inside Heap code. This constructor guarantees that, as long as the two arguments are leftist heaps, the result is also a leftist heap. It also calculates the rank of the resulting heap by adding one to the rank of its right, shorter, branch. We’ll set the rank of an empty heap to zero (see implementation of rank).

As always with functional data structures, it’s important to point out that the construction takes constant time because the two subtrees are shared rather than copied. The sharing is thread-safe because, once constructed, the heaps are always immutable.

The clients of the heap will need an empty heap constructor:

Heap() {}

A singleton constructor might come in handy too:

explicit Heap(T x) : _tree(std::make_shared(x)) {}

They will need a few accessors as well:

bool isEmpty() const { return !_tree; }
int rank() const {
    if (isEmpty()) return 0;
    else return _tree->_rank;
}

The top, smallest, element is accessed using front:

T front() const { return _tree->_v; }

As I explained, the removal of the top element is implemented by merging left and right children:

Heap pop_front() const {
    return merge(left(), right()); 
}

Again, this is a functional data structure, so we don’t mutate the original heap, we just return the new heap with the top removed. Because of the sharing, this is a cheap operation.

The insertion is also done using merging. We merge the original heap with a singleton heap:

Heap insert(T x) {
    return merge(Heap(x), *this);
}

The workhorse of the heap is the recursive merge algorithm below:

static Heap merge(Heap const & h1, Heap const & h2)
{
    if (h1.isEmpty())
        return h2;
    if (h2.isEmpty())
        return h1;
    if (h1.front() <= h2.front())
        return Heap(h1.front(), h1.left(), merge(h1.right(), h2));
    else
        return Heap(h2.front(), h2.left(), merge(h1, h2.right()));
}

If neither heap is empty, we compare the top elements. We create a new heap with the smaller element at the top. Now we have to do something with the two children of the smaller element and the other heap. First we merge the right child with the other heap. This is the step I mentioned before: the merge follows the right spines of the heaps, guaranteeing logarithmic time. The left child is then combined with the result of the merge. Notice that the Heap constructor will automatically rotate the higher-rank tree to the left, thus keeping the leftist property. The code is surprisingly simple.

You might wonder how come we are not worried about the trees degenerating — turning into (left leaning) linked lists. Consider, however, that such a linked list, because of the heap property, would always be sorted. So the retrieval of the smallest element would still be very fast and require no restructuring. Insertion of an element smaller than the existing top would just prepend it to the list — a very cheap operation. Finally, the insertion of a larger element would turn this element into a length-one right spine — the right child of the top of the linked list. The degenerate case is actually our best case.

Turning an unsorted list of elements into a heap could naively be done in O(N*log(N)) time by inserting the elements one by one. But there is a better divide-and-conquer algorithm that does it in O(N) time (the proof that it’s O(N) is non-trivial though):

template<class Iter>
static Heap heapify(Iter b, Iter e)
{
    if (b == e)
        return Heap();
    if (e - b == 1)
        return Heap(*b);
    else
    {
        Iter mid = b + (e - b) / 2;
        return merge(heapify(b, mid), heapify(mid, e));
    }
}

This function is at the core of heap sort: you heapify a list and then extract elements from the top one by one. Since the extraction takes O(log(N)) time, you end up with a sort algorithm with the worst case performance O(N*log(N)). On average, heapsort is slower than quicksort, but quicksort’s worst case performance is O(N2), which might be a problem in some scenarios.

Next Page »