Lazy evaluation can be a powerful tool for structuring your code. For instance, it can let you turn your code inside out, inverting the flow of control. Many a Haskell program take advantage of laziness to express algorithms in clear succinct terms, turning them from recipes to declarations.

The question for today’s blog post is: How can we tap the power of lazy evaluation in an inherently eager language like C++? I’ll lead you through a simple coding example and gradually introduce the building blocks of lazy programming: the suspension, the lazy stream, and a whole slew of functional algorithms that let you operate on them. In the process we’ll discover some fundamental functional patterns like functors, monads, and monoids. I have discussed them already in my post about C++ futures. It’s very edifying to see them emerge in a completely different context.

# The Problem

Let’s write a program that prints the first n Pythagorean triples. A Pythagorean triple consists of three integers, x, y, and z, that satisfy the relation x^{2} + y^{2} = z^{2}. Let’s not be fancy and just go with the brute force approach. Here’s the program in C:

void printNTriples(int n) { int i = 0; for (int z = 1; ; ++z) for (int x = 1; x <= z; ++x) for (int y = x; y <= z; ++y) if (x*x + y*y == z*z) { printf("%d, %d, %d\n", x, y, z); if (++i == n) return; } }

Here, a single C function serves three distinct purposes: It

- Generates Pythagorean triples,
- Prints them,
- Counts them; and when the count reaches n, breaks.

This is fine, as long as you don’t have to modify or reuse this code. But what if, for instance, instead of printing, you wanted to draw the triples as triangles? Or if you wanted to stop as soon as one of the numbers reached 100? The problem with this code is that it’s structured inside out: both the test and the sink for data are embedded in the innermost loop of the algorithm. A more natural and flexible approach would be to:

- Generate the list of Pythagorean triples,
- Take the first ten of them, and
- Print them.

And that’s exactly how you’d write this program in Haskell:

main = print (take 10 triples) triples = [(x, y, z) | z <- [1..] , x <- [1..z] , y <- [x..z] , x^2 + y^2 == z^2]

This program reads: take 10 triples and print them. It declares `triples`

as a list (square brackets mean a list) of triples `(x, y, z)`

, where (the vertical bar reads “where”) `z`

is an element of the list of integers from 1 to infinity, `x`

is from 1 to `z`

, `y`

is from `x`

to `z`

, and the sum of squares of `x`

and `y`

is equal to the square of `z`

. This notation is called “list comprehension” and is characteristic of Haskell terseness.

You see the difference? Haskell let’s you abstract the notion of the list of Pythagorean triples so you can operate on it as one entity, whereas in C (or, for that matter, in C++) we were not able to disentangle the different, orthogonal, aspects of the program.

The key advantage of Haskell in this case is its ability to deal with infinite lists. And this ability comes from Haskell’s inherent laziness. Things are never evaluated in Haskell until they are absolutely needed. In the program above, it was the call to `print`

that forced Haskell to actually do some work: take 10 elements from the list of triples. Since the triples weren’t there yet, it had to calculate them, but only as many as were requested and not a number more.

# Suspension

We’ll start with the most basic building block of laziness: a suspended function. Here’s the first naive attempt:

template<class T> class Susp { public: explicit Susp(std::function<T()> f) : _f(f) {} T get() { return _f(); } private: std::function<T()> _f; };

We often create suspensions using lambda functions, as in:

int x = 2; int y = 3; Susp<int> sum([x, y]() { return x + y; }); ... int z = sum.get();

Notice that the suspended lambda may capture variables from its environment: here `x`

and `y`

. A lambda, and therefore a suspension, is a `closure`

.

The trouble with this implementation is that the function is re-executed every time we call `get`

. There are several problems with that: If the function is not pure, we may get different values each time; if the function has side effects, these may happen multiple times; and if the function is expensive, the performance will suffer. All these problems may be addressed by memoizing the value.

Here’s the idea: The first time the client calls `get`

we should execute the function and store the returned value in a member variable. Subsequent calls should go directly to that variable. We could implement this by setting a Boolean flag on the first call and then checking it on every subsequent call, but there’s a better implementation that uses thunks.

A thunk is a pointer to a free function taking a suspension (the `this`

pointer) and returning a value (by const reference). The `get`

method simply calls this thunk, passing it the `this`

pointer.

Initially, the thunk is set to `thunkForce`

, which calls the method `setMemo`

. This method evaluates the function, stores the result in `_memo`

, switches the thunk to `thunkGet`

, and returns the memoized value. On subsequent calls `get`

goes through the `getMemo`

thunk which simply returns the memoized value.

template<class T> class Susp { // thunk static T const & thunkForce(Susp * susp) { return susp->setMemo(); } // thunk static T const & thunkGet(Susp * susp) { return susp->getMemo(); } T const & getMemo() { return _memo; } T const & setMemo() { _memo = _f(); _thunk = &thunkGet; return getMemo(); } public: explicit Susp(std::function<T()> f) : _f(f), _thunk(&thunkForce), _memo(T()) {} T const & get() { return _thunk(this); } private: T const & (*_thunk)(Susp *); mutable T _memo; std::function<T()> _f; };

(By the way, the function pointer declaration of `_thunk`

looks pretty scary in C++, doesn’t it?)

But this is not the whole story. We have to make the suspension thread-safe or else we’ll have a data race when switching thunks. If we add a mutex to our data structure, we’ll pretty much turn it into an academic curiosity. So atomics to the rescue:

mutable std::atomic<T const & (*)(Susp *)> _thunk;

Fortunately, this is just a simple case of publication safety: the switch from one thunk to another happens only once. We can make it thread safe by using `memory_order_release`

when storing the pointer:

T const & setMemo() { _memo = _f(); _thunk.store(&thunkGet, std::memory_order_release); return getMemo(); }

and `memory_order_acquire`

when reading it (which turns into a regular load on the x86):

T const & get() { auto th = _thunk.load(std::memory_order_acquire); return th(this); }

The *release* of the thunk also guarantees that all changes to `_memo`

(which is *not* declared `atomic`

) become visible to the clients who *acquire* the new thunk, because of the “happens before” relationship they enforce.

Unfortunately, this lock free solution cannot guarantee that the suspended function will always be called only once. There is a window of opportunity between the first call to the suspended function and the setting of the thunk during which another thread might start executing the function in parallel. As long as the function is pure, however, this is perfectly safe. But it is the duty of the client to ensure the purity of the function they suspend!

You can find a lot more detail about the Haskell implementation of suspended functions in the paper by Tim Harris, Simon Marlow, and Simon Peyton Jones, Haskell on a Shared-Memory Multiprocessor.

# Lazy Stream

The loop we used to produce Pythagorean triples in C worked on the push principle — data was pushed towards the sink. If we want to deal with infinite lists, we have to use the pull principle. It should be up to the client to control the flow of data. That’s the inversion of control I was talking about in the introduction.

We’ll use a lazy list and call it a stream. In C++ a similar idea is sometimes expressed in terms of input and forward iterators, although it is understood that an iterator itself is not the source or the owner of data — just an interface to one. So we’ll stick with the idea of a stream.

We’ll implement the stream in the functional style as a persistent data structure fashioned after persistent lists (see my series of blog post on persistent data structures). It means that a stream, once constructed, is never modified. To “advance” the stream, we’ll have to create a new one by calling the `const`

method `pop_front`

.

Let’s start with the definition: A stream is either empty or it contains a suspended cell. This immediately suggests the implementation as a (possibly null) pointer to a cell. Since the whole stream is immutable, the cell will be immutable too, so it’s perfectly safe to share it between copies of the stream, including inter-thread sharing. We can therefore use a shared pointer:

template<class T> class Stream { private: std::shared_ptr <Susp<Cell<T>>> _lazyCell; };

Of course, because of reference counting and memoization, the stream is only conceptually immutable; but we (and the implementers of `shared_ptr`

) have made sure that the hidden mutation is thread safe.

So what’s in the `Cell`

? Remember, we want to be able to generate infinite sequences, so `Stream`

must contain the DNA for not only producing the value of type `T`

but also for producing the offspring — another (lazy) `Stream`

of values. The `Cell`

is just that: A value and a stream.

template<class T> class Cell { public: Cell() {} // need default constructor for memoization Cell(T v, Stream<T> const & tail) : _v(v), _tail(tail) {} explicit Cell(T v) : _v(v) {} T val() const { return _v; } Stream<T> pop_front() const { return _tail; } private: T _v; Stream<T> _tail; };

This mutually recursive pair of data structures works together amazingly well.

template<class T> class Stream { private: std::shared_ptr <Susp<Cell<T>>> _lazyCell; public: Stream() {} Stream(std::function<Cell<T>()> f) : _lazyCell(std::make_shared<Susp<Cell<T>>>(f)) {} Stream(Stream && stm) : _lazyCell(std::move(stm._lazyCell)) {} Stream & operator=(Stream && stm) { _lazyCell = std::move(stm._lazyCell); return *this; } bool isEmpty() const { return !_lazyCell; } T get() const { return _lazyCell->get().val(); } Stream<T> pop_front() const { return _lazyCell->get().pop_front(); } };

There are several things worth pointing out. The two constructors follow our formal definition of the `Stream`

: one constructs an empty stream, the other constructs a suspended `Cell`

. A suspension is created from a function returning `Cell`

.

I also added a move constructor and a move assignment operator for efficiency. We’ll see it used in the implementation of `forEach`

.

The magic happens when we call `get`

for the first time. That’s when the suspended `Cell`

comes to life. The value and the new stream are produced and memoized for later use. Or, this may happen if the first call is to `pop_front`

. Notice that `pop_front`

is a `const`

method — the `Stream`

itself is immutable. The method returns a new stream that encapsulates the rest of the sequence.

Let’s get our feet wet by constructing a stream of integers from `n`

to infinity. The constructor of a `Stream`

takes a function that returns a `Cell`

. We’ll use a lambda that captures the value of `n`

. It creates a `Cell`

with that value and a tail, which it obtains by calling `intsFrom`

with `n+1`

:

Stream<int> intsFrom(int n) { return Stream<int>([n]() { return Cell<int>(n, intsFrom(n + 1)); }); }

It’s a recursive definition, but without the usual recursive function calls that eat up the stack. The call to the inner `intsFrom`

is not made from the outer `intsFrom`

. Instead it’s made the first time `get`

is called on the emerging `Stream`

.

Of course, we can also create finite streams, like this one, which produces integers from `n`

to `m`

:

Stream<int> ints(int n, int m) { if (n > m) return Stream<int>(); return Stream<int>([n, m]() { return Cell<int>(n, ints(n + 1, m)); }); }

The trick is to capture the limit `m`

as well as the recursion variable `n`

. When the limit is reached, we simply return an empty `Stream`

.

We’ll also need the method `take`

, which creates a `Stream`

containing the first `n`

elements of the original stream:

Stream take(int n) const { if (n == 0 || isEmpty()) return Stream(); auto cell = _lazyCell; return Stream([cell, n]() { auto v = cell->get().val(); auto t = cell->get().pop_front(); return Cell<T>(v, t.take(n - 1)); }); }

Here we are capturing the suspended cell and use it to lazily generate the elements of the new, truncated, `Stream`

. Again, the key to understanding why this works is to keep in mind that `Stream`

s and `Cell`

s are conceptually immutable, and therefore can be shared by the implementation. This has some interesting side effects, which don’t influence the results, but change the performance. For instance, if the caller of `take`

forces the evaluation of the first n elements — e.g., by passing them through the consuming `forEach`

below — these elements will appear miraculously memoized in the original `Stream`

.

Finally, we’ll need some way to iterate through streams. Here’s an implementation of `forEach`

that consumes the stream while enumerating it and feeding its elements to a function.

template<class T, class F> void forEach(Stream<T> strm, F f) { while (!strm.isEmpty()) { f(strm.get()); strm = strm.pop_front(); } }

It’s the assignment:

strm = strm.pop_front();

which consumes the stream by decreasing the reference count of the head of the `Stream`

. In particular, if you pass an rvalue `Stream`

to `forEach`

, its elements will be generated and deleted in lockstep. The algorithm will use constant memory, independent of the virtual length of the `Stream`

. What Haskell accomplishes with garbage collection, we approximate in C++ with reference counting and `shared_ptr`

.

# Working with Streams

It’s not immediately obvious how to translate our Pythagorean triple program from nested loops to lazy streams, so we’ll have to take inspiration from the corresponding Haskell program. Let me first rewrite it using a slightly different notation:

triples = do z <- [1..] x <- [1..z] y <- [x..z] guard (x^2 + y^2 == z^2) return (x, y, z)

The general idea is this: Start with the stream of integers from 1 to infinity. For every such integer — call it `z`

— create a stream from 1 to `z`

. For each of those — call them `x`

— create a stream from `x`

to `z`

. Filter out those which don’t satisfy the Pythagorean constraint. Finally, output a stream of tuples `(x, y, z)`

.

So far we’ve learned how to create a stream of integers — we have the function `intsFrom`

. But now we’ll have to do something for each of these integers. We can’t just enumerate those integers and apply a function to each, because that would take us eternity. So we need a way to act on each element of a stream lazily.

In functional programming this is called mapping a function over a list. In general, a parameterized data structure that can be mapped over is called a functor. I’m going to show you that our `Stream`

is a functor.

# Stream as a Functor

The idea is simple: we want to apply a function to each element of a stream to get a new transformed stream (it’s very similar to the `std::transform`

algorithm from STL). The catch is: We want to do it generically and lazily.

To make the algorithm — we’ll call it `fmap`

— generic, we have to parameterize it over types. The algorithm starts with a `Stream`

of elements of type `T`

and a function from `T`

to some other type `U`

. The result should be a stream of `U`

.

We don’t want to make `U`

the template argument, because then the client would have to specify it explicitly. We want the compiler to deduce this type from the type of the function. We want, therefore, the function type `F`

to be the parameter of our template (this will also allow us to call it uniformly with function pointers, function objects, and lambdas):

template<class T, class F> auto fmap(Stream<T> stm, F f)

Without the use of concepts, we have no way of enforcing, or even specifying, that `F`

be a type of a function from `T`

to `U`

. The best we can do is to statically assert it inside the function:

static_assert(std::is_convertible<F, std::function<U(T)>>::value, "fmap requires a function type U(T)");

But what is `U`

? We can get at it using `decltype`

:

decltype(f(stm.get()));

Notice that `decltype`

takes, as an argument, an expression that can be statically typed. Here, the expression is a function call of `f`

. We also need a dummy argument for this function: we use the result of `stm.get()`

. The argument to `decltype`

is never evaluated, but it is type-checked at compile time.

One final problem is how to specify the return type of `fmap`

. It’s supposed to be `Stream<U>`

, but we don’t know `U`

until we apply `decltype`

to the arguments of `fmap`

. We have to use the new `auto`

function declaration syntax of C++11. So here are all the type-related preliminaries:

template<class T, class F> auto fmap(Stream<T> stm, F f)->Stream<decltype(f(stm.get()))> { using U = decltype(f(stm.get())); static_assert(std::is_convertible<F, std::function<U(T)>>::value, "fmap requires a function type U(T)"); ... }

Compared to that, the actual implementation of `fmap`

seems rather straightforward:

if (stm.isEmpty()) return Stream<U>(); return Stream<U>([stm, f]() { return Cell<U>(f(stm.get()), fmap(stm.pop_front(), f)); });

In words: If the stream is empty, we’re done — return an empty stream. Otherwise, create a new stream by suspending a lambda function. That function captures the original stream (by value) and the function `f`

, and returns a `Cell`

. That cell contains the value of `f`

acting on the first element of the original stream, and a tail. The tail is created with `fmap`

acting on the rest of the original stream.

Equipped with `fmap`

, we can now attempt to take the first step towards generating our triples: apply the function `ints(1, z)`

to each element of the stream `intsFrom(1)`

:

fmap(intsFrom(1), [](int z) { return ints(1, z); });

The result is a `Stream`

of `Stream`

s of integers of the shape:

1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 ...

But now we are stuck. We’d like to apply `ints(x, z)`

to each element of that sequence, but we don’t know how to get through two levels of `Stream`

. Our `fmap`

can only get through one layer. We need a way to flatten a `Stream`

of `Stream`

s. That functionality is part of what functional programmers call a monad. So let me show you that `Stream`

is indeed a monad.

# Stream as a Monad

If you think of a `Stream`

as a list, the flattening of a list of lists is just concatenation. Suppose for a moment that we know how to lazily concatenate two `Stream`

s (we’ll get to it later) and let’s implement a function `mjoin`

that concatenates a whole `Stream`

of `Stream`

s.

You might have noticed a pattern in the implementation of lazy functions on streams. We use some kind of recursion, which starts with “Are we done yet?” If not, we do an operation that involves one element of the stream and the result of a recursive call to the function itself.

The “Are we done yet?” question usually involves testing for an empty stream. But here we are dealing with a `Stream`

of `Stream`

s, so we have to test two levels deep. This way we’ll ensure that the concatenation of a `Stream`

of empty `Stream`

s immediately returns an empty `Stream`

.

The recursive step in `mjoin`

creates a `Cell`

whose element is the head of the first stream, and whose tail is the concatenation of the tail of the first stream and the result of `mjoin`

of the rest of the streams:

template<class T> Stream<T> mjoin(Stream<Stream<T>> stm) { while (!stm.isEmpty() && stm.get().isEmpty()) { stm = stm.pop_front(); } if (stm.isEmpty()) return Stream<T>(); return Stream<T>([stm]() { Stream<T> hd = stm.get(); return Cell<T>( hd.get() , concat(hd.pop_front(), mjoin(stm.pop_front()))); }); }

The combination of `fmap`

and `mjoin`

lets us compose function like `intsFrom`

or `ints`

that return `Stream`

s. In fact, this combination is so common that it deserves its own function, which we’ll call `mbind`

:

template<class T, class F> auto mbind(Stream<T> stm, F f) -> decltype(f(stm.get())) { return mjoin(fmap(stm, f)); }

If we use `mbind`

in place of `fmap`

:

mbind(intsFrom(1), [](int z) { return ints(1, z); });

we can produce a flattened list:

1 1 2 1 2 3 1 2 3 4...

But it’s not just the list: Each element of the list comes with variables that are defined in its environment — here the variable `z`

. We can keep chaining calls to `mbind`

and capture more variables in the process:

mbind(intsFrom(1), [](int z) { return mbind(ints(1, z), [z](int x) { return mbind(ints(x, z), [x, z](int y) { ... } } }

At this point we have captured the triples `x`

, `y`

, `z`

, and are ready for the Pythagorean testing. But before we do it, let’s define two additional functions that we’ll use later.

The first one is `mthen`

which is a version of `mbind`

that takes a function of no arguments. The idea is that such a function will be executed for each element of the stream, but it won’t use the value of that element. The important thing is that the function will *not* be executed when the input stream is empty. In that case, `mthen`

will return an empty stream.

We implement `mthen`

using a slightly modified version of `fmap`

that takes a function `f`

of no arguments:

template<class T, class F> auto fmapv(Stream<T> stm, F f)->Stream<decltype(f())> { using U = decltype(f()); static_assert(std::is_convertible<F, std::function<U()>>::value, "fmapv requires a function type U()"); if (stm.isEmpty()) return Stream<U>(); return Stream<U>([stm, f]() { return Cell<U>(f(), fmapv(stm.pop_front(), f)); }); }

We plug it into the definition of `mthen`

the same way `fmap`

was used in `mbind`

:

template<class T, class F> auto mthen(Stream<T> stm, F f) -> decltype(f()) { return mjoin(fmapv(stm, f)); }

The second useful function is `mreturn`

, which simply turns a value of any type into a one-element `Stream`

:

template<class T> Stream<T> mreturn(T v) { return Stream<T>([v]() { return Cell<T>(v); }); }

We’ll need `mreturn`

to turn our triples into `Stream`

s.

It so happens that a parameterized type equipped with `mbind`

and `mreturn`

is called a monad (it must also satisfy some additional monadic laws, which I won’t talk about here). Our lazy `Stream`

is indeed a monad.

# Stream as a Monoid and a Monad Plus

When implementing `mjoin`

we used the function `concat`

to lazily concatenate two `Stream`

s. Its implementation follows the same recursive pattern we’ve seen so many times:

template<class T> Stream<T> concat( Stream<T> lft , Stream<T> rgt) { if (lft.isEmpty()) return rgt; return Stream<T>([=]() { return Cell<T>(lft.get(), concat<T>(lft.pop_front(), rgt)); }); }

What’s interesting is that the concatenation of streams puts them under yet another well known functional pattern: a monoid. A monoid is equipped with a binary operation, just like `concat`

, which must be associative and possess a unit element. It’s easy to convince yourself that concatenation of `Stream`

s is indeed associative, and that the neutral element is an empty `Stream`

. Concatenating an empty `Stream`

, whether in front or in the back of any other `Stream`

, doesn’t change the original `Stream`

.

What’s even more interesting is that being a combination of a monoid and a monad makes `Stream`

into a *monad plus*, and every monad plus defines a `guard`

function — exactly what we need for the filtering of our triples. This function takes a Boolean argument and outputs a `Stream`

. If the Boolean is false, the `Stream`

is empty (the unit element of monad plus!), otherwise it’s a singleton `Stream`

. We really don’t care what value sits in this `Stream`

— we never use the result of `guard`

for anything but the flow of control. In Haskell, there is a special “unit” value `()`

— here I use a `nullptr`

as its closest C++ analog.

Stream<void*> guard(bool b) { if (b) return Stream<void*>(nullptr); else return Stream<void*>(); }

We can now pipe the result of `guard`

into `mthen`

, which will ignore the content of the `Stream`

but won’t fire when the `Stream`

is empty. When the `Stream`

is not empty, we will call `mreturn`

to output a singleton `Stream`

with the result tuple:

Stream<std::tuple<int, int, int>> triples() { return mbind(intsFrom(1), [](int z) { return mbind(ints(1, z), [z](int x) { return mbind(ints(x, z), [x, z](int y) { return mthen(guard(x*x + y*y == z*z), [x, y, z]() { return mreturn(std::make_tuple(x, y, z)); }); }); }); }); }

These singletons will then be concatenated by the three levels of `mbind`

to create one continuous lazy `Stream`

of Pythagorean triples.

Compare this function with its Haskell counterpart:

triples = do z <- [1..] x <- [1..z] y <- [x..z] guard (x^2 + y^2 == z^2) return (x, y, z)

Now, the client can `take`

10 of those triples from the `Stream`

— and the triples still won’t be evaluated!. It’s the consuming `forEach`

that finally forces the evaluation:

void test() { auto strm = triples().take(10); forEach(std::move(strm), [](std::tuple<int, int, int> const & t) { std::cout << std::get<0>(t) << ", " << std::get<1>(t) << ", " << std::get<2>(t) << std::endl; }); }

# Conclusion

The generation of Pythagorean triples is a toy example, but it shows how lazy evaluation can be used to restructure code in order to make it more reusable. You can use the same function `triples`

to print the values in one part of your program and draw triangles in another. You can filter the triples or impose different termination conditions. You can use the same trick to generate an infinite set of approximation to the solution of a numerical problem, and then use different strategies to truncate it. Or you can create an infinite set of animation frames, and so on.

The building blocks of laziness are also reusable. I have used them to implement the solution to the eight-queen problem and a conference scheduling program. They are by construction thread safe and the combinators that bind them are thread safe too. This is, in general, the property of persistent data structures.

You might be concerned about the performance of lazy data structures, and rightly so. They use the heap heavily, so memory allocation and deallocation is a serious performance bottleneck. There are many situation, though, where code structure, reusability, maintenance, and correctness (especially in multithreaded code) are more important than performance. And there are some problems that might be extremely hard to express without the additional flexibility gained from laziness.

I made the sources to all code in this post available on GitHub.