I am frankly disappointed with C++0x futures. They lack a major feature–composability. It’s like having Booleans without OR or AND.

But let me first explain what futures are and how they’re implemented in C++0x.

What is a future?

In a nutshell, a future is an asynchronous function call. You call a function, but don’t wait until it completes–the body of the function executes in a separate thread. Instead you get a future, a promise to deliver the result at some later point in time. Only when your program desperately needs that value, do you synchronize on the future. If the calculation hasn’t been completed by then, you block until it has. Otherwise you get the result of the calculation.

C++0x futures

C++ splits the implementation of futures into a set of small blocks–it almost creates an assembly language of futures. Here are the primitives.

-promise

A promise is a vehicle for passing the return value (or an exception) from the thread executing a function to the thread that cashes in on the function future. The function (or a callable object) that is executed in a separate thread must have access to the promise, and it explicitly sets the return value.

You can think of a promise as a primitive channel; the function in this case is the sender. The equivalent of the “send” method is called promise::set_value (to pass an exception to the caller, the callee calls promise::set_exception).

There is no corresponding “receive” method–the receiving is abstracted into the future object.

Here’s some code that illustrates the use of promise. Notice how the function to be called asynchronously must be aware of the promise.

void asyncFun(promise<int> intPromise)
{
    int result;
    try {
        // ... calculate the result ...
        intPromise.set_value(result);
    } catch (MyException const & e) {
        intPromise.set_exception(std::copy_exception(e));
    }
}

The calling thread creates the promise and passes it to the worker thread executing asyncFun (I’ll show the details later).

-future

A future is the synchronization object constructed around the receiving end of the promise channel. The calling thread obtains a future using promise::get_future.

The most important method of the future is get. If the result is not ready, get will block. When get completes, it either returns a value or throws an exception. The return value or the exception has been set by the called function through the promise associated with the future (see above).

get can also be split into its more basic parts: wait, optionally followed by has_value (or has_exception), and a final call to get, which at that point is guaranteed not to block. The advantage of this approach is that one can use the timed variants of wait, namely wait_for and wait_until, which accept timeouts.

There is also a non-blocking method, is_ready, that can be used for polling instead of blocking.
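Given a future<int> called intFuture, here’s a minimal sketch of the split approach, using the method names described above (these are draft-era names, and the exact signatures varied between drafts):

intFuture.wait();                // block until a value or exception is set
if (intFuture.has_value()) {
    int r = intFuture.get();     // guaranteed not to block at this point
} else if (intFuture.has_exception()) {
    intFuture.get();             // rethrows the stored exception
}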

There are two separate implementations of future: the regular future (which used to be called unique_future in the draft Standard) that works like a unique_ptr as far as passing values goes, and shared_future that works more like shared_ptr. For instance, the method get of future cannot be called twice because it transfers the ownership of the result.
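A minimal sketch of the difference (fut2 stands for a second, separately obtained future<int>):

future<int> fut = intPromise.get_future();
int a = fut.get();      // ok: ownership of the result is transferred out
// int b = fut.get();   // wrong: the result has already been claimed

shared_future<int> sfut(std::move(fut2)); // fut2 is another future<int>
int x = sfut.get();     // ok
int y = sfut.get();     // ok again: shared_future allows repeated access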

Here’s an example of how a future could be used in the caller’s code:

promise<int> intPromise;
future<int> intFuture = intPromise.get_future();
std::thread t(asyncFun, std::move(intPromise));
// do some other stuff
int result = intFuture.get(); // may throw MyException

-packaged_task

In most cases the use of promises can be hidden from the programmer. There is a template class, packaged_task, that takes any function (or callable object) and instruments it for use with futures. In essence, it creates a promise and calls the function from inside a try/catch block. If the function returns, packaged_task puts the value in the promise; if the function throws, it sets the exception. A packaged_task is itself a callable object (it has the function call operator() defined) and can be passed directly to a thread.
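A much-simplified sketch of what such a class might do internally (SimpleTask is hypothetical and hardwired to a single int argument; the real packaged_task is fully general):

template<class R>
class SimpleTask
{
public:
    explicit SimpleTask(R (*f)(int)) : _f(f) {}
    future<R> get_future() { return _promise.get_future(); }
    void operator()(int arg) // callable, so it can be passed to a thread
    {
        try {
            _promise.set_value(_f(arg)); // the function returned a value
        } catch (...) {
            _promise.set_exception(std::current_exception()); // it threw
        }
    }
private:
    R (*_f)(int);
    promise<R> _promise;
};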

Here’s a more complete example:

#include <future>
#include <thread>
#include <vector>
using namespace std;

vector<int> primesUpTo(int n); // defined elsewhere

int main() {
    packaged_task<vector<int>(int)> task(&primesUpTo);
    future<vector<int>> fut = task.get_future();
    thread t(move(task), 100);
    t.detach();
    // do some other work while the primes are computing
    vector<int> primes = fut.get();
    // print primes
    return 0;
}

Composability

So far so good. The creation of futures might be a bit verbose in C++, but it can be easily adapted to concrete environments. The futures standard library should be treated as a set of primitive building blocks from which to build higher abstractions.

Notice how C++ separated the channel (promise) from the synchronization mechanism (future); and thread creation from communication and synchronization. You may, for instance, use futures with thread pools or create threads on the spot.

What was omitted from the standard, however, was the ability to compose futures. Suppose you start several threads to perform calculations or retrieve data in parallel. You want to communicate with those threads using futures. And here’s the problem: you may block on any individual future but not on all of them. While you are blocked on one, other futures might become ready. Instead of spending your precious time servicing those other futures, you might be blocked on the most time-consuming one. The only option to process futures in order of completion is by polling (calling is_ready on consecutive futures in a loop) and burning the processor.
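To make the problem concrete, here’s a sketch of the polling workaround (processResult is a hypothetical handler):

vector<future<int>> futures; // one future per worker thread
while (!futures.empty()) {
    for (auto it = futures.begin(); it != futures.end(); ) {
        if (it->is_ready()) {
            processResult(it->get()); // service the completed future
            it = futures.erase(it);
        } else {
            ++it; // keep spinning, burning the processor
        }
    }
}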

The need for composability of synchronization events has been widely recognized. Windows has WaitForMultipleObjects, Unix has select, and several higher-level languages have explicit constructs that serve the same purpose. But that’s a topic for my next installment.

Acknowledgments

I’d like to thank Anthony Williams for helpful comments. Among other things, he is the author of one of the futures proposals for Boost. A stripped down version of it will become part of the threadpool library. Unfortunately, the stripping got rid of composability features.


An immutable object never changes. You can bet your program on it. As I explained in my previous post, the same is not true for const objects (or readonly objects, in dialects of Java). They may be changed through mutable aliases. An immutable object has no mutable aliases. Ever!

Small print: this guarantee is predicated on the programmer not overriding the type system with casts and other escape mechanisms.

To my knowledge, immutability is currently available in the D programming language and in a Java dialect called IGJ (Immutability Generic Java). It is the default in functional languages.

The closest you can get to immutability in C++ is by using const as a storage class:

const double pi = 3.141592;
const char ERRORMSG[] = "Your bank went belly up.";

Here, the value of pi or ERRORMSG is guaranteed to never change. Global or static immutable values can be used at compile time (for instance as labels in a switch statement).
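For instance (a minimal sketch; the names are made up):

const unsigned MaxRetries = 3; // immutable, known at compile time

void report(unsigned errorCount)
{
    switch (errorCount) {
    case MaxRetries: // ok: a compile-time constant may be a case label
        // give up
        break;
    default:
        break;
    }
}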

Creating Immutable Objects

Defining immutable numbers, arrays, or POD structs is pretty straightforward. But what about more complex objects that don’t have static initializers? How do you create an immutable linked list? In functional languages, lists are treated as built-in types; like arrays in general purpose languages. They can be statically initialized. But in C++ or Java the creation of a list involves memory allocations and reference manipulation. Since, by definition, we can’t manipulate an immutable list, it seems like we can never create one!

How about relaxing the constraints a little to allow mutation inside the constructor of an immutable object? Here’s a hypothetical example of a list in D (the current D compiler doesn’t fully implement immutability, so my examples are written in pseudo-D):

class IntList
{
public:
  // default constructor
  this() {} // _head is default-initialized to null
  // one element constructor
  this(int i) {
    _head = new IntLink(i); // mutates _head
  }
private:
  IntLink _head;
}
// will this work?
immutable IntList oneList = new immutable IntList(1);

There is a significant problem with this solution. If this is not considered immutable inside the constructor then there is no guarantee that a mutable reference to this or any of its subobjects won’t escape its scope. Consider this example:

IntLink globalLink; // mutable!

IntList.this(int i) {
  _head = new IntLink(i);
  globalLink = _head; // escape!
}

immutable IntList immList = new immutable IntList(1);
globalLink.setValue(2); // mutates immList!

Here, an immutable object immList has been mutated through an alias globalLink. We can’t allow this!

It’s true that a compiler could perform escape analysis on the constructor of IntList, provided it has access to its source code, which might not be available when it’s compiling the statement that creates the immutable object. After all, class IntList might be implemented in a third-party library.

In the absence of source code, the only other possibility is to include immutability information in the type signature of the constructor. When an immutable object is created, the compiler would use an immutable constructor, and it would fail if one didn’t exist. Conversely, an immutable constructor would not compile if it allowed a mutable reference to escape. This bad code would not compile:

IntList.this(int i) immutable {
  _head = new IntLink(i);
  globalLink = _head; // error!
}

Of course, no mutable methods may be called from inside an immutable constructor–they couldn’t guarantee the non-escape of mutable aliases.

This solution works, even if it’s not perfect. It often leads to code duplication (the immutable constructor being identical to the mutable one, as in the IntList example). Moreover, it prevents some forms of refactoring. Even though, inside an immutable constructor, you may initialize an object’s fields, you can’t delegate this task to a (perforce immutable) method of the object.

Assignment is not the same as Mutation

The key to immutable construction is the observation that, when constructing an immutable object, it’s okay to assign the object’s fields but not to mutate them. During construction only such “shallow” mutation should be possible.

In my example, the assignment to _head is okay, but the mutation of the IntLink object attached to it should be prohibited. Indeed, I don’t need to mutate the head link once it’s constructed. Of course the construction of an immutable IntLink follows the same rules. Here’s the relevant code:

class IntLink
{
  this(int i) immutable {
    // _next is default-initialized to null
    _val = i; // field assignment
  }
  int _val;
  IntLink _next;
}

With this understanding, it’s actually possible to minimize code duplication. To this end, IGJ introduces a new type modifier, AssignFields. A constructor or a method that performs no other mutation but field assignment may be declared AssignFields. Since AssignFields methods and AssignFields constructors can also be used in mutable contexts, they don’t have to be duplicated. Expanding on the above example:

class IntLink
{
  this(int i) assignfields {
    // _next is default-initialized to null
    SetValue(i); // Ok: it's an assignfields method
  }
  void SetValue(int i) assignfields {
    _val = i; // field assignment
  }
  int _val;
  IntLink _next;
}

As you can see, I was even able to refactor the part of the constructor that does the assignment to _val. I can now use the same constructor in both mutable and immutable contexts. The SetValue method can only be called in a mutable or assignfields context.

immutable IntLink iLink = new immutable IntLink(1);
IntLink mLink = new IntLink(2);
mLink.SetValue(3);

Subtyping relationships

It is okay to pass an immutable object to a function that expects a const object. After all, such a function will not mutate the object. If a reference to the object escapes the function, it can only be a const reference. And again, const reference cannot be used to mutate the object, so we’re fine.

The compiler will allow this subsumption if we establish that immutable is a subtype of const (more precisely, for any type T, immutable T is a subtype of const T). This is very similar to the compiler allowing the passing of a derived-class object to a function that accepts a base-class object–the Liskov substitution principle.

The full subtyping hierarchy between various mutability annotations is as follows:

  • immutable is a subtype of const: You can pass an immutable object to a function taking a const argument.
  • assignfields is a subtype of const: You can call a const method from an assignfields method (subsumption of this).
  • mutable is a subtype of assignfields. You can call an assignfields method on a mutable object (as in mLink.SetValue()).

Because of transitivity, mutable is also a subtype of const. You can pass a mutable object to a function that takes a const argument (you do it in C++ without much thinking).

(Figure: the subtyping relationships between immutable, const, assignfields, and mutable.)

Conclusion

There is some advantage to introducing yet another type modifier, assignfields, to smooth out the process of constructing immutable objects. On the other hand, is it worth the additional complexity? How often does one construct non-trivial immutable objects? If it’s not a common programming pattern then maybe we can live with the current restrictions and some code duplication. We still have very little experience in using immutable in D, so we might have to wait and see.

Recent news: A video of my talk to the Vancouver C++ Users group was just posted. The black shadow in front of the bright screen talking about memory fences is yours truly.



Why const?

With judicious use of const you can catch many bugs and improve code documentation and maintenance. For me the classic example is reading from and writing to files. Look at these two interfaces in stylized C++:

bool Read(File & file, char * buffer, unsigned & size);
bool Write(File & file, char const * buffer, unsigned const & size);

The first one tells me that the buffer may be modified by the function. In fact, the Read function will most likely write characters into it. It is also allowed to modify the parameter size, putting the number of characters read into it. The incorrect call

Read(file, "abc", 3);

will not compile.

The second declaration promises that the buffer will not be modified and neither will size. So the following (correct) call will compile:

Write(file, "abc", 3);

What’s more, when compiling the implementation of Write, the compiler will ensure that the arguments are not modified (unless the programmer subverts the type system using const_cast).

What does const guarantee? Not much!

This optimistic picture is marred by a few gotchas. You’d think that if you call a function taking a const reference, the object you are passing won’t change (assume there’s no use of const_cast or mutable members). Think again!

  1. In a multithreaded program, another thread might change the object concurrently. You may avoid this problem by preventing concurrent access using critical sections.
  2. But even in a single-threaded program there is no guarantee the const object won’t change. That’s because there might be a non-const alias to the argument you’re passing. This alias might be accessible to the function either through a global variable, or through another argument. The classic example is copying elements between overlapping ranges. The source for the copy is passed as const, the target as non-const, but if they point at overlapping regions, the source may end up modified (see the sketch after this list).

    There is a stronger version of constness that might guarantee immutability. The qualifier is usually called immutable and it’s available in the D programming language and in some dialects of Java. I’ll come back to it later.

  3. On top of those gotchas, even in a single-threaded C++ program with no aliasing of arguments, a const function argument might get mutated. That’s because constness is shallow–it doesn’t protect indirect data members (pointers or references). See next section.
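Here is point 2 in code form (a sketch with a hypothetical copyInts helper):

void copyInts(int * dst, int const * src, unsigned count)
{
    for (unsigned i = 0; i != count; ++i)
        dst[i] = src[i];
}

void demo()
{
    int buf[4] = { 1, 2, 3, 4 };
    // source passed as const, target as non-const, but the ranges overlap
    copyInts(buf + 1, buf, 3);
    // buf is now { 1, 1, 1, 1 }: the "const" source got modified through dst
}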

Transitivity or “deep” const

Consider a list that is defined recursively:

class List {
public:
  List * GetNext() const { return _next; }
  void SetNext(List * next) { _next = next; }
private:
  List * _next;
};
void clip(List const * list) {
  List * next = list->GetNext();
  if (next)
    next->SetNext(0);
}

Function clip takes a const list and blatantly modifies it. That’s because the indirect member, _next, of a const List is not const. C++ constness is not transitive! This fact is reflected in GetNext being a const method and returning a non-const pointer to _next. So even if this is const inside GetNext, indirect parts of this are open to modification.

I always found the non-transitive definition of const counter-intuitive, and made sure that in my code constness was always transitive. For instance, I would implement GetNext to return a const pointer:

List const * GetNext() const { return _next; }

and maybe provide another non-const method that returns a non-const pointer. Notice the need for code duplication in this approach.
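The duplicated pair of overloads might look like this (a sketch):

List const * GetNext() const { return _next; }
List * GetNext() { return _next; } // same body, repeated for non-const use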

The depths of immutability

I have an even larger problem with non-transitive immutability (as opposed to constness). I expect a const method of an immutable object to leave the object unchanged. In particular, I expect such a method to behave like a pure function (not to be confused with a pure virtual function). A pure function returns the same value every time it’s called with the same arguments. I expect the length of an immutable string to never change. Every call to size should return the same value. I would even expect the compiler to eliminate redundant calls to size in such a case. Here’s a hypothetical example:

immutable string str("Hello!");
doSomething(str.data(), str.size());
doSomethingElse(str.data(), str.size());

A smart compiler could cache the results of data and size from the first set of calls and reuse them when calling doSomethingElse. However, if immutable is not transitive, such an optimization is incorrect. Just imagine a bizarre implementation of string that stores the length indirectly:

class string {
public:
  string(char const * src) {
    _length = new unsigned;
    *_length = strlen(src);
    ...
  }
  unsigned size() const {
    // legal under shallow const: _length itself isn't modified,
    // only the memory it points to
    ++*_length;
    return *_length;
  }
private:
  unsigned * _length;
};

The D-language model

Taking all the above arguments into account, we decided to implement both const and immutable as transitive type modifiers in the D programming language. This enabled a lot of interesting compiler optimizations.

Making const and immutable transitive is essential for D’s support for functional-style programming. It’s important that const methods of an immutable object be pure functions. It’s also important to let the compiler optimize functional-style code, which otherwise tends to be less efficient than its imperative-style equivalent.

Functional programming is one of the safest paradigms for concurrency. Pure functions and immutable objects don’t require locking.

The price of transitivity

Having said that, transitivity of const imposes some constraints on the programming style. It enforces the view that anything reachable from an object is part of the object. When you are given a const linked list, all links in it must necessarily be const. That makes logical sense–the links are part of the list.

The picture gets muddier when the object contains references to some external services. Just by calling them “external,” I’m admitting that they are not part of the object.

Can an object that contains a Windows handle be const? Is it enough that it doesn’t modify the handle itself? Or shouldn’t it also refrain from modifying the window or the file behind the handle? Handles are often implemented as opaque pointers so, strictly speaking, constness should propagate through them. But how do I know which APIs don’t mutate the state of Windows and which do? This is an extreme example, but there are many in-between cases.

In an editor you might have an undo stack that stores editing commands. When you call the undo method of the command object, you may pass it a reference to the document object on which it is to operate. In that case undo may be declared as const. But if you store a reference to the document within each command, the undo method can no longer be const–it modifies the document. The classic Command Pattern uses the latter approach.

It hasn’t been decided yet if and how D will support an escape mechanism from const transitivity (and constness in general). Following C++, D could use the mutable type qualifier for that purpose. This would also solve the problem of memoizing and caching inside a const object.

In the next installment, I’ll describe more complexities of immutability and discuss a dialect of Java that supports immutability constraints.



In my last post I made the mistake of publishing a piece of code that used C++ weak atomics–which are part of the new C++ Memory Model. I qualified the post by saying that I had no proof of correctness, and that the main purpose of it was to discourage the use of weak atomics. In a way the post was successful, because it did prove that weak atomics are very tricky indeed. As was pointed out to me by Anthony and Dmitriy, my implementation was incorrect (I modified the post retroactively to make this point clear). Additionally, in his blog, Anthony published a proof of the algorithm posted by Dmitriy. And that’s when the fun started…

Before I describe what happened, let me give you a bird’s eye view of the C++ Memory Model.

The Model

The purpose of a memory model is to define the semantics of concurrent operations at an abstract level–independent of the platform on which the program is running. Essentially, one has to define in what order program statements may be executed, and what memory values may be observed. The basic memory model, the one used by Java, is based on sequential consistency. Program statements are executed in program order on each processor, and the actions of multiple threads are interleaved in some (global) order. A memory load can only see the value that was written last, according to this global order.

The C++ strong atomics (the default) follow this model.

It’s the weak atomics that are trouble. Weak atomics let the programmer specify memory ordering of loads and stores that is less than sequentially consistent. Unlike the sequentially consistent model, the model for weak atomics cannot be described in terms of a state machine (this was pointed out and criticized by Herb Sutter in his Prism proposal). Instead the model is formulated as a set of global constraints imposed on executions.

In a state-machine formulation, a step the program takes is based on previous history. The step doesn’t have to be uniquely defined–the machine may be non-deterministic. The execution may split into multiple branches. What is important, though, is that the step taken does not depend on the future of the execution.

This is not true in the weak-atomics model. The way to think about it is that, in a non-sequentially-consistent execution, the values seen by memory loads may come either from the past or from the future. A processor may speculate–make a guess at the values that have not been stored yet. A speculation may be discarded if the guess turns out to be wrong, or accepted if it was right. So the acceptability of a given execution can only be determined by looking at the whole execution, from start to end. Understandably, this makes any correctness proofs extremely tricky.

The Proof

The proposed implementation of the Peterson lock using weak atomics can be reduced to the following:

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel);
r1 = oneWants.load(memory_order_acquire);

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel);
r3 = zeroWants.load(memory_order_acquire);

(The memory_order_acquire directives in the two loads are needed for synchronization with the unlock part of the algorithm, which is not included here.)

The hardest part of the proof is to show mutual exclusion–it should be impossible for both threads to enter the critical section at the same time. In Peterson’s algorithm, the entrance to the critical section is predicated on the values loaded into r1 and r3. If both values are false, the algorithm is broken.

Let’s consider the situation when both threads try to acquire the lock for the first time. The values of zeroWants and oneWants are initially false, and the value of victim is zero. Is it possible for both loads (zeroWants and oneWants) to see the initial values, rather than the ones saved by the opposing thread? The short answer is yes, as long as there is no intervening inter-thread synchronization.

In this case, the only synchronization that may happen is between a store-release in one thread and the corresponding load-acquire in the other. For instance, if the load part of the exchange operation on victim in thread 0 sees the value written by the store part of the exchange in thread 1 (that is, r0 is equal to 1), then these two operations are in a “synchronizes-with” relationship. We can then establish the following chronology:

  1. The store of true to oneWants in T1 happens before the store of 1 to victim (part of the exchange) in T1. This is guaranteed by program order.
  2. The store of 1 into victim in T1 happens before the load of victim in T0 (part of exchange). This is because r0 is 1 in this part of the proof.
  3. The load of victim in T0 happens before the load of oneWants in T0. This is guaranteed by program order.

Because the happens-before relation is transitive, we can conclude that the load of oneWants in T0 (item 3) must see the value stored in T1 (item 1), which is true, and therefore T0 won’t enter the critical section. So far so good.

That was the simple case. Here’s the controversial one: Both r0 and r2 are zero. In particular, it means that r0 came from the initial value of victim. What about r2? It could have come either from the exchange in T0 or from the initial value of victim (both are zero!). According to the memory model, we have to consider both cases!

If r2 came from the exchange in T0 then we can use the same reasoning as before to conclude that T1 cannot enter the critical section.

But if the source of the zero in r2 is the original value of victim then there is no synchronization between T0 and T1. Both threads can enter the critical section!

You might be thinking: Either the exchange in T0 comes first and the exchange in T1 must see its value, or the exchange in T1 comes first and the one in T0 must see its value–they can’t both see the original value. This just shows you how ingrained sequentially consistent thinking is. There is no meaning to “comes first” in the world of weak atomics. Two events may only be sequenced if they occur in the same thread, or if there is a synchronization path between them. Here we are still at the stage of trying to establish whether the two exchanges are synchronized or not.

The Controversy

Let me spell it out: Both exchanges see the initial value of victim. Since they didn’t interact, there is no synchronizes-with relationship between them. Therefore we can’t establish any inter-thread chronology, and it’s perfectly possible for both threads to see the original values of zeroWants and oneWants.

// Thread 0
zeroWants.store(true, memory_order_relaxed);
r0 = victim.exchange(0, memory_order_acq_rel); // sees initial value, 0
r1 = oneWants.load(memory_order_acquire); // sees initial value, false

// Thread 1
oneWants.store(true, memory_order_relaxed);
r2 = victim.exchange(1, memory_order_acq_rel); // sees initial value, 0
r3 = zeroWants.load(memory_order_acquire); // sees initial value, false

I had concluded that the implementation was broken, but I let Anthony defend his proof.

Here’s the argument he presented, based on the modification order of victim. The C++ memory model enforces global modification order for every memory location. The victim variable starts at zero and then is modified twice by the exchanges in T0 and T1. If thread T1 sees the original value, zero, of victim then the same exchange modifies it to one. The modification order is therefore (T1: 1, T0: 0). If we assume that an atomic exchange must always see the preceding value in the modification order, then we must conclude that the exchange in T0 must read 1. Therefore it synchronizes with the exchange in T1 and mutual exclusion is assured.

And here’s the clincher: The draft C++ Standard did not specify what value must be seen by the atomic exchange (or, more generally, by an atomic Read-Modify-Write operation). I confirmed that with Hans Boehm, who concluded that this apparent hole in the standard must be corrected.

The next draft of the Standard will say that a RMW operation must see the last value in the modification order of the memory object in question.

Only then will the proof of correctness of the weak-atomics implementation of the Peterson lock be complete.

Conclusion

I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!

Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to top experts whose numbers can be counted on the fingers of one hand.

I’m grateful to (in order of appearance) Anthony Williams, Dmitriy V’jukov, Hans Boehm, and others who contributed to this post.


The question that’s been bugging me lately was: How does C++ make the use of atomic variables both portable and efficient?

I knew how Java volatile worked–it enforced sequential consistency, which is not always the most efficient thing to do.

C++0x has atomic variables which also enforce sequential consistency when used in the default mode. Without special ordering annotations they work almost like Java’s volatile (interestingly, Java’s volatile doesn’t enforce atomicity–there is an atomic library in Java, though, which does).

However, C++ offers various degrees of relaxation of sequential consistency which, if used correctly, should produce more efficient code.

After studying a bit of the x86 memory model, I realized that some basic lock-free patterns (like the one in double-checked locking) will work without any fences. There should be a way of coding them in C++ such that, when compiled for the x86, no fences are produced. On the other hand, the same C++ code, when compiled for, let’s say, the alpha or the PowerPC, should produce the necessary fencing.

To make things more interesting, there are some other algorithms, like the Peterson lock, which do require memory fences on the x86 (see my previous post). So it’s not just the matter of skipping all fencing.

I reduced my question to: How does one write portable code in C++ that results in just the right amount of fencing on the x86?

Specifying memory ordering in C++

The C++0x atomics library proposal underwent a lot of changes before it settled into its current shape. There is now a complete C++0x draft. It’s not an easy read so I’m grateful to Hans Boehm for clarifying some of the fine points.

Without further ado, this is how the publication safety pattern (the mother of all double-checked locking patterns) could be written in C++:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1);
  ready.store(true);

Thread 1:

  if (ready.load())
    assert (data.load() == 1);

As you can see, atomic objects have methods store and load that are used for writing and reading underlying shared data. By default, these methods enforce sequential consistency. What it means is that this code has the same semantics as if it were written in Java using volatile variables. Consequently, on the x86, it will generate memory fences after each store.

But we know that these fences are not necessary for publication safety. How can we write code that would produce minimal fencing on the x86?

To make such fine tuning possible, the C++0x atomics library allows the programmer to specify ordering requirements for each load and store. I’ll explain various ordering options in a moment; for now let’s have a look at the optimized version of the publication pattern:

  atomic<bool> ready = false;
  atomic<int> data = 0;

Thread 0:

  data.store(1, memory_order_release);
  ready.store(true, memory_order_release);

Thread 1:

  if (ready.load(memory_order_acquire))
    assert (data.load(memory_order_acquire) == 1);

What’s important is that this code will work correctly on any major processor, but it will produce no fences on the x86.

Warning: Even if you know that your program will only ever run on an x86, you can’t remove the atomics and the ordering constraints from your code. They are still necessary to prevent the compiler from doing the reordering.

Fine-tuning memory ordering

Memory order can be specified using the following enumeration:

namespace std {
 typedef enum memory_order {
   memory_order_relaxed,
   memory_order_consume,
   memory_order_acquire,
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
 } memory_order; 
}

The default for all operations on atomic variables is memory_order_seq_cst which, just like Java’s volatile, enforces sequential consistency. Other orderings are used to relax sequential consistency and often squeeze better performance from lock-free algorithms.

  • memory_order_acquire: subsequent memory operations may not be moved before the current load.
  • memory_order_release: preceding memory operations may not be moved past the current store.
  • memory_order_acq_rel: combines the two previous guarantees.
  • memory_order_consume: a potentially weaker form of memory_order_acquire that enforces ordering only with respect to operations that are data-dependent on the current load (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it–yes, without an annotation even that is not guaranteed on all platforms!). See the sketch after this list.
  • memory_order_relaxed: all reorderings are okay.
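Here’s a sketch of the consume case (Node and use are hypothetical):

struct Node { int value; };
std::atomic<Node*> head;
void use(int);

void reader()
{
    Node* p = head.load(std::memory_order_consume);
    if (p)
        use(p->value); // data-dependent on p: ordered after the load
}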

As I discussed before, the x86 guarantees acquire semantics for loads and release semantics for stores, so load(memory_order_acquire) and store(x, memory_order_release) produce no fences. So, indeed, our optimized version of the publication pattern will produce optimal assembly on the x86 (and I presume on other processors too).

But I have also shown before that the Peterson’s algorithm won’t work on the x86 without fencing. So it’s not enough to use just memory_order_acquire and memory_order_release to implement it. In fact, the problem with the Peterson lock was the possibility of reordering loads with stores. The weakest constraint that prevents such reordering is memory_order_acq_rel.

Now here the funny story begins. Without much thinking, I decided that just slapping memory_order_acq_rel on any of the writes would fix the problem. Here’s what I originally wrote:

[begin quote] The correct C++ code in this case is:

Peterson::Peterson() {
   _victim.store(0, memory_order_release);
   _interested[0].store(false, memory_order_release);
   _interested[1].store(false, memory_order_release);
}
void Peterson::lock() {
   int me = threadID; // either 0 or 1
   int he = 1 - me; // the other thread
   _interested[me].exchange(true, memory_order_acq_rel);
   _victim.store(me, memory_order_release);
   while (_interested[he].load(memory_order_acquire)
       && _victim.load(memory_order_acquire) == me)
      continue; // spin
}

The ordering memory_order_acq_rel produces an mfence instruction on the x86. This fence will prevent the movement of the load of _interested[he] before the store to _interested[me], which could otherwise happen (and break the algorithm).[end quote]

You can read the comments to this post–especially the ones by Anthony and Dmitriy, who made me eat crow. In short, the point is that the “acquire” part of memory_order_acq_rel has no meaning unless another thread writes to (“releases”) the same variable. Since only thread 0 writes to _interested[0] and only thread 1 writes to _interested[1], this synchronization accomplished nothing (outside of the x86). Dmitriy’s implementation, which synchronizes on _victim, is correct (but I had to ask many questions before understanding Anthony’s proof of it).

Conclusion

What strikes me about this example is how non-trivial the reasoning behind this implementation was. I had to actually go through the x86 assembly to realize that I needed those and not some other ordering constraints (and I still got it wrong!). In comparison to that, the old fashioned assembly programming looks like a piece of cake.

Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude.

Microsoft volatile

No, I’m not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true because some compiler vendors took the liberty to add non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). This new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead it guaranteed acquire and release semantics (which, on the x86, didn’t result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won’t! I thought that was an interesting piece of trivia.


How does Java do it? Motivation for C++ programmers

Things like double-checked locking pattern and Peterson lock work out of the box in Java, as long as you declare all shared variables volatile. (Do not confuse Java volatile with C++ volatile–the latter has nothing to do with multithreading.) Obviously the Java compiler knows what kind of fences are necessary on relaxed memory architectures, including the x86. I thought it was worth looking at.

Why the sudden interest in Java? Because the Java memory model is very relevant to C++0x. The keyword here is sequential consistency. Java enforces sequential consistency on all accesses to volatile variables. C++0x introduces atomic objects which, by default, also follow sequential consistency. So C++ atomics will, by default, behave almost exactly like Java volatile variables.

Also, why even bother studying the quirks of the x86? Can’t we just stick to locks and, occasionally, to default atomics and let the compiler/library writers worry about how they translate into fences? Absolutely!

This should be the end of the story, except that a lot of programmers seem to “know better.” They will try to optimize multithreaded algorithms and get into a world of trouble. So if you know anybody who might be tempted to write or optimize lock-free algorithms, read on.

Why the x86? Not only because it’s the most common chip, but also because its memory model is deceptively “normal.” It’s much more likely for programmers to attempt to play games with the x86 than, for instance, with the alpha. The rest of this post should serve as motivation to stay away from such games.

What is Sequential Consistency?

Sequential consistency is how we naturally think about multithreaded programs. It’s also how we think about the world. If A happens before B then it cannot be true that B happens before A. If one processor stores 1 in variable x, and another processor stores 1 in variable y, then either the sequence of events is:

  • x flips to 1 and then y flips to 1, or
  • y flips to 1 and then x flips to 1

(assume both are initially zero). Which sequence actually happened could be observed by a third processor, which loads both variables. If, for instance, it sees x == 1 and y == 0, it concludes that the write to x happened earlier. If it sees x == 0 and y == 1, it concludes that the write to y happened earlier (if it sees x == y, it can’t tell).

Now, in a sequentially consistent world, a fourth processor could not see the two events in a different order than what the third processor observed. We assume that, for each particular run of the program, there is a global sequence of events that is seen by all processors.

Of course, on multicore processors many things can happen simultaneously, and it usually doesn’t matter–except when memory access is involved. The sequentially consistent model assumes that there is a single switch between all processors and memory, and only one processor at a time can access it. This imaginary switch serves as the serialization point.

The x86 is not sequentially consistent

Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.

The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive–hundreds of cycles.

I don’t care, says Java

The Java memory model requires that all access to shared volatile variables be sequentially consistent across all threads, even when they run on different cores. How do they enforce it?

The primary source for this kind of information is Doug Lea’s excellent The JSR-133 Cookbook for Compiler Writers.

Here’s the big picture: When Java is translated into bytecode, special codes are issued around volatile-variable access. These codes tell the Java runtime where memory fences should be inserted. The bytecodes are supposed to be processor independent; the fences are highly processor dependent. I’ll call the special bytecodes memory barriers as opposed to processor-specific memory fences. Memory barriers are inserted conservatively–the compiler assumes the most relaxed memory model (in other words, the bytecode has to run correctly on an alpha, whose memory model is so relaxed you’d think it was conceived in a hot tub).

Fences should go between memory accesses, so Java barriers are named after the kind of accesses (load or store) they separate. There is a LoadLoad barrier between volatile reads, LoadStore barrier between a read and a write, StoreLoad barrier between a write and a read, and StoreStore barrier between writes.

Here’s the first problem: the compiler can often only see one part of the pair. In that case it has to assume the worst and use the most restrictive barrier. Considering that, here are some of the cookbook rules:

  • Issue a StoreStore barrier before each volatile store.
  • Issue a StoreLoad barrier after each volatile store.
  • Issue LoadLoad and LoadStore barriers after each volatile load.

That’s a lot of barriers! Fortunately, the compiler can optimize some of them away using dataflow analysis.

The next step is to run the bytecode on a particular processor (or compile it to native code). If the processor is an alpha, practically all Java barriers translate into some kind of fences. But if it’s an x86, all but the StoreLoad barriers are turned into no-ops. The StoreLoad barrier is either translated into an mfence or a locked instruction (lock xchg).

When executed on an x86, the barrier rules boil down to this one: Issue an mfence after each volatile store (or turn it into lock xchg).
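In C++0x terms, Java’s volatile store corresponds to a sequentially consistent atomic store. A sketch (the choice between mfence and lock xchg is up to the compiler):

#include <atomic>

std::atomic<int> flag(0);

void volatileStyleStore()
{
    // on the x86 this compiles to mov [flag],1 followed by mfence
    // (or to a single lock xchg)
    flag.store(1, std::memory_order_seq_cst);
}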

Peterson lock on an x86 in Java

Figuring the details of the Peterson lock in Java was not easy. I’m grateful to Doug Lea for patiently explaining to me some of the gotchas. I am solely responsible for any mistakes though.

In my previous post I came to the conclusion that for Peterson lock to work on an x86, an mfence is needed. Here’s the relevant pseudo-code:

Thread 0:
store(zeroWants, 1)
mfence
store(victim, 0)
// Java puts another fence here
r0 = load(oneWants)
r1 = load(victim)

Thread 1:
store(oneWants, 1)
mfence
store(victim, 1)
// Java puts another fence here
r0 = load(zeroWants)
r1 = load(victim)

In Java, all three variables, zeroWants, oneWants, and victim, would be declared volatile. Using the x86 Java translation rules, that would mean putting an mfence after every store to these variables.

The fences after the writes to zeroWants and oneWants are just what the doctor ordered. From my previous post we know why they are needed on an x86.

The mfence after the write to victim is something new though. It isn’t strictly necessary on an x86, but to see that you really have to analyze the whole algorithm–something that’s beyond the capabilities of modern compilers.

My original (turns out, incorrect) reasoning was that no fence is needed between the write to victim and the subsequent read because, on an x86, reads and writes to the same location are never reordered. Well, that’s not the whole story because…

Intra-processor forwarding is allowed

Consider the following example from the x86 spec. Initially x == y == 0.

Processor 0:
mov [_x], 1
mov r1, [_x]
mov r2, [_y]

Processor 1:
mov [_y], 1
mov r3, [_y]
mov r4, [_x]

The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what–sometimes it doesn’t.

The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though they see their own writes immediately.

This result violates sequential consistency and therefore cannot be tolerated in Java. Hence the need for the mfence between the write to victim and the subsequent read of victim.

Gosh darn it! I want to optimize it!

Having said that, I must point out that, in this particular case, the lack of sequential consistency is not a deal breaker. It’s okay for thread 0 of the Peterson lock to read a stale (local) value of victim, even if “in the meanwhile” thread 1 overwrote it. All it means is that some unnecessary spins might be performed. This doesn’t violate the lock principles and doesn’t lead to a deadlock, as long as the new value of victim eventually reaches thread 0.

I’m pretty sure of this reasoning–at least at the time of this writing. However, I wouldn’t be totally surprised if somebody found a fault in it.

The bottom line is that even such an “obvious” optimization is not obviously correct. The proof of correctness of the Peterson lock is based on the assumption of sequential consistency. If we relax this assumption even slightly, we have to redo the whole proof.

Any volunteers?


Multiprocessors with relaxed memory models can be very confusing. Writes can be seen out of order, reads can be speculative and return values from the future–what a mess! In order to impose some kind of consistency you have to use memory fences, and there are several kinds of them.

The x86 seems to be an oasis in the perilous landscape of relaxed-memory multicores. The Intel x86 memory model, detailed in the Intel 64 Architecture Memory Ordering White Paper and in the AMD spec, AMD64 Architecture Programmer’s Manual, lists a lot of memory ordering guarantees, among them:

  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • Stores are not reordered with older loads.
  • In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  • In a multiprocessor system, stores to the same location have a total order.
  • In a multiprocessor system, locked instructions have a total order.
  • Loads and stores are not reordered with locked instructions.

The x86 also has (expensive–on the order of 100 cycles) memory fence instructions, mfence, lfence, and sfence; but, considering all those guarantees, why would anyone use them? The famous double-checked locking pattern, on the x86, works just fine without any fences.

This is a very important question, because you don’t want the compiler to over-fence your code and bring it to a crawl. On the other hand, you don’t want it to emit incorrect code. I decided to look for some answers.

There is one important non-guarantee listed in the x86 spec:

Loads may be reordered with older stores to different locations

Here’s an example from the x86 spec: x and y are in shared memory and they are both initialized to zero, r1 and r2 are processor registers.

Thread 0:
mov [_x], 1
mov r1, [_y]

Thread 1:
mov [_y], 1
mov r2, [_x]

When the two threads execute on different cores, the non-intuitive result r1 == 0 and r2 == 0 is allowed. Notice that such a result is consistent with the processor executing the loads before the stores (which access different locations).

How important is this example?

I needed to find an algorithm that would be broken by this relaxation. I’ve seen Dekker’s algorithm mentioned in this context. There is a more modern version of it used in the Peterson lock. It’s a mutual exclusion algorithm that works for two threads only. It assumes that each thread has access to a thread-local ID, which is 0 for one thread and 1 for the other. Here’s the version adapted from The Art of Multiprocessor Programming:

class Peterson
{
private:
    // indexed by thread ID, 0 or 1
    bool _interested[2];
    // who's yielding priority?
    int _victim;
public:
    Peterson()
    {
        _victim = 0;
        _interested[0] = false;
        _interested[1] = false;
    }
    void lock()
    {
        // threadID is thread local,
        // initialized either to zero or one
        int me = threadID;
        int he = 1 - me;
        _interested[me] = true;
        _victim = me;
        while (_interested[he] && _victim == me)
            continue;
    }
    void unlock()
    {
        int me = threadID;
        _interested[me] = false;
    }
};

To explain how it works, let me impersonate one of the threads. When I (the thread) want to take a lock, I set my _interested slot to true. I am the only writer of this slot, although the other guy can read it. I also toggle the _victim switch to point to myself. Now I check if the other guy is also interested. As long as he is, and I am the victim, I spin. But once he becomes uninterested or turns himself into a victim, the lock is mine. When I’m done executing critical code, I reset my _interested slot, thus potentially releasing the other guy.

Let me simplify this a little, and partition the code between the two threads. Instead of an array _interested, there are two variables zeroWants and oneWants corresponding to the two slots.

// initially:
zeroWants = false;
oneWants = false;
victim = 0;

Thread 0:
zeroWants = true;
victim = 0;
while (oneWants && victim == 0)
  continue;
// critical code
zeroWants = false;

Thread 1:
oneWants = true;
victim = 1;
while (zeroWants && victim == 1)
  continue;
// critical code
oneWants = false;

Finally, let me rewrite the initial part of the execution in pseudo assembly.

Thread 0:
store(zeroWants, 1)
store(victim, 0)
r0 = load(oneWants)
r1 = load(victim)

Thread 1:
store(oneWants, 1)
store(victim, 1)
r0 = load(zeroWants)
r1 = load(victim)

Now look at the loads and stores to zeroWants and oneWants. They follow the same pattern as in the x86 reordering example. The processor is free to move the read of oneWants before the write to zeroWants (and to victim). Similarly, it can move the read of zeroWants before the write to oneWants. We may end up with the following execution:

Thread 0:
r0 = load(oneWants)
store(zeroWants, 1)
store(victim, 0)
r1 = load(victim)

Thread 1:
r0 = load(zeroWants)
store(oneWants, 1)
store(victim, 1)
r1 = load(victim)

Originally, both zeroWants and oneWants are initialized to zero, so the load into r0 may return zero in both threads. In that case, the spin loop is never executed and both threads march into the critical section. The Peterson lock on the x86 is broken!

Of course, there is a way to fix it. An mfence will force the correct ordering. It can be put anywhere between the store of zeroWants and the load of oneWants in one thread, and between the store of oneWants and the load of zeroWants in the other thread.
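In C++0x terms, one way to sketch the fix is a full fence between the store and the load (atomic_thread_fence with memory_order_seq_cst produces an mfence on the x86). This is a sketch of the fence placement only, not a verified portable implementation:

#include <atomic>

std::atomic<bool> zeroWants(false), oneWants(false);
std::atomic<int> victim(0);

void lockThread0() // thread 1 is symmetric
{
    zeroWants.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // the mfence
    victim.store(0, std::memory_order_relaxed);
    while (oneWants.load(std::memory_order_relaxed)
        && victim.load(std::memory_order_relaxed) == 0)
        continue; // spin
}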

So here we have a real-life example that requires an actual fence on an x86. The problem is real!


In the previous two installments I tried to explain the origins of rvalue references in C++. If you were left with the impression that each design step was a reaction to some previously unanticipated problem, you’re right. As it often happens in C++, the design was reactive rather than proactive.

Passing by value

If you think about it long enough, you realize that the process that led to rvalue references started with the way C++ implemented passing by value. In C, passing by value was simple–a shallow copy of data was made. “Shallow” means, no pointers were followed in the process of copying. C++ decided to provide hooks for the programmer to override this behavior. If you define a copy constructor it will be used instead of shallow copy. A user-defined copy constructor may for instance follow pointers or references and clone the pointed-to objects, thus implementing “deep copy”. It may also do bookkeeping–count references, etc.
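For illustration, here’s a sketch of a hypothetical class whose copy constructor implements such a “deep copy”:

#include <cstring>

class Buffer
{
public:
    explicit Buffer(unsigned size)
        : _size(size), _data(new char[size]) {}
    Buffer(Buffer const & other) // deep copy: clone the pointed-to array
        : _size(other._size), _data(new char[other._size])
    {
        std::memcpy(_data, other._data, _size);
    }
    ~Buffer() { delete [] _data; }
private:
    unsigned _size;
    char * _data;
};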

Whenever the compiler needs to copy an object–and that may happen during return by value or pass by value–it inserts the appropriate call to the copy constructor (if one is not provided by the programmer, the default shallow copy is used).

This is a very general mechanism, which however leads to some awkwardness and inefficiencies. For instance, the writer of a copy constructor must remember to copy all of the shallow data members–something that the default shallow copy does automatically. The inefficiencies show up when copying rvalues.

Return value optimizations

Let’s revisit the example from my previous post:

 auto_ptr<Foo> factory() {
    return auto_ptr<Foo>(new Foo);
 }
 // client code
 auto_ptr<Foo> ap = factory();

Here, a temporary auto_ptr is created inside the function, copied using a copy constructor into the caller’s auto_ptr, and then immediately destroyed.

This is so obviously inefficient that the compiler will often optimize it away by constructing the return auto_ptr directly in the space allocated by the caller (a pointer to which is passed as a hidden argument to the function). This is called return value optimization (RVO) and it eliminates the call to the copy constructor and to the destructor.

But what about more general cases, like this one:

 auto_ptr<Foo> factory() {
    auto_ptr<Foo> src;
    ...
    src.reset(new Foo);
    ...
    return src;
 }
 // client code
 auto_ptr<Foo> ap = factory();

Some compilers will attempt to optimize this code as well, using the named return value optimization (NRVO).

The C++ Standard allows such optimizations, which elide the calls to a copy constructor and a destructor, but it doesn’t mandate them. Moreover, NRVO cannot be applied to functions that have multiple return statements, with different objects being returned from each.

But notice an interesting thing: If (N)RVO were mandated by the language and applicable in every case, we could almost implement auto_ptr without the rvalue reference copy constructor. We wouldn’t need to reach back to the source auto_ptr from the copy constructor to turn off its destruction. The destructor (together with the copy constructor) would be guaranteed to be elided.

In fact, if we don’t insist on using the caller’s space for creating the return value, we could cover the case of multiple return statements. And how would we transfer the bits between the two spaces? We’d do a shallow, bitwise copy!

An alternative return-by-value scheme

Here’s the scheme that was proposed for the D programming language:

  1. The compiler adds a hidden parameter to every function that returns a non-trivial data structure by value. This parameter is the address of the stack space allocated by the caller for the return value. (This is temporary space for the rvalue returned by the function.) This part is in common with the C++ implementation.
  2. The function allocates its local variables on the stack, as usual, including the ones that are (or might be–in case of multiple return statements) returned. RVO may be used to eliminate some of these allocations. This is still in common with C++.
  3. When executing the return statement, the value to be returned is bit-copied into the caller’s space.
  4. No copy constructor is called. No destructor is called for the variable that’s been returned.
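
To make steps 1 and 3 concrete, here’s a rough C++-flavored sketch of the lowering (factory_lowered and the struct S are invented for illustration; a real compiler does this in generated code, not in source):

 #include <cstring>

 struct S { int a, b; };

 // Source-level signature:   S factory();
 // Conceptual lowering: the caller's return slot becomes a hidden parameter.
 void factory_lowered(S* ret) {
     S src = {1, 2};                    // local variable on the callee's stack
     // ...
     std::memcpy(ret, &src, sizeof(S)); // step 3: bitwise copy into caller's space
     // step 4: no copy constructor runs, and no destructor runs for src
 }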

Look ma! No copy constructor!

What is amazing about this scheme is that it may be applied to all situations where the source of the copy is an rvalue. This is also true when passing an rvalue to a function, as in this example:

  struct S { S(int a, int b); };
  void f(S s); // by value
  // caller's code
  f(S(1, 2)); // passing an rvalue

Suddenly there is no need whatsoever for an rvalue-taking copy constructor because it would never be called. Since an rvalue-taking copy constructor was the main motivation for introducing rvalue references to C++, in D we can get away without them.

It takes some getting used to thinking of passing rvalues without calling a copy constructor. Notice however that, even in C++, there is no guarantee that the copy constructor will be called (remember RVO?).

But what about non-trivial copy constructors, like the one in ref_ptr (a reference-counting smart pointer)? It must be called to keep the reference count accurate, right? Not when the source is an rvalue! Remember that we are eliding not only the copy constructor (which increments the reference count) but also the destructor of the source (which decrements the reference count). The net result is no change to the reference count.

This is a very important optimization, especially in multithreaded code, where reference counting incurs expensive synchronization overhead.
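
Here’s a minimal sketch of why the count balances out (ref_ptr here is a simplified stand-in; a real implementation would be a template, likely with atomic counts):

 struct Counted {
     int refs; // assumed to start at 1 when the object is created
 };

 struct ref_ptr {
     Counted* p;
     explicit ref_ptr(Counted* q) : p(q) {}                    // takes ownership
     ref_ptr(const ref_ptr& other) : p(other.p) { ++p->refs; } // copy: +1
     ~ref_ptr() { if (--p->refs == 0) delete p; }              // destroy: -1
 };

 // When the source is an rvalue, D elides BOTH the copy constructor (+1)
 // and the source's destructor (-1): the net change to refs is zero.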

Location transparency

There is one gotcha in this scheme–we are assuming that a shallow copy will always result in a valid object. This is not true if the object contains a pointer to itself. A shallow copy will duplicate the pointer; the object itself will move to a new location, but the pointer will still point at the old location. This could be a disaster.

For this and several other reasons D assumes location transparency. An object cannot point at itself (or any of its shallow members), because then it couldn’t be transparently moved to a new location.
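
Here’s what a violation might look like (NotTransparent is a made-up struct):

 struct NotTransparent {
     char  buf[16];
     char* cur; // points INTO the object itself

     NotTransparent() : cur(buf) {}
 };

 // A bitwise copy duplicates cur verbatim, so the copy's cur still
 // points into the ORIGINAL object's buf, not into its own.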

Unfortunately this requirement cannot be enforced statically. A dynamic check during copying is possible, but it’s not clear that the overhead is always worth paying. This is still an open design area.

Notice that, in D, the scope of this problem is narrowed to structs and fixed arrays that contain pointers. There is no direct way to manipulate the location of class objects (they are always accessed indirectly)–except by the garbage collector. A generational garbage collector that moves objects in memory could definitely profit from location transparency.

The binding table

The binding table in D is slightly different than the one in C++:

  • references bind to lvalues (same as C++)
  • const references bind only to lvalues (different)
  • “values” (arguments passed by value) bind to
    • lvalues (same as C++)
    • rvalues (same as C++, except that no copy constructor is called)

Notice that in D we don’t want a const reference to bind to an rvalue. This is an inherently dangerous feature of C++. Consider the following C++ example:

  struct S {
    S(int a, int b);
    void access() const;
  };
  const S& f(const S& x) {
     return x;
  }
  // caller's code
  const S& y = f(S(1, 2));
  y.access(); // kaboom! the temporary died at the end of the previous statement

Instead, if you want to have a version of a function that binds to rvalues, you declare its arguments as passed by value, as in this D example:

  S f(S x) {
    return x;
  }
  // caller's code
  auto x = f(S(1, 2));
  x.access; // fine, it's a copy

In the worst case, the overhead of the D solution is a bitwise copy of a struct (which can often be optimized away) against the indirect access of the C++ solution. But considering that it buys us safety, it’s a trade-off well worth making.

I’d like to thank Andrei Alexandrescu for his comments and, in general, for driving the design of pass-by-value in D.


In my last post I discussed the relationship between references and lvalues/rvalues. It can be summarized using the following binding rules.

  • references bind to lvalues
  • const references bind to both lvalues and rvalues, but they don’t allow the modification of the source

It looks like all the bases are covered, except for references that could bind to rvalues and allow their modification. But who would want that? Why modify something that’s being discarded anyway?

The auto_ptr disaster

auto_ptr is an interesting beast. It’s a member of the family of smart pointers–objects that subvert value semantics to provide tighter control over reference (pointer) semantics. The need for such controls is related to resource management–making sure that program resources have well defined lifetimes.

One way to manage a resource is to ensure that it has a single owner throughout its lifetime. auto_ptr is such a designated owner.

On the surface, auto_ptr embodies value semantics. It is allocated on the stack or as a direct member of another data structure. It is passed by value to and from functions.

When auto_ptr is passed by value, the object it contains is not copied. In this respect auto_ptr behaves like a reference. Also, you can access auto_ptr as if it were a pointer to object using the overloaded member-access operator, ->.

Since auto_ptr is supposed to be the only owner of the resource (which is a pointer to object), when an auto_ptr goes out of scope it deletes the object.
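
In other words (assuming some class Foo with a method method):

 {
     auto_ptr<Foo> ap(new Foo);
     ap->method(); // member access through the overloaded ->
 } // ap goes out of scope here: its destructor deletes the Foo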

Let’s look in more detail at how auto_ptr is returned from a function. Consider this example:

 auto_ptr<Foo> create() {
   return auto_ptr<Foo>(new Foo);
 }
 // caller's code
 auto_ptr<Foo> ap = create();

Notice two things:

  • The original auto_ptr goes out of scope at the end of create(). If we don’t do something special, its destructor will be called and it will destroy the newly created Foo.
  • We are returning an rvalue. The auto_ptr is created on the spot inside create(). (In fact, every local variable turns into an rvalue right before being returned from a function.)

There is one way to prevent the Foo object from being deleted–nulling the pointer inside auto_ptr. We need a hook to do that when returning auto_ptr.

The returning of objects by value follows a specific C++ protocol. It involves copy construction and/or assignment. The caller’s instance of auto_ptr, ap, is copy-constructed from the callee’s instance.

The obvious way to prevent the resource from disappearing is to define an auto_ptr copy constructor that nulls the pointer inside its source.
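
Here’s a stripped-down sketch of that idea (auto_ptr_sketch is a simplified stand-in for the real class, ignoring the complications that follow):

 template <class T>
 class auto_ptr_sketch {
     T* p;
 public:
     explicit auto_ptr_sketch(T* q = 0) : p(q) {}
     // "Copy" constructor that transfers ownership: it nulls the source.
     // Note the non-const reference: this is exactly where the trouble starts.
     auto_ptr_sketch(auto_ptr_sketch& src) : p(src.p) { src.p = 0; }
     ~auto_ptr_sketch() { delete p; }
     T* operator->() const { return p; }
 };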

And here’s the catch:

  • If you define the copy constructor to take the source auto_ptr by const reference, you won’t be able to modify it.
  • If you define it to take a non-const reference, it won’t bind to an rvalue.

Finding themselves between a rock and a hard place, C++ designers came up with some ingenious hacks (the infamous auto_ptr_ref object). Unfortunately, some important functionality of auto_ptr had to be sacrificed in the process. In particular, you can’t return auto_ptr<Derived> as auto_ptr<Base>, which makes writing polymorphic factory functions hard.

Rvalue references

What we need is something that can bind to an rvalue and modify it too. And this is exactly what rvalue references do.

Since it was too late to fix auto_ptr, it joined the ranks of deprecated features. Its place was taken by unique_ptr. This new template has a “copy constructor” that takes another unique_ptr by rvalue reference (denoted by double ampersand).

 unique_ptr::unique_ptr(unique_ptr && src)

An rvalue reference can bind both to rvalues and lvalues and does not prohibit the modification of its source.

Move that looks like a copy

Another problem with ownership passing using auto_ptr is that it sometimes looks deceptively like copying. Consider this example:

 auto_ptr<Foo> pSrc(new Foo);
 auto_ptr<Foo> pDest = pSrc; // looks like a copy, but it nulls the source
 pSrc->method(); // runtime error, pSrc is now empty

With a little discipline, such errors could be avoided, but the problem gets virtually unmanageable in generic code. In particular, the use of auto_ptr in standard containers and algorithms could easily lead to serious trouble. Some algorithms work with temporary copies of elements, not suspecting that making such a “copy” might destroy the source.
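
For example, a perfectly innocent-looking piece of generic code (first_elem is made up for illustration) would silently empty the container:

 #include <iterator>

 template <class It>
 typename std::iterator_traits<It>::value_type first_elem(It it) {
     return *it; // with auto_ptr elements, this "copy" nulls *it in the container
 }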

Again, intricate hackery was employed to make sure that a container of auto_ptrs doesn’t compile.

Fine tuning

It seems like in some cases we want to enable implicit move semantics (returning auto_ptr from a function) while disabling it in others (the example above). How can we characterize those two cases? The essential feature is the lifetime of the source. If the source is no longer accessible after the move, we want the move to be implicit. But if the source remains accessible, we want to force the programmer to be more explicit. We want the programmer to say, “Hey, I’m transferring the ownership, so don’t try to access the source any more.”

This difference translates directly into rvalue/lvalue duality. Rvalues are the ones that disappear. Lvalues remain accessible.

But the unique_ptr “copy constructor”

 unique_ptr::unique_ptr(unique_ptr && src)

binds equally to rvalues and to lvalues. Fortunately, it’s possible to overload functions based on rvalue references vs. regular references. unique_ptr has a second “copy constructor”:

 unique_ptr::unique_ptr(unique_ptr & src)

which binds exclusively to lvalues. When the source is an lvalue, the compiler is forced to pick that overload. When the source is an rvalue, the rvalue reference overload is used. All that remains is to make the lvalue-binding constructor private (and forgo its implementation) to prevent implicit moves of lvalues.

So how do you explicitly move an lvalue unique_ptr when you really need it? You use a special template function, move, like this:

 unique_ptr<Foo> pSrc(new Foo);
 unique_ptr<Foo> pDest = move(pSrc);

Since the move is explicit, there is no chance of confusing it with a copy. Most algorithms in the Standard Library are being rewritten to use move whenever possible (more precisely, to use swap, which in turn uses move). The ones that require objects to be copyable will not compile for unique_ptrs.

By the way, all move does is to convert an lvalue into an rvalue.

 template <class T>
 typename remove_reference<T>::type&& move(T&& t) {
    // cast t (an lvalue inside the function body) back to an rvalue reference
    return static_cast<typename remove_reference<T>::type&&>(t);
 }

In the next installment, I’ll discuss a totally different solution adopted by D.


You might have heard of the latest invention in C++0x–rvalue references (see for instance, A Brief Introduction to Rvalue References). Depending on your point of view, rvalue references are the best thing since STL, or yet another kludge invented to fix accumulated language design errors in C++. This distinction is pretty important when you are trying to design a new language. Should the D language have rvalue references, or is there a better way? In order to figure that out, one has to understand what problem they solve and what previous language design decisions led to their emergence.

I’ll start with some pretty obvious things, so bear with me. A new perspective on old things often leads to better generalizations.

Lvalues vs rvalues

There are two ways of passing arguments to functions–by value or by reference. In C, passing by reference meant passing a pointer to data. If a function took a pointer, the client had to call it with a pointer–for instance, by taking the address of some data.

Not every piece of data has an address. The ones that do are called lvalues, the ones that don’t are called rvalues. Data has to sit in a named memory location in order to have an address. If it sits in a register, it’s not addressable in the usual sense (what’s a pointer to EAX?). Any data that, even theoretically, could be stored in registers without having a backup copy in main memory, is considered an rvalue. Examples of rvalues are data returned from functions (by value), or immediate results of arithmetic expressions.

In C’s pointer notation, it’s pretty obvious when you can and can’t take the address of data. Consider these examples:

void f(int * pi);
int g();

f(&g()); // error!
f(&(1 + 1)); // error!

Notice that small modifications will make this code compile:

int tmp1 = g();
f(&tmp1);
int tmp2 = 1 + 1;
f(&tmp2);

Why can’t the compiler create temporary variables for us and forgo error messages?

Not a good idea! The fact that f takes a pointer to integer rather than an integer usually means that it wants to modify the integer and make the modification visible to the caller. For instance:

void f(int * pi) { ++*pi; }

Calling such a function with the address of an ephemeral temporary variable that’s not addressable by the caller is usually a bug. We don’t want compilers to bury our bugs. Taking the address of an rvalue is therefore an error both in C and C++.

References

C++ made things a little more complicated by introducing references. When you call a function that takes a reference, you no longer have to explicitly take the address of data you’re passing to it. The compiler will do it implicitly. So the previous example no longer looks so wrong when function f is declared to take a reference:

void f(int & i) { ++i; }
int g();
// caller's code
f(g());
f(1 + 1);

These calls are still flagged as errors–you can’t take a reference to an rvalue–but you can’t figure it out just by examining the caller’s code.

However, passing by reference has other unrelated uses. For instance, passing large data structures by reference may be more efficient than passing them by value. In that case we don’t really care if the original data is an lvalue or an rvalue. We’d like such calls to compile.

Const references

How can a C++ programmer have their cake and eat it too? How can he or she tell the compiler that the argument is passed by reference not because it is to be modified but for the sake of performance, and that it’s okay to call it with rvalues?

That should be easy: Since the callee has no interest in modifying the original data, he or she should mark the reference argument as const. The compiler will then allow binding of an rvalue to a const reference.

Indeed, the following code compiles without errors:

void f(const int & i);
int g();

f(g());
f(1 + 1);

It all made sense at the time, until the auto_ptr entered the picture.

Next time: The auto_ptr disaster.
