May | 2012 | Bartosz Milewski's Programming Cafe

It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought the things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among others things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to a smaller units of execution, the tasks?. If multiple tasks are reusing the same thread should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Pattern Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and it also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion. Pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start and possibly block waiting for responses from the rest of them, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with launch_policy::async must always create, or at least immediately steal, a thread.

The situation is different with std::async called with the default launch policy (the bitwise OR of launch_policy::async and launch_policy::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: Disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yaskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost::ASIO). This is a hot topic at Microsoft because the new Windows 8 runtime is based on asynchronous API — any call that might take more than 50ms is implemented as an asynchronous API. Of course you can program to asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing who will write a channels proposal). By the way, futures can be also viewed as channels, so this generalization might go a long way.

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.

Bartosz Milewski's Programming Cafe

May 2012

The Future of C++ Concurrency and Parallelism

Top Posts

License

Blogroll

Follow Me

Archives