The Future of C++ Concurrency and Parallelism

May 11, 2012

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Parallelism, Programming, Software Transactional Memory

It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider, I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. The C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion, so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought the things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among other things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to smaller units of execution, the tasks? If multiple tasks are reusing the same thread should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Patterns Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with std::launch::async. I think this was pretty much ruled out in the ensuing discussion. The pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start and possibly block waiting for responses from the rest of them, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with std::launch::async must always create, or at least immediately steal, a thread.
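To make the deadlock argument concrete, here is a minimal sketch of tasks that all depend on each other. The function name rendezvous and the busy-wait protocol are my own illustration, not anything discussed at the meeting. Each task waits until every other task has started, so the call can only return if all n tasks really run concurrently. On a pool capped at fewer than n threads, the first tasks would spin forever.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>
#include <vector>

// Hypothetical illustration: every task announces its arrival and then
// busy-waits until all n tasks have arrived. This can only finish if all
// n tasks are running at the same time.
// Returns the number of tasks that got past the rendezvous.
int rendezvous(int n)
{
    assert(n > 0);
    std::atomic<int> arrived(0);
    std::vector<std::future<bool>> tasks;
    for (int i = 0; i < n; ++i) {
        // std::launch::async: each task runs as if on its own new thread.
        tasks.push_back(std::async(std::launch::async, [&arrived, n] {
            ++arrived;
            while (arrived.load() < n)      // busy-spin: invisible to a pool
                std::this_thread::yield();
            return true;
        }));
    }
    int passed = 0;
    for (auto& t : tasks)
        if (t.get()) ++passed;              // get() blocks until the task ends
    return passed;
}
```

Calling rendezvous(8) should return 8 as long as the implementation gives each task its own thread. Note that the spinning tasks never block in a way a scheduler could observe, which is exactly the case a pool cannot detect.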

The situation is different with std::async called with the default launch policy (the bitwise OR of std::launch::async and std::launch::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.
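The two outcomes can be observed directly. The sketch below uses names of my own choosing (tls_counter, bump, observe). It sets a thread_local counter on the calling thread, then runs a task that increments the counter and reports what it saw. I force each policy explicitly here, because with the default policy the implementation picks one of the two.

```cpp
#include <future>

thread_local int tls_counter = 0;   // one independent instance per thread

int bump() { return ++tls_counter; }

// Sets the calling thread's counter to 100, runs bump() under the given
// launch policy, and returns the value the task observed.
int observe(std::launch policy)
{
    tls_counter = 100;
    std::future<int> f = std::async(policy, bump);
    return f.get();   // a deferred task runs right here, on this thread
}
```

With std::launch::deferred the task shares the caller's counter and returns 101. With std::launch::async, on an implementation that starts a fresh thread, the task sees a freshly initialized counter and returns 1. An implementation that reuses pool threads could return something else, which is the whole controversy.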

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: Disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!
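A small sketch of the atomic case follows. Strictly speaking, the requirement the Standard places on T in std::atomic<T> is that it be trivially copyable, which is a slightly wider category than POD. The hypothetical Point type below has a user-provided default constructor, so it is not a POD, yet it still qualifies. Whether the resulting atomic is lock-free depends on the platform, so the sketch only queries it.

```cpp
#include <atomic>
#include <type_traits>

// Hypothetical example type. The user-provided default constructor makes
// it non-trivial (so not a POD), but copying is still a plain byte copy.
struct Point {
    int x;
    int y;
    Point() : x(0), y(0) {}
    Point(int a, int b) : x(a), y(b) {}
};

static_assert(std::is_trivially_copyable<Point>::value,
              "trivially copyable, so usable with std::atomic");
static_assert(!std::is_trivial<Point>::value,
              "not trivial, hence not a POD");

// Stores a Point atomically and reads it back as a single unit.
int atomic_round_trip()
{
    std::atomic<Point> p(Point(1, 2));
    p.store(Point(3, 4));
    Point q = p.load();
    return q.x + q.y;
}

// Reports whether this platform implements the atomic without a lock.
// The answer is implementation-dependent.
bool point_is_lock_free()
{
    std::atomic<Point> p(Point(0, 0));
    return p.is_lock_free();
}
```

On common 64-bit platforms an 8-byte struct like this usually fits a hardware atomic word, but that is an assumption about the target, not a guarantee.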

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yasskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.
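To show what an explicit object representing a set of tasks buys you, here is a toy TaskGroup of my own invention. It is not Google's Executor interface and not PPL's task group; it has no pool and no work stealing, and simply gives each task its own thread. The only point is the collective wait_all operation.

```cpp
#include <atomic>
#include <future>
#include <utility>
#include <vector>

// Hypothetical task-group object: owns a set of tasks and offers a
// collective wait. A real executor would schedule tasks on a pool;
// this sketch just launches one thread per task.
class TaskGroup {
public:
    // Spawns a task. The callable must take no arguments and return void.
    template <typename F>
    void run(F fn)
    {
        tasks_.push_back(std::async(std::launch::async, std::move(fn)));
    }

    // Blocks until every task spawned so far has finished. If a task
    // threw, the first such exception (in spawn order) is rethrown.
    void wait_all()
    {
        for (auto& t : tasks_)
            t.get();
        tasks_.clear();
    }

private:
    std::vector<std::future<void>> tasks_;
};

// Usage: sums 1..n by spawning one task per term and waiting for all.
int sum_with_group(int n)
{
    std::atomic<int> total(0);
    TaskGroup group;
    for (int i = 1; i <= n; ++i)
        group.run([&total, i] { total += i; });
    group.wait_all();
    return total.load();
}
```

A collective cancel would need cooperation from the tasks themselves, which this sketch leaves out.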

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost.Asio). This is a hot topic at Microsoft because the new Windows 8 runtime is based on an asynchronous API — any call that might take more than 50ms is implemented as an asynchronous call. Of course you can program to an asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.
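To give a feel for the shape of ContinueWith in C++ terms, here is a rough sketch of a free function I call then. The name and signature are my assumption, not the text of any proposal. It returns a new future immediately instead of blocking the caller. Be warned that this toy version cheats: it parks a whole thread waiting on the input future, which a real implementation would avoid by attaching the continuation to the shared state.

```cpp
#include <future>
#include <string>
#include <utility>

// Hypothetical combinator: returns at once with a future for fn(result).
// Sketch only: it dedicates a thread to waiting on the input future.
// Does not handle std::future<void>.
template <typename T, typename F>
auto then(std::future<T> fut, F fn)
    -> std::future<decltype(fn(std::declval<T>()))>
{
    return std::async(std::launch::async,
        [](std::future<T> f, F g) { return g(f.get()); },
        std::move(fut), std::move(fn));
}

// Usage: chain two continuations onto an asynchronous computation.
std::string describe_answer()
{
    std::future<int> answer =
        std::async(std::launch::async, [] { return 6 * 7; });
    std::future<int> plus_one =
        then(std::move(answer), [](int v) { return v + 1; });
    std::future<std::string> text =
        then(std::move(plus_one), [](int v) { return std::to_string(v); });
    return text.get();   // the only blocking call in the chain
}
```

Because each call to then yields another future, the results can be fed into further combinators, which is the composability that plain std::future lacks.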

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing to write a channels proposal). By the way, futures can also be viewed as channels, so this generalization might go a long way.
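Here is roughly what the token scheme looks like when reduced to its essence. The class names CancellationSource and CancellationToken are my own guesses, not the proposal's API. Underneath, it is just a shared atomic Boolean that is written once and polled by readers, which is why I call it a one-shot Boolean channel.

```cpp
#include <atomic>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical read end of the channel: tasks poll it.
class CancellationToken {
public:
    explicit CancellationToken(std::shared_ptr<std::atomic<bool>> flag)
        : flag_(std::move(flag)) {}
    bool is_cancelled() const { return flag_->load(); }
private:
    std::shared_ptr<std::atomic<bool>> flag_;
};

// Hypothetical factory and write end: hands out tokens, sends "cancel" once.
class CancellationSource {
public:
    CancellationSource()
        : flag_(std::make_shared<std::atomic<bool>>(false)) {}
    CancellationToken token() const { return CancellationToken(flag_); }
    void cancel() { flag_->store(true); }
private:
    std::shared_ptr<std::atomic<bool>> flag_;
};

// Starts a worker that polls its token, cancels it, and reports whether
// the worker noticed and stopped.
bool cancel_worker()
{
    CancellationSource source;
    CancellationToken token = source.token();
    // The token is passed to the task by lambda capture.
    std::future<bool> worker = std::async(std::launch::async, [token] {
        while (!token.is_cancelled())   // the task is responsible for polling
            std::this_thread::yield();
        return true;                    // cancellation actions would go here
    });
    source.cancel();
    return worker.get();
}
```

Replace the bool with an arbitrary value type and allow more than one write, and you are most of the way to a general channel.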

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.

16 Responses to “The Future of C++ Concurrency and Parallelism”

Joel Falcou Says:

May 11, 2012 at 1:12 pm
SIMD and GPGPU look the same from afar, but finding a proper abstraction that doesn’t botch the performance of one or the other is hard. We tried hard to find one in our own tools and we’re back where Fortran started: a first-class-citizen array-like class, because, well, you can’t get better than a big chunk of data to represent a big chunk of data.

Now, the real problem with such stuff is the vectorization of non-trivial, control-like structures.

All in all, I’ll be very happy if none of the vendor proposals gets in, as they are clearly far too vendor-centric.
Michal Mocny Says:

May 11, 2012 at 1:42 pm
+1 to “futures can be viewed as [unbuffered] channels”, I’ve always thought of them that way (and would love to see re-settable futures, and buffered futures/channels).

I am not sure why “The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async.” It seems to me that this would be an excellent feature, but I still see all the issues you outlined with async as needing to be solved. Unless you mean to say, the wording of the standard can be changed to say that async may/will have thread pool semantics?

Again, I’ll reiterate, it’s so easy to write your own async that runs without a thread pool, that I would not mind seeing std::async always defaulting to using one.
Bartosz Milewski Says:

May 11, 2012 at 1:45 pm
Yes, arrays and matrices can be used to drive parallelism. More general graphs can be represented as sparse matrices. So these abstractions form a common language of data parallelism. The other ingredient is partitioning of data. This should also be done abstractly, in particular user-defined partitionings should be possible. The final ingredient is specifying the target: vector processor, GPGPU or, the default, general cores. I’m not making these up, just stealing ideas from Chapel.
Kevin Cameron Says:

May 11, 2012 at 2:34 pm
Having worked with the IEEE and Accellera committees over the last decade trying to fix hardware description languages, I’ve decided that it’s often a lost cause when up against entrenched interests. If you have something particular you want to do, best just go and implement it in the open-source compilers and see who likes it. Unfortunately there is not a good base of FOSS HDL compilers so my current plan (if I ever find the time) is just to get the threading support HDLs need into the Clang/LLVM tool chain and go from there (CUDA has gone that way). My demo C++ version is here – http://parallel.cc

I don’t think the new threading stuff in C++ is particularly useful; it’s just a rehash of old SMP approaches. Kicking off a thread with a function (std::async) rather than a class object gives you a stack management problem that doesn’t scale well. It’s just a bad abstraction level.

Erlang has similar underlying semantics to the HDLs and there is an LLVM effort there –

http://www.phoronix.com/scan.php?page=news_item&px=MTA4Nzc

– so I expect that if I wait long enough most of the work will get done for me 😉
Scott Meyers Says:

May 11, 2012 at 2:44 pm
Regarding your comment that “you can’t define a non-POD as atomic,” what is the basis for this claim? I see in 29.5/1 that for std::atomic<T>, T must be trivially copyable, but, as far as I know, that’s not the same as being a POD.
Bartosz Milewski Says:

May 11, 2012 at 3:31 pm
@Scott: You’re absolutely right, PODs are trivially copyable but the opposite is not true. And probably the thread_local issue is also more subtle than just POD/non-POD.
Joel Falcou Says:

May 12, 2012 at 12:00 am
@Bartosz, in fact I wonder if the really correct way is not to tie parallelism to data structures and have a family of ADTs tied to different kinds of parallelism. Then we’re back to the holy trifecta, having iterators embedding parallel manipulation and ranges getting extended across the parallel structure. I think we have to move away from the unconstrained threadfest and get some patterns inside, much like we moved from mess-of-goto to structured programming.
C++ Concurrency and Parallelism « Thoughts Serializer Says:

May 14, 2012 at 3:58 am
[…] on why questions “y u no standar threading, C++?” don’t have an easy answer. Link: The Future of C++ Concurrency and Parallelism […]
petke Says:

May 15, 2012 at 10:21 am
Interesting post as always. Was looking forward to it.

> “I won’t bore you with details — they will be available at the committee web site.”

I wouldn’t mind a bit of boredom. Where would I find the committee web site?
Bartosz Milewski Says:

May 17, 2012 at 10:06 pm
I’m not sure they have been published yet. I’ll post the link when I know more.
Asynchronous Calls in C++ and the Continuation Monad | FP Complete Says:

June 20, 2012 at 2:20 am
[…] C++ Standard, on which the work started even before C++11 saw the light of day. In my last blog I reported on the first meeting of the C++ study group on concurrency and parallelism that took place in […]
scottmeyers Says:

June 23, 2012 at 6:34 pm
Regarding “In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get)” I think it’s important to note that the deferred task will also be invoked if future::wait (but not future::wait_for or future::wait_until) is invoked. In fact, future::get’s behavior is specified in terms of future::wait’s.
Doug Gale Says:

May 1, 2013 at 9:55 pm
An implementation of std::async that creates a thread every time is utterly useless. The performance would be abysmal and nobody would use it. Are there standards committee members so naive to think that performance doesn’t matter for parallelism? The whole idea of parallelism is performance! The arguments for thread local variables are also pretty naive. What use case have they dreamed up where thread local variables are even involved? If an async operation runs in a thread then it runs in a thread. That is a detail of the underlying thread implementation. Supporting non-POD thread local variables is a ridiculous idea that introduces significant overhead. Again: async/threading is for performance, why add expensive synchronization operations at every use?
What is the issue with std::async? - BlogoSfera Says:

August 23, 2015 at 5:57 am
[…] What are the problems presented in this video? Are they related to this article? […]
Masha Says:

December 18, 2018 at 10:25 am
Very informative and interesting post, thank you! I was happy to find it as it shed a lot of light on the problem we see with std::async and thread_local on Windows. Six years after your original post, MSVC 2017 still has the problem with thread_local lifetime, which affects all functionality that is used with std::async. Would you have any suggestions or advice on how to work around it, given that we can’t stop supporting Windows? We rely heavily on thread_local data in our design as it allows us to eliminate any shared data between the threads. Contrary to Herb’s statement, our team implemented non-POD thread locals, which make us suffer on Windows 🙂 Everything works flawlessly on OS X and Linux.
Kevin Cameron Says:

December 19, 2018 at 2:42 pm
The easier way to deal with threads –