I summarize safe ways of passing arguments to threads, and their gotchas
Show an optimization of monitors based on epochs, together with its maintenance pitfalls
Debug the resulting data race

Follow @BartoszMilewski
(You can also follow me on Google+, if you search for Bartosz Milewski.)

Why did I do six concurrency tutorials without mentioning mutexes? I think people resort to explicit locking much too early. In this installment I compare two implementations side by side and the results might be surprising. One is moving data between threads (the new C++11 move semantics), the other is using a shared monitor. Whatever the overheads of copying or locking are, they are drowned by the work the threads are doing; and locking is much more error-prone (especially if you try to optimize it).
Follow @BartoszMilewski
(You can also follow me on Google+, if you search for Bartosz Milewski.)

I wrote a new parallel directory listing program with C++11 async tasks, but this time the number of tasks operating in parallel was bounded. The result was an algorithm that reminded me of MapReduce, so I described how MapReduce works. Here’s the video.
Follow @BartoszMilewski
(You can also follow me on Google+, if you search for Bartosz Milewski.)

If you expected std::async to be just syntactic sugar over thread creation, you can stop reading right now, because that’s what it is. If you expected more, read on.

Don’t get me wrong, std::async combines several useful concurrency concepts into a nice package: It provides a std::future for the return value, and hides the std::promise side of the future. It also provides options to run a task synchronously. (See the Appendix for a short refresher.)

But tasks have a slightly different connotation in parallel programming: they are the basic blocks of task-based parallelism. And C++11 tasks fall short on that account.

Task-Based Parallelism

Tasks are an answer to performance and scalability problems associated with threads. Operating system threads are rather heavy-weight; it takes time and system resources to create a thread. If you have an algorithm that naturally decomposes into a large number of independent computations, a.k.a. tasks, you’ll probably get your best performance not by creating a separate thread for each task, but by adjusting the number of threads depending on the amount of parallelism available on your particular system, e.g., the number of cores and their current loads. This can be done by hand, using thread pools and load balancing algorithms; or by using task-based systems.

In a task-based system, the programmer specifies what can be run in parallel but lets the system decide how much parallelism to actually use. The programmer splits the algorithm into tasks and the runtime assigns them to threads — often many tasks to a thread.

There are many implementations of task-based parallelism with varying language support. There’s the Cilk language which pioneered this approach; there’s the built in support in Haskell, F#, and Scala; and there are several C++ libraries, like Microsoft PPL or Intel TBB.

Unlike thread creation, task creation is supposed to be relatively inexpensive, letting the programmer explore low-level granularity parallelism and take advantage of multicore speedups.

At the center of task-based systems are work-stealing queues. When a thread creates tasks, they are initially queued on the processor (core) that runs the thread. But if there are other idle processors, they will try to steal tasks from other queues. The stolen tasks are then run in parallel.

Notice that tasks must be able to migrate between threads. What’s more, efficient use of OS threads requires that tasks that are blocked, for instance waiting for I/O or waiting for other tasks to finish, should be taken off their threads, so that other tasks may reuse them.

C++11 Tasks

My expectation was that C++11 “tasks” that are created using std::async should be abstracted from threads, just as they are in task-based parallelism. When I started preparing a video tutorial about tasks in C++, I wrote a simple program to demonstrate it. I created async tasks using the default launch policy and waited for them to complete. Each task slept for one second and then printed its thread ID.

int main() 
{
    std::cout << "Main thread id: " << std::this_thread::get_id() 
        << std::endl;
    std::vector<std::future> futures;
    for (int i = 0; i < 20; ++i)
    {
        auto fut = std::async([]
        {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            std::cout << std::this_thread::get_id() << " ";
        });
        futures.push_back(std::move(fut));
    }
    std::for_each(futures.begin(), futures.end(), [](std::future & fut)
    {
        fut.wait();
    });
    std::cout << std::endl;
}

The results were surprising. The first six tasks executed in parallel, each in its own thread. But the rest of the tasks executed in the main thread one after another, separated by 1 second intervals. (Note: this behavior was fixed in v 1.7 of Just::Thread — read on).

The output of the task test

This approach to parallelism obviously doesn’t scale very well.

Then I wrote another program that lists directories recursively, creating a separate async task for each subdirectory. But this time I explicitly requested launch policy launch::async, which guarantees that each task will start in a new thread. This program worked up to a point, but when I tried to list my whole disk, it failed by exhausting Windows’ limited thread creation capability. Again, this approach doesn’t scale very well.

What was even worse, when the program didn’t fail, it performed better with launch::deferred policy, which forces all tasks to be executed serially, than with the launch::async policy. That’s because thread creation in Windows is so expensive that it can easily nullify performance gains of parallelism (although Windows 7 supports user-level threads, which might bypass these problems).

My first reaction was to blame Anthony Williams for badly implementing the Just::Thread library I was using. When he assured me that it was Standard compliant, I turned to Herb Sutter and Hans Boehm for confirmation and they sided with Anthony. It turns out that there are serious problems that prevented C++11 from standardizing task-based concurrency.

The Problems

The foundation of task-based parallelism is the ability for tasks to share threads and to migrate between threads. This sharing and migration must be transparent.

The requirements for the default-launch tasks are the following:

The runtime can either run such task asynchronously or synchronously
When it’s run synchronously, it should be run in the context of the parent thread
When it’s run asynchronously, it should behave as if it were run on a separate thread

Strictly speaking, a task could always call this_thread::get_id() and fool any attempt at thread sharing or migration by discovering the ID of the current thread. In general, the namespace std::this_thread, which also contains sleep functions, is thread-bound.

But let’s suppose that we only require that asynchronous tasks behave as if they were run on separate threads, except when they call functions in the this_thread namespace. There are still several problems.

Thread-Local Variables

C++11 introduced a new thread_local storage qualifier. A thread-local variable is separately initialized and destroyed in every thread. It must not survive thread termination. This requirement complicates thread sharing.

In our exchange, Hans Boehm clarified the termination requirement for tasks: Thread-local variables must be fully destroyed before the owning thread returns from calling get or wait on a future produced by the corresponding std::async; or before the destructor of that future returns, whichever comes first.

This actually leaves some wiggle room: A thread could be reused if the runtime guarantees that thread-local variables of terminating tasks are correctly destroyed. Unfortunately, this might be impossible if the programmer calls OS-specific API, like Windows’ TlsAlloc. Anthony also pointed out that it’s not clear how to deal with DLL_THREAD_DETACH handlers in DLLs, when switching to task granularity.

Locks

There’s another aspect of C++11 concurrency that is tied to threads — locking. The std::mutex object is thread aware. It requires that unlock is called from the same thread as lock. Why should this be a problem?

I haven’t mentioned yet that task migration might be necessary in the middle of execution, but that is what most task-based systems do. It’s an optimization in the case when you’re dealing with blocking tasks.

There are two major blocking scenarios: external and internal. External blocking happens when a task calls an OS function (directly or indirectly) that may block, for instance waiting for I/O. My directory listing program did a lot of that. Internal blocking, which is potentially easier to intercept, happens when tasks are blocked on futures. My program did a lot of that too, when waiting for the results of tasks that were listing subdirectories of the current directory.

A blocked task doesn’t use processor resources, but it does occupy a thread. That thread could be reused to run another task. But that requires a clean way of taking a task off a thread and then restoring its state once the call unblocks. Now, if the task takes a lock on a mutex before blocking, it cannot be migrated to another thread. The unlocking wouldn’t work from another thread.

Herb Sutter observed that, if we tried to restore the task to its original thread, we might get into a spurious deadlock, when the original thread is occupied be another task waiting for the same mutex.

The other problem with locks is in the use of a recursive_mutex. A thread may lock such a mutex multiple times before calling unlock (also multiple times). But if a second thread tries to lock a mutex that’s owned by the current thread, it will block. As long as tasks run is separate threads, this works. But if they share the same thread, they may successfully acquire the same mutex and cause data corruption.

Imagine the following scenario. Task A wants to modify a shared data structure and takes a lock on its recursive mutex. It then blocks on some OS call (probably not a good idea in general, but it may happen). The task gets taken off of the current thread, and task B starts executing. It takes a lock on the same mutex — successfully, as it is executing in the same thread, and reads or modifies a data structure that was in the middle of being modified by task A. A disaster unfolds.

Notice that this is not a problem if tasks are run serially in the same thread, as it happens with the launch::deferred policy, because each task runs to completion before allowing another task to run.

Finally, such migration of running tasks would also wreaks havoc with thread-local variables.

Possible Solutions

Optimizing the Default Launch Policy

The easiest part was to change the implementation of the default policy, to defer the decision whether to run a given task asynchronously or as deferred. Anthony was quick to notice this, and almost immediately released a fix — version 1.7 of Just::Thread.

The idea is simple, you schedule N tasks asynchronously — N being some runtime number dependent on the number of available cores — and put the rest on a queue. When any of the queued tasks is forced (by the call to get or wait on its future), it is executed synchronously in the context of the forcing thread — as if the launch::deferred policy were used. Otherwise, as soon as one of the asynchronous tasks finishes, the next task from the queue is scheduled to run asynchronously. Here’s the output of the same test program after the changes in Just::Thread:

The output of the test with the new library

This time each task ran in a separate thread, but because of the default launch policy, they ran in small batches that could effectively exploit the parallelism of a multicore machine. Still, without thread reuse, the runtime had to create 22 OS threads. The hope is that the operating system caches thread resources so that the creation of the second batch of threads is substantially cheaper than the first one.

(I suggest running this test when evaluating any implementation of a task library.)

Thread Reuse

The next step in improving task performance would be to use a thread pool and reuse threads instead of creating them from scratch. Because of the problem with thread-local variables, it might be impossible to implement thread reuse without some help from the language runtime. The task library would need hooks into every creation of a thread_local variable, so it can destroy them at task exit.

That still leaves the problem of tasks calling APIs like TlsAlloc directly. An atractive option (for library writers) would be to ignore the problem — after all the language provides a portable way of dealing with thread-local storage.

Task Migration

We would like to be able to remove a blocked task from a thread in order to run another task on it. This is not easy because of thread-locals and locks.

The problem with thread_local variables is that they should really be task-local. Or at least they should behave “as if” they were task-local. So when two tasks are sharing the same thread, there has to be some mechanism for “context switching” between them. The context would have to include the state of all thread-local variables.

Migrating a task that is holding a lock could only be done if locks were task-bound rather than thread-bound. Interestingly, there is a provision in the Standard for this kind of behavior. The definition of Lockable in (30.2.5) talks about “execution agents” that could be threads, but could also be something else. This comment is of particular interest:

[ Note: Implementations or users may introduce other kinds of agents such as processes or thread-pool tasks. —end note ]

However, the Standard Library mutex is bound to threads, not tasks. The intention of (30.2.5) is that, if you create your own separate task library with your own task-local variables and mutexes, you will still be able to use the standard utilities such as lock_guard or condition variables. But the implementation of std::async tasks must work with thread_local and std::mutex.

Deadlocks

Here’s a potential scenario where two tasks could deadlock if their threads are reused while they are blocked:

Task A runs on thread T1, takes the mutex M1, and makes a blocking call
The runtime takes A off T1 (saves its state, etc.) and puts it in a Blocked queue
Task B starts executing on the same thread, T1, and tries to take M1, which is locked by A
In order to unlock M1, task A would have to run on T1 — the same thread the lock was taken on — but T1 is now occupied by B, and A can’t make progress

The only way to resolve this deadlock is to take B off the thread. So that’s what a task migration system must do — guarantee that any blocked task is taken off its thread.

In general, any spurious deadlock would involve a bunch of blocked tasks. If all of them are blocked on locks, this is an actual deadlock which would happen anyway. If there is at least one task that can make progress when its blocking call returns, it can always be assigned back to its thread, either because the task running on it completes, or because it’s blocked and will be taken off of it.

Of course if we allow lock migration, as in associating locks with tasks rather than threads, the problem disappears on its own.

Conclusion

What I learned from this exercise was that std::async with default launch policy can be made usable. However its strong association with threads makes it virtually impossible to implement full-blown task-based parallelism. A task-based system could be implemented as a library but it would have to come with severe restrictions on the use of thread_local variables and standard mutexes. Such a system would have to implement its own version of task-local variables and mutexes.

I’m grateful to Anthony Williams, Hans Boehm, Herb Sutter, and Artur Laksberg for enlightening discussions.

Appendix: `async` Refresher

Here’s some typical code that uses std::async to start a task:

auto ftr = std::async([](bool flag)->bool
{
    if (flag)
        throw std::exception("Hi!");
    return flag;
}, true); // <- pass true to lambda
// do some work in parallel...
try
{
    bool flag = ftr.get(); // may re-throw exception
}
catch(std::exception & e)
{
    std::cout << e.what() << std::endl;
}

The code calls std::async with a lambda (anonymous function) that takes a Boolean flag and returns a Boolean. The lambda can either throw an exception or return the flag back. The second argument to async (true, in this case) is passed to the lambda when it is executed.

The value or the exception is passed back to the parent code when it calls the get method on the future returned by async. The call to async may create a new thread, or defer the execution of the function until the call to get is made.

The same code may be implemented directly using std::thread, std::promise, and std::future but, among other things, it requires modifications to the thread function (here, to the lambda):

std::promise prms;
auto th = std::thread([](std::promise<bool> & prms, bool flag)
{
   if (flag)
     prms.set_exception(std::make_exception_ptr(std::exception("Hi!")));
   else
     prms.set_value(flag);
}, std::ref(prms), true);
// do some work
th.join();
auto ftr = prms.get_future();
try
{
   bool flag = ftr.get();
}
catch(std::exception & e)
{
   std::cout << e.what() << std::endl;
}

This video tutorial took a lot of effort because of my inflated expectations. I thought that std::async was a gateway to task-based parallelism. I blogged about task-based concurrency in The Future of Concurrent Programming and, in the context of Haskell, in Parallel Programming with Hints. And of course there is the problem of lack of composability of futures. So for the next 10 or so years we’ll have to stick to libraries, such as Microsoft PPL or Intel TBB or even OpenMP. Or experiment with other languages.
Follow @BartoszMilewski
(You can also follow me on Google+, if you search for Bartosz Milewski.)

Bartosz Milewski's Programming Cafe

October 2011

C++11 Concurrency Series: 8. Data Races

C++11 Concurrency Tutorial: 7. Mutex, Lock, and Monitor

C++11 Concurrency Series: 6. Map Reduce

Async Tasks in C++11: Not Quite There Yet

Task-Based Parallelism

C++11 Tasks

The Problems

Thread-Local Variables

Locks

Possible Solutions

Optimizing the Default Launch Policy

Thread Reuse

Task Migration

Deadlocks

Conclusion

Appendix: `async` Refresher

C++11 Concurrency Tutorial: 5. Tasks

Top Posts

License

Blogroll

Follow Me

Archives

Bartosz Milewski's Programming Cafe

October 2011

C++11 Concurrency Series: 8. Data Races

C++11 Concurrency Tutorial: 7. Mutex, Lock, and Monitor

C++11 Concurrency Series: 6. Map Reduce

Async Tasks in C++11: Not Quite There Yet

Task-Based Parallelism

C++11 Tasks

The Problems

Thread-Local Variables

Locks

Possible Solutions

Optimizing the Default Launch Policy

Thread Reuse

Task Migration

Deadlocks

Conclusion

Appendix: async Refresher

C++11 Concurrency Tutorial: 5. Tasks

Top Posts

License

Blogroll

Follow Me

Archives

Appendix: `async` Refresher