Multicore

Archived Posts from this Category

November 7, 2011

Supercomputing in Seattle

Posted by Bartosz Milewski under Chapel, Concurrency, Distributed Programming, Multicore, Parallelism, Programming
[8] Comments

I firmly believe that supercomputing of today is the mainstream computing of tomorrow. A year and a half ago I wrote a blog about the future of concurrent programming based on new developments in systems and languages in the HPC (High-Performance Computing) community. Hopefully, this year I’ll learn more at the SC11 conference that’s taking place in Seattle in November (my employer, Corensic, will have a booth there). I’m especially interested in Chapel, the HPCS (High Productivity Computing Systems) language under development by Cray Inc., also here in Seattle. There will be a whole-day Chapel tutorial at SC11, which I’m going to attend.

Why Chapel? Whenever I go to a conference and hear about a new language development to support parallel programming, I immediately compare it with Chapel. Chapel does task-based parallelism better than Cilk, TBB or PPL; data-based parallelism better than AMP or ArBB; generic programming better than D (sorry, Andrei, I’m really partial to concepts) — the list goes on. It’s unfortunate that Chapel is pigeonholed as an HPC language, because it’s perfectly adequate for general purpose programming. In fact I installed it on my laptop and wrote a few programs in it.

A lot of HPC is dedicated to scientific computations, modeling of complex systems, and processing of large quantities of data. That’s where parallel and distributed programming shines. There is no doubt in my mind that the kind of computational power that’s used in scientific computations today will soon be available on game consoles, desktops, and then on tablets and smartphones, likely in concert with cloud computing. But we are not going to use our iPhones to simulate chain reactions in nuclear warheads or heat conduction in rocket engines, are we?

So what everyday tasks could benefit from this kind of power? Obviously the game industry has an insatiable appetite for computing resources. Enhanced and virtual reality are peeking from around the corner. Speech recognition and natural language processing have already made inroads into smartphones. But I’m sure that, once the power is there, we will find plenty of new and unexpected applications: if you build it, they will come.

The question is: How do we write programs that can harness the power of multicores, GPUs, and distributed systems like the Cloud — possibly all three at the same time? One thing I know for sure: Not by painstakingly managing threads, locks, message passing, copying of data over the network, etc. And this is where the current C++ (C++11) is stuck, and Chapel blazes the trail.

The major advantage of Chapel, in my mind, is that it separates the logic of the algorithm from the details of its implementation on a particular system. In the ideal world you would write a program in a high-level language and the compiler plus runtime would figure out how to run it on a particular system: what can be run in parallel, which parts can be delegated to GPUs, which parts can migrate to other machines on the network, and so on. Well, we can always dream! In reality the programmer must still tell the compiler all those things. Yes, you can do this in C++ but you’ll make your code totally unreadable. The details of implementation would quickly obscure the heart of the algorithm.

In Chapel, you express parallelism in terms of tasks; not threads, thread pools, processes, or computers. You express communications in terms of a shared global address space that can span whole clusters of computers. Separately, on the side, you describe the distribution of computations in terms of locales. Each node on the network is a separate locale. Each GPU is a locale (this feature is still under development). You define your data structure in the global address space, but you separately describe how you would like it to be cut up and distributed between various locales.

You may see elements of this approach in other languages, libraries, and language extensions, but never in such a comprehensive manner as in Chapel. Tasks, for instance, appear in Cilk, PPL (Parallel Pattern Library), and TBB (Threading Building Blocks), together with elements of data-driven parallelism. Intel extended its TBB library to ArBB (Array Building Blocks); Microsoft came up with a C++ extension, AMP (Accelerated Massive Parallelism); AMD put its weight behind OpenCL. Everybody and his brother are trying to catch the wave of parallelism and high-throughput computing. It just so happens that the HPC crowd has been riding this wave for a long time and there’s a lot we can learn from them.

Which is why Seattle will be hot during the week of November 12-18, no matter what the weather reports predict.

Additional Links

Chapel events at SC11
SCC11 schedule
Birds of a Feather, Chapel Lightning Talks

October 31, 2011

C++11 Concurrency Series: 8. Data Races

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Tutorial
[4] Comments

In this tutorial:

I summarize safe ways of passing arguments to threads, and their gotchas
Show an optimization of monitors based on epochs, together with its maintenance pitfalls
Debug the resulting data race

(You can also follow me on Google+, if you search for Bartosz Milewski.)

October 24, 2011

C++11 Concurrency Tutorial: 7. Mutex, Lock, and Monitor

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Parallelism, Programming, Tutorial
[21] Comments

Why did I do six concurrency tutorials without mentioning mutexes? I think people resort to explicit locking much too early. In this installment I compare two implementations side by side and the results might be surprising. One is moving data between threads (the new C++11 move semantics), the other is using a shared monitor. Whatever the overheads of copying or locking are, they are drowned by the work the threads are doing; and locking is much more error-prone (especially if you try to optimize it).

October 17, 2011

C++11 Concurrency Series: 6. Map Reduce

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Programming, Tutorial
1 Comment

I wrote a new parallel directory listing program with C++11 async tasks, but this time the number of tasks operating in parallel was bounded. The result was an algorithm that reminded me of MapReduce, so I described how MapReduce works. Here’s the video.

October 10, 2011

Async Tasks in C++11: Not Quite There Yet

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Parallelism, Programming
[13] Comments

If you expected std::async to be just syntactic sugar over thread creation, you can stop reading right now, because that’s what it is. If you expected more, read on.

Don’t get me wrong, std::async combines several useful concurrency concepts into a nice package: It provides a std::future for the return value, and hides the std::promise side of the future. It also provides options to run a task synchronously. (See the Appendix for a short refresher.)

But tasks have a slightly different connotation in parallel programming: they are the basic blocks of task-based parallelism. And C++11 tasks fall short on that account.

Task-Based Parallelism

Tasks are an answer to performance and scalability problems associated with threads. Operating system threads are rather heavy-weight; it takes time and system resources to create a thread. If you have an algorithm that naturally decomposes into a large number of independent computations, a.k.a. tasks, you’ll probably get your best performance not by creating a separate thread for each task, but by adjusting the number of threads depending on the amount of parallelism available on your particular system, e.g., the number of cores and their current loads. This can be done by hand, using thread pools and load balancing algorithms; or by using task-based systems.

In a task-based system, the programmer specifies what can be run in parallel but lets the system decide how much parallelism to actually use. The programmer splits the algorithm into tasks and the runtime assigns them to threads — often many tasks to a thread.

There are many implementations of task-based parallelism with varying language support. There’s the Cilk language which pioneered this approach; there’s the built in support in Haskell, F#, and Scala; and there are several C++ libraries, like Microsoft PPL or Intel TBB.

Unlike thread creation, task creation is supposed to be relatively inexpensive, letting the programmer exploit fine-grained parallelism and take advantage of multicore speedups.

At the center of task-based systems are work-stealing queues. When a thread creates tasks, they are initially queued on the processor (core) that runs the thread. But if there are other idle processors, they will try to steal tasks from other queues. The stolen tasks are then run in parallel.

Notice that tasks must be able to migrate between threads. What’s more, efficient use of OS threads requires that tasks that are blocked, for instance waiting for I/O or waiting for other tasks to finish, should be taken off their threads, so that other tasks may reuse them.

C++11 Tasks

My expectation was that C++11 “tasks” that are created using std::async should be abstracted from threads, just as they are in task-based parallelism. When I started preparing a video tutorial about tasks in C++, I wrote a simple program to demonstrate it. I created async tasks using the default launch policy and waited for them to complete. Each task slept for one second and then printed its thread ID.

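#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() 
{
    std::cout << "Main thread id: " << std::this_thread::get_id() 
        << std::endl;
    // std::future is a template; these tasks return nothing, hence <void>
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 20; ++i)
    {
        // default launch policy: the runtime picks async or deferred
        auto fut = std::async([]
        {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            std::cout << std::this_thread::get_id() << " ";
        });
        futures.push_back(std::move(fut));
    }
    // wait for every task; a deferred task runs here, in the main thread
    std::for_each(futures.begin(), futures.end(), [](std::future<void> & fut)
    {
        fut.wait();
    });
    std::cout << std::endl;
}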

The results were surprising. The first six tasks executed in parallel, each in its own thread. But the rest of the tasks executed in the main thread one after another, separated by 1 second intervals. (Note: this behavior was fixed in v 1.7 of Just::Thread — read on).

The output of the task test

This approach to parallelism obviously doesn’t scale very well.

Then I wrote another program that lists directories recursively, creating a separate async task for each subdirectory. But this time I explicitly requested launch policy launch::async, which guarantees that each task will start in a new thread. This program worked up to a point, but when I tried to list my whole disk, it failed by exhausting Windows’ limited thread creation capability. Again, this approach doesn’t scale very well.

What was even worse, when the program didn’t fail, it performed better with the launch::deferred policy, which forces all tasks to be executed serially, than with the launch::async policy. That’s because thread creation in Windows is so expensive that it can easily nullify performance gains of parallelism (although Windows 7 supports user-level threads, which might bypass these problems).

My first reaction was to blame Anthony Williams for badly implementing the Just::Thread library I was using. When he assured me that it was Standard compliant, I turned to Herb Sutter and Hans Boehm for confirmation and they sided with Anthony. It turns out that there are serious problems that prevented C++11 from standardizing task-based concurrency.

The Problems

The foundation of task-based parallelism is the ability for tasks to share threads and to migrate between threads. This sharing and migration must be transparent.

The requirements for the default-launch tasks are the following:

The runtime can either run such task asynchronously or synchronously
When it’s run synchronously, it should be run in the context of the parent thread
When it’s run asynchronously, it should behave as if it were run on a separate thread

Strictly speaking, a task could always call this_thread::get_id() and fool any attempt at thread sharing or migration by discovering the ID of the current thread. In general, the namespace std::this_thread, which also contains sleep functions, is thread-bound.

But let’s suppose that we only require that asynchronous tasks behave as if they were run on separate threads, except when they call functions in the this_thread namespace. There are still several problems.

Thread-Local Variables

C++11 introduced a new thread_local storage qualifier. A thread-local variable is separately initialized and destroyed in every thread. It must not survive thread termination. This requirement complicates thread sharing.

In our exchange, Hans Boehm clarified the termination requirement for tasks: Thread-local variables must be fully destroyed before the owning thread returns from calling get or wait on a future produced by the corresponding std::async; or before the destructor of that future returns, whichever comes first.

This actually leaves some wiggle room: A thread could be reused if the runtime guarantees that thread-local variables of terminating tasks are correctly destroyed. Unfortunately, this might be impossible if the programmer calls an OS-specific API, like Windows’ TlsAlloc. Anthony also pointed out that it’s not clear how to deal with DLL_THREAD_DETACH handlers in DLLs, when switching to task granularity.

Locks

There’s another aspect of C++11 concurrency that is tied to threads — locking. The std::mutex object is thread aware. It requires that unlock is called from the same thread as lock. Why should this be a problem?

I haven’t mentioned yet that task migration might be necessary in the middle of execution, but that is what most task-based systems do. It’s an optimization in the case when you’re dealing with blocking tasks.

There are two major blocking scenarios: external and internal. External blocking happens when a task calls an OS function (directly or indirectly) that may block, for instance waiting for I/O. My directory listing program did a lot of that. Internal blocking, which is potentially easier to intercept, happens when tasks are blocked on futures. My program did a lot of that too, when waiting for the results of tasks that were listing subdirectories of the current directory.

A blocked task doesn’t use processor resources, but it does occupy a thread. That thread could be reused to run another task. But that requires a clean way of taking a task off a thread and then restoring its state once the call unblocks. Now, if the task takes a lock on a mutex before blocking, it cannot be migrated to another thread. The unlocking wouldn’t work from another thread.

Herb Sutter observed that, if we tried to restore the task to its original thread, we might get into a spurious deadlock when the original thread is occupied by another task waiting for the same mutex.

The other problem with locks is in the use of a recursive_mutex. A thread may lock such a mutex multiple times before calling unlock (also multiple times). But if a second thread tries to lock a mutex that’s owned by the current thread, it will block. As long as tasks run in separate threads, this works. But if they share the same thread, they may successfully acquire the same mutex and cause data corruption.

Imagine the following scenario. Task A wants to modify a shared data structure and takes a lock on its recursive mutex. It then blocks on some OS call (probably not a good idea in general, but it may happen). The task gets taken off of the current thread, and task B starts executing. It takes a lock on the same mutex — successfully, as it is executing in the same thread, and reads or modifies a data structure that was in the middle of being modified by task A. A disaster unfolds.

Notice that this is not a problem if tasks are run serially in the same thread, as it happens with the launch::deferred policy, because each task runs to completion before allowing another task to run.

Finally, such migration of running tasks would also wreak havoc with thread-local variables.

Possible Solutions

Optimizing the Default Launch Policy

The easiest part was to change the implementation of the default policy, to defer the decision whether to run a given task asynchronously or as deferred. Anthony was quick to notice this, and almost immediately released a fix — version 1.7 of Just::Thread.

The idea is simple: you schedule N tasks asynchronously (N being some runtime number dependent on the number of available cores) and put the rest on a queue. When any of the queued tasks is forced (by the call to get or wait on its future), it is executed synchronously in the context of the forcing thread, as if the launch::deferred policy were used. Otherwise, as soon as one of the asynchronous tasks finishes, the next task from the queue is scheduled to run asynchronously. Here’s the output of the same test program after the changes in Just::Thread:

The output of the test with the new library

This time each task ran in a separate thread, but because of the default launch policy, they ran in small batches that could effectively exploit the parallelism of a multicore machine. Still, without thread reuse, the runtime had to create 22 OS threads. The hope is that the operating system caches thread resources so that the creation of the second batch of threads is substantially cheaper than the first one.

(I suggest running this test when evaluating any implementation of a task library.)

Thread Reuse

The next step in improving task performance would be to use a thread pool and reuse threads instead of creating them from scratch. Because of the problem with thread-local variables, it might be impossible to implement thread reuse without some help from the language runtime. The task library would need hooks into every creation of a thread_local variable, so it can destroy them at task exit.

That still leaves the problem of tasks calling APIs like TlsAlloc directly. An attractive option (for library writers) would be to ignore the problem; after all, the language provides a portable way of dealing with thread-local storage.

Task Migration

We would like to be able to remove a blocked task from a thread in order to run another task on it. This is not easy because of thread-locals and locks.

The problem with thread_local variables is that they should really be task-local. Or at least they should behave “as if” they were task-local. So when two tasks are sharing the same thread, there has to be some mechanism for “context switching” between them. The context would have to include the state of all thread-local variables.

Migrating a task that is holding a lock could only be done if locks were task-bound rather than thread-bound. Interestingly, there is a provision in the Standard for this kind of behavior. The definition of Lockable in (30.2.5) talks about “execution agents” that could be threads, but could also be something else. This comment is of particular interest:

[ Note: Implementations or users may introduce other kinds of agents such as processes or thread-pool tasks. —end note ]

However, the Standard Library mutex is bound to threads, not tasks. The intention of (30.2.5) is that, if you create your own separate task library with your own task-local variables and mutexes, you will still be able to use the standard utilities such as lock_guard or condition variables. But the implementation of std::async tasks must work with thread_local and std::mutex.

Deadlocks

Here’s a potential scenario where two tasks could deadlock if their threads are reused while they are blocked:

Task A runs on thread T1, takes the mutex M1, and makes a blocking call
The runtime takes A off T1 (saves its state, etc.) and puts it in a Blocked queue
Task B starts executing on the same thread, T1, and tries to take M1, which is locked by A
In order to unlock M1, task A would have to run on T1 — the same thread the lock was taken on — but T1 is now occupied by B, and A can’t make progress

The only way to resolve this deadlock is to take B off the thread. So that’s what a task migration system must do — guarantee that any blocked task is taken off its thread.

In general, any spurious deadlock would involve a bunch of blocked tasks. If all of them are blocked on locks, this is an actual deadlock which would happen anyway. If there is at least one task that can make progress when its blocking call returns, it can always be assigned back to its thread, either because the task running on it completes, or because it’s blocked and will be taken off of it.

Of course if we allow lock migration, as in associating locks with tasks rather than threads, the problem disappears on its own.

Conclusion

What I learned from this exercise was that std::async with the default launch policy can be made usable. However, its strong association with threads makes it virtually impossible to implement full-blown task-based parallelism. A task-based system could be implemented as a library but it would have to come with severe restrictions on the use of thread_local variables and standard mutexes. Such a system would have to implement its own version of task-local variables and mutexes.

I’m grateful to Anthony Williams, Hans Boehm, Herb Sutter, and Artur Laksberg for enlightening discussions.

Appendix: `async` Refresher

Here’s some typical code that uses std::async to start a task:

// requires <future>, <iostream>, <stdexcept>
auto ftr = std::async([](bool flag)->bool
{
    if (flag)
        // std::exception has no standard constructor taking a message
        throw std::runtime_error("Hi!");
    return flag;
}, true); // <- pass true to lambda
// do some work in parallel...
try
{
    bool flag = ftr.get(); // may re-throw exception
}
catch(std::exception & e)
{
    std::cout << e.what() << std::endl;
}

The code calls std::async with a lambda (anonymous function) that takes a Boolean flag and returns a Boolean. The lambda can either throw an exception or return the flag back. The second argument to async (true, in this case) is passed to the lambda when it is executed.

The value or the exception is passed back to the parent code when it calls the get method on the future returned by async. The call to async may create a new thread, or defer the execution of the function until the call to get is made.

The same code may be implemented directly using std::thread, std::promise, and std::future but, among other things, it requires modifications to the thread function (here, to the lambda):

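// requires <future>, <thread>, <functional>, <iostream>, <stdexcept>
std::promise<bool> prms; // the promise carries the task's bool result
auto th = std::thread([](std::promise<bool> & prms, bool flag)
{
   if (flag)
     // store the exception in the promise instead of throwing it
     prms.set_exception(std::make_exception_ptr(std::runtime_error("Hi!")));
   else
     prms.set_value(flag);
}, std::ref(prms), true);
// do some work
th.join();
auto ftr = prms.get_future();
try
{
   bool flag = ftr.get(); // re-throws the stored exception
}
catch(std::exception & e)
{
   std::cout << e.what() << std::endl;
}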

October 3, 2011

C++11 Concurrency Tutorial: 5. Tasks

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Parallelism, Programming, Tutorial
[4] Comments

This video tutorial took a lot of effort because of my inflated expectations. I thought that std::async was a gateway to task-based parallelism. I blogged about task-based concurrency in The Future of Concurrent Programming and, in the context of Haskell, in Parallel Programming with Hints. And of course there is the problem of lack of composability of futures. So for the next 10 or so years we’ll have to stick to libraries, such as Microsoft PPL or Intel TBB or even OpenMP. Or experiment with other languages.

September 26, 2011

C++11 Concurrency Tutorial: Part 4

Posted by Bartosz Milewski under C++, Concurrency, Multicore, Multithreading, Parallelism, Programming, Tutorial
[10] Comments

After a two-week break, attending the Intel Developers Forum and StrangeLoop, I finally had the time to record the fourth tutorial in the series. This time I’m showing how futures and promises work together to enable the passing of results back from threads. I also show how this process of calling a function asynchronously can be simplified using async. Next time I’ll talk more about async tasks and parallelism.

August 15, 2011

Data Races at the Processor Level

Posted by Bartosz Milewski under Concurrency, Memory Model, Multicore, Multithreading, Parallelism, Programming, x86
1 Comment

Back to concurrency — this time at the lowest level. Is it possible to detect a data race by looking at assembly instructions executing on an x86 multicore processor? Find out in my other blog.

June 27, 2011

The Language of Concurrency Video

Posted by Bartosz Milewski under Atomics, Concurrency, Distributed Programming, Memory Model, Multicore, Multithreading, Parallelism, Programming
[7] Comments

By popular demand I turned my introductory webinar into a video presentation. The purpose of this 50 min presentation is to familiarize the viewer with the basic ideas of concurrent programming. Here’s the list of topics:

Processes vs. Threads
Multithreading vs. Parallelization
Shared Memory vs. Message Passing
Data Races and Atomicity Violations
Relaxed Memory Models
Sequential Consistency and DRF Guarantee
Risks of Concurrency
Debugging Concurrent Programs

Comments and suggestions for future videos are very welcome.

June 6, 2011

Introduction to Concurrency Webinar

Posted by Bartosz Milewski under Concurrency, Multicore, Multithreading, Parallelism, Programming, webinar
[6] Comments

Tomorrow, June 7th, 9 a.m. PDT (12 p.m. EDT), I’ll be presenting a webinar, The Language of Concurrency (the same as two weeks ago).


Bartosz Milewski's Programming Cafe
