Concurrency



I’ve been looking for a good analogy of what programming in C++ feels like and I remembered this 1990 Tim Burton movie, Edward Scissorhands.

It’s a darker version of Pinocchio, shot in suburban settings. In this poster, the scary guy (Johnny Depp) is trying to gently hug Winona Ryder but his clumsy scissor-hands are making it very dangerous for both of them. His face is already covered with deep scars.

Having scissors for hands in not all that bad. Edward has many talents: he can, for instance, create stunning dog hairdos.

I often have these kinds of thoughts after attending C++ conferences: this time it was Going Native 2013. The previous year, the excitement was all about the shiny new C++11 Standard. This year it was more of a reality check. Don’t get me wrong — there were many stunning dog hairdos on display (I mean C++ code that was elegant and simple) but the bulk of the conference was about how to avoid mutilation and how to deliver first aid in case of accidental amputation.

Little shop of horrors

There was so much talk about how not to use C++ that it occurred to me that maybe this wasn’t the problem of incompetent programmers, but that straightforward C++ is plain wrong. So if you just learn the primitives of the language and try to use them, you’re doomed.

C++ has an excuse for that: backward compatibility — in particular compatibility with C. You might think of the C subset of C++ as bona fide assembly language which you shouldn’t use it in day-to-day programming, except that it’s right there on the surface. If you reach blindly into your C++ toolbox, you’re likely to come up with naked pointers, for loops, and all this ugly stuff.

A well known example of what not to do is to use malloc to dynamically allocate memory, and free to deallocate it. malloc takes a count of bytes and returns a void pointer, which you have to cast to something more usable — it would be hard to come up with worse API for memory management. Here’s an example of really bad (but almost correct, if it weren’t for the possibility of null pointer dereference) code:

struct Pod {
    int count;
    int * counters;
};

int n = 10;
Pod * pod = (Pod *) malloc (sizeof Pod);
pod->count = n
pod->counters = (int *) malloc (n * sizeof(int));
...
free (pod->counters);
free (pod);

Hopefully, nobody writes code like this in C++, although I’m sure there are a lot of legacy apps with such constructs, so don’t laugh.

C++ “solved” the problem of redundant casting and error-prone size calculations by replacing malloc and free with new and delete. The corrected C++ version of the code above would be:

struct Pod {
    int count;
    int * counters;
};

int n = 10;
Pod * pod = new Pod;
pod->count = n;
pod->counters = new int [n];
...
delete [] pod->counters;
delete pod;

BTW, the null pointer dereference problem is solved too, because new will throw an exception when the system runs out of memory. There is still a slight chance of a memory leak if the second new fails (But how often does that happen? Hint: how big can n get?) So here’s the really correct version of the code:

class Snd { // Sophisticated New Data
public:
    Snd (int n) : _count(n), _counters(new int [n]) {}
    ~Snd () { delete [] _counters; }
private:
    int _count;
    int * _counters;
};

Snd * snd = new Snd (10);
...
delete snd;

Are we done yet? Of course not! The code is not exception safe.

The C++ lore is that you should avoid naked pointers, avoid arrays, avoid delete. So the remedy for the lameness of malloc is operator new, which is also broken because it returns a dangerous pointer and pointers are bad.

We all know (and have scars on our faces to prove it) that you should use the Standard Library containers and smart pointers whenever possible. Oh, and use value semantics for passing things around. No wait! Value semantics comes with a performance penalty because of excessive copying. So what about shared_ptr and vectors of shared_ptr? But that adds the overhead of reference counting! No, here’s a new idea: move semantics and rvalue references.

I can go on and on like this (and I often do!). Do you see the pattern? Every remedy breeds another remedy. It’s no longer just the C subset that should be avoided. Every new language feature or library addition comes with a new series of gotchas. And you know a new feature is badly designed if Scott Meyers has a talk about it. (His latest was about the pitfalls of, you guessed it, move semantics.)

The Philosophy of C++

Bjarne Stroustrup keeps stressing how important backward compatibility is for C++. It’s one of the pillars of the C++ philosophy. Considering how much legacy code there is, it makes perfect sense. Compatibility, though, takes a very heavy toll on the evolution of the language. If nature were as serious about backward compatibility as C++ is, humans would still have tails, gills, flippers, antennae, and external skeletons on top of internal ones — they all made sense at some point in the evolution.

C++ has become an extremely complex language. There are countless ways of doing the same thing — almost all of them either plain wrong, dangerous, unmaintainable, or all of the above. The problem is that most code compiles and even runs. The mistakes and shortcomings are discovered much later, often after the product has been released.

You might say: Well, that’s the nature of programming. If you think so, you should seriously look at Haskell. Your first reaction will be: I don’t know how to implement the first thing (other than the factorial and Fibonacci numbers) in this extremely restrictive language. This is totally different from the C++ experience, where you can start hacking from day one. What you don’t realize is that it will take you 10 years, if you’re lucky, to discover the “right way” of programming in C++ (if there even is such thing). And guess what, the better a C++ programmer you are, the more functional your programs look like. Ask any C++ guru and they will tell you: avoid mutation, avoid side effects, don’t use loops, avoid class hierarchies and inheritance. But you will need strict discipline and total control over your collaborators to pull that off because C++ is so permissive.

Haskell is not permissive, it won’t let you — or your coworkers — write unsafe code. Yes, initially you’ll be scratching your head trying to implement something in Haskell that you could hack in C++ in 10 minutes. If you’re lucky, and you work for Sean Parent or other exceptional programmer, he will code review your hacks and show you how not to program in C++. Otherwise, you might be kept in the dark for decades accumulating self-inflicted wounds and dreaming of dog hairdos.

Resource Management

I started this post with examples of resource management (strictly speaking, memory management), because this is one of my personal favorites. I’ve been advocating and writing about it since the nineties (see bibliography at the end). Obviously I have failed because 20 years later resource management techniques are still not universally known. Bjarne Stroustrup felt obliged to spend half of his opening talk explaining resource management to the crowd of advanced C++ programmers. Again, one could blame incompetent programmers for not accepting resource management as the foundation of C++ programming. The problem though is that there is nothing in the language that would tell a programmer that something is amiss in the code I listed in the beginning of this post. In fact it often feels like learning the correct techniques is like learning a new language.

Why is it so hard? Because in C++ the bulk of resource management is memory management. In fact it has to be stressed repeatedly that garbage collection would not solve the problem of managing resources: There will always be file handles, window handles, open databases and transactions, etc. These are important resources, but their management is overshadowed by the tedium of memory management. The reason C++ doesn’t have garbage collection is not because it can’t be done in an efficient way, but because C++ itself is hostile to GC. The compiler and the runtime have to always assume the worst — not only that any pointer can alias any other pointer but that a memory address can be stored as an integer or its lower bits could be used as bitfields (that’s why only conservative garbage collectors are considered for C++).

It’s a common but false belief that reference counting (using shared pointers in particular) is better than garbage collection. There is actual research showing that the two approaches are just two sides of the same coin. You should realize that deleting a shared pointer may lead to an arbitrary long pause in program execution, with similar performance characteristics as a garbage sweep. It’s not only because every serious reference counting algorithm must be able to deal with cycles, but also because every time a reference count goes to zero on a piece of data a whole graph of pointers reachable from that object has to be traversed. A data structure built with shared pointers might take a long time to delete and, except for simple cases, you’ll never know which shared pointer will go out of scope last and trigger it.

Careful resource management and spare use of shared_ptr might still be defendable for single-threaded programs, but the moment you start using concurrency, you’re in big trouble. Every increment or decrement of the counter requires locking! This locking is usually implemented with atomic variables, but so are mutexes! Don’t be fooled: accessing atomic variables is expensive. Which brings me to the central problem with C++.

Concurrency and Parallelism

It’s been 8 years since Herb Sutter famously exclaimed: The Free Lunch is Over! Ever since then the big C++ oil tanker has been slowly changing its course. It’s not like concurrency was invented in 2005. Posix threads have been defined in 1995. Microsoft introduced threads in Windows 95 and multiprocessor support in Windows NT. Still, concurrency has only been acknowledged in the C++ Standard in 2011.

C++11 had to start from scratch. It had to define the memory model: when and in what order memory writes from multiple threads become visible to other threads. For all practical purposes, the C++ memory model was copied from Java (minus some controversial guarantees that Java made about behavior under data races). In a nutshell, C++ programs are sequentially consistent if there are no data races. However, since C++ had to compete with the assembly language, the full memory model includes so called weak atomics, which I would describe as portable data races, and recommend staying away from.

C++11 also defined primitives for thread creation and management, and basic synchronization primitives as defined by Dijkstra and Hoare back in the 1960s, such as mutexes and condition variables. One could argue whether these are indeed the right building blocks for synchronization, but maybe that doesn’t really matter because they are known not to be composable anyway. The composable abstraction for synchronization is STM (Software Transactional Memory), which is hard to implement correctly and efficiently in an imperative language. There is an STM study group in the Standards Committee, so there is a chance it might one day become part of the Standard. But because C++ offers no control over effects, it will be very hard to use properly.

There was also a misguided and confusing attempt at providing support for task-based parallelism with async tasks and non-composable futures (both seriously considered for deprecation in C++14). Thread-local variables were also standardized making task-based approach that much harder. Locks and condition variables are also tied to threads, not tasks. So that was pretty much a disaster. The Standards Committee has the work cut out for them for many years ahead. That includes task-based composable parallelism, communication channels to replace futures (one would hope), task cancellation and, probably longer term, data-driven parallelism, including GPU support. A derivative of Microsoft PPL and Intel TBB should become part of the Standard (hopefully not Microsoft AMP).

Let’s take a great leap of faith and assume that all these things will be standardized and implemented by, say, 2015. Even if that happens, I still don’t think people will be able to use C++ for mainstream parallel programming. C++ has been designed for single thread programming, and parallel programming requires a revolutionary rather than evolutionary change. Two words: data races. Imperative languages offer no protection against data races — maybe with the exception of D.

In C++, data is shared between threads by default, is mutable by default, and functions have side effects almost by default. All those pointers and references create fertile grounds for data races, and the vulnerability of data structures and functions to races is in no way reflected in the type system. In C++, even if you have a const reference to an object, there is no guarantee that another thread won’t modify it. Still worse, any references inside a const object are mutable by default.

D at least has the notion of deep constness and immutability (no thread can change an immutable data structure). Another nod towards concurrency from D is the ability to define pure functions. Also, in D, mutable objects are not shared between threads by default. It is a step in the right direction, even though it imposes runtime cost for shared objects. Most importantly though, threads are not a good abstraction for parallel programming, so this approach won’t work with lightweight tasks and work-stealing queues, where tasks are passed between threads.

But C++ doesn’t support any of this and it doesn’t look like it ever will.

Of course, you might recognize all these pro-concurrency and parallelism features as functional programming — immutability and pure functions in particular. At the risk of sounding repetitive: Haskell is way ahead of the curve with respect to parallelism, including GPU programming. That was the reason I so easily converted to Haskell after years of evangelizing good programming practices in C++. Every programmer who’s serious about concurrency and parallelism should learn enough Haskell to understand how it deals with it. There is an excellent book by Simon Marlow, Parallel and Concurrent Programming in Haskell. After you read it, you will either start using functional techniques in your C++ programming, or realize what an impedance mismatch there is between parallel programming and an imperative language, and you will switch to Haskell.

Conclusions

I believe that the C++ language and its philosophy are in direct conflict with the requirements of parallel programming. This conflict is responsible for the very slow uptake of parallel programming in mainstream software development. The power of multicore processors, vector units, and GPUs is being squandered by the industry because of an obsolete programming paradigm.

Bibliography

Here I put together some of my publications about resource management:

  1. Bartosz Milewski, “Resource Management in C++,” Journal of Object Oriented Programming, March/April 1997, Vol. 10, No 1. p. 14-22. This is still pre-unique_ptr, so I’m using auto_ptr for what it’s worth. Since you can’t have vectors of auto_ptr I implemented an auto_vector.
  2. C++ Report in September 1998 and February 1999 (still using auto_ptr).
  3. C++ in Action (still auto_ptr), Addison Wesley 2001. See an excerpt from this book that talks about resource management.
  4. Walking Down Memory Lane, with Andrei Alexandrescu, CUJ October 2005 (using unique_ptr)
  5. unique_ptr–How Unique is it?, WordPress, 2009

Here are some of my blogs criticizing the C++11 approach to concurrency:

  1. Async Tasks in C++11: Not Quite There Yet
  2. Broken promises–C++0x futures

I’ve been dealing with concurrency for many years in the context of C++ and D (see my presentation about Software Transactional Memory in D). I worked for a startup, Corensic, that made an ingenious tool called Jinx for detecting data races using a lightweight hypervisor. I recorded a series of screencasts teaching concurrency to C++ programmers. If you follow these screencasts you might realize that I was strongly favoring functional approach to concurrency, promoting immutability and pure functions. I even showed how non-functional looking code leads to data races that could be detected by (now defunct) Jinx. Essentially I was teaching C++ programmers how to imitate Haskell.

Except that C++ could only support a small subset of Haskell functionality and provide no guarantees against even the most common concurrency errors.

It’s unfortunate that most programmers haven’t seen what Haskell can do with concurrency. There is a natural barrier to learning a new language, especially one that has the reputation of being mind boggling. A lot of people are turned off simply by unfamiliar syntax. They can’t get over the fact that in Haskell a function call uses no parentheses or commas. What in C++ looks like

f(x, y)

in Haskell is written as:

f x y

Other people dread the word “monad,” which in Haskell is essentially a tool for embedding imperative code inside functional code.

Why is Haskell syntax the way it is? Couldn’t it have been more C- or Java- like? Well, look no farther than Scala. Because of Java-like syntax the most basic functional trick, currying a function, requires a Scala programmer to anticipate this possibility in the function definition (using special syntax: multiple pairs of parentheses). If you have no access to the source code for a function, you can’t curry it (at least not without even more syntactic gymnastics).

If you managed to wade through this post so far, you are probably not faint of heart and you will accept the challenge of reading an excellent article by Simon Peyton Jones, Beautiful Concurrency that does a great job of explaining to the uninitiated Haskell’s beautiful approach to concurrency. It’s not a new article, but it has been adapted to the new format of the School of Haskell by Yours Truly, which means you’ll be able to run code examples embedded in it. If you feel adventurous, you might even edit them and see how the results change.


It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought the things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among others things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to a smaller units of execution, the tasks?. If multiple tasks are reusing the same thread should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Pattern Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and it also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion. Pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start and possibly block waiting for responses from the rest of them, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with launch_policy::async must always create, or at least immediately steal, a thread.

The situation is different with std::async called with the default launch policy (the bitwise OR of launch_policy::async and launch_policy::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: Disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yaskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost::ASIO). This is a hot topic at Microsoft because the new Windows 8 runtime is based on asynchronous API — any call that might take more than 50ms is implemented as an asynchronous API. Of course you can program to asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing who will write a channels proposal). By the way, futures can be also viewed as channels, so this generalization might go a long way.

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.


My new blog post is at the FP Commplete web site where I work now. It explains the major unsolved problem of imperative programming and why I turned to functional programming. There is also an animated discussion on reddit.

 


How would you like a job in the supercomputing industry? Programming those powerful Ks, Jaguars, Roadrunners, Blue Genes, or gigantic clusters of computers? How inspiring would that be?

Not much, according to the luminaries of the field. I went to a panel about the future of supercomputing at SC11, and learned that the future is… Fortran, MPI, OpenMP and CUDA. I have no reason to doubt the experts; after all some of them were with the industry when it was all ferrite core memory and punch cards. But it makes me wonder if there is a future at all for supercomputing, if things keep going in this direction.

Let me explain: Programming in Fortran, MPI (Message Passing Interface), OpenMP (a system of annotations for C or Fortran to help the compiler parallelize the program), and CUDA (Compute Unified Device Architecture for programming GPGPUs) is tedious, uninspiring, and boring.

I talked to a CS student who was demonstrating his summer work at the booth belonging to one of the large national labs. It was a project to improve Monte Carlo simulations of some physical processes. It was done, unsurprisingly, using MPI and OpenMP. I asked him what the exciting part of the job was. It was the learning of the Monte Carlo method. The rest was the tedium of combining barely compatible clunky programming paradigms into a workable program.

Why does it matter? Because a thriving industry or a company must attract talent. And talent can’t be bought, at least not easily. There was once a study, which showed that, above a certain compensation level, talented people don’t care so much about salaries as they do about the novelty, excitement, and freedom. Google knows it very well: They create an exciting work environments (I call them day-care centers for programmers), and encourage their employees to spend 20% of their time pursuing their own projects. No wonder there is an underground pipeline from Microsoft to Google through which the talent keeps leaking out.

By the way, I worked for Microsoft back when it was exciting. Our salaries were rather mediocre, but we felt the urge to work long hours and weekends because we felt that our contributions mattered. Unlike today, sales and marketing were not driving the company, developers were.

To confuse matters even more for the executives, programmers are relatively cheap. The cooling bill for a data center dwarfs the cost of software development. Let’s face it, from a distance, a programmer might look just like another commodity, like a computer rack, air conditioner, or a router. This is even more pronounced in supercomputing, where a single rack might go for a million dollars–an equivalent of 10-20 programmer/years.

If you drain all the excitement from work, your company, or the whole industry, is bound to stagnate. Bored people don’t innovate. And we know from experience that, in high tech industries, if you don’t innovate, you die. Old programming paradigms might have worked for years, but new unmet challenges are piling up. A lot of work that required supercomputers in the past is now done on clusters of off-the-shelf components. Google owns one of the largest supercomputers in the world, and it’s all built from cheap commodity boxes. But Google lets its people innovate.

But not everything is bleak in the land of supercomputers. I have met two teams that were brimming with ideas and enthusiasm: one was Brad Chamberlain’s Cray Chapel team and the other was Hartmut Kaiser’s Louisiana State University Ste||ar team. I’m sure there were many others, but those were the ones I had the pleasure of meeting outside of the exhibition hall.

You can tell that a team is dedicated to a task if they can’t stop talking about their work even after a few beers. Young creative people are attracted like moths to interesting and challenging projects. I don’t think writing simulations using OpenMP and MPI, even if they run on Cray X-MP, can generate this kind of enthusiasm.


The latest tutorial is out. I talk in some depth about condition variables and then show how to use them in constructing a message queue. I use the message queue to implement message-passing server threads.

(You can also follow me on Google+, if you search for Bartosz Milewski.)


I firmly believe that supercomputing of today is the mainstream computing of tomorrow. A year and a half ago I wrote a blog about the future of concurrent programming based on new developments in systems and languages in the HPC (High-Performance Computing) community. Hopefully, this year I’ll learn more at the SC11 conference that’s taking place in Seattle in November (my employer, Corensic, will have a booth there). I’m especially interested in Chapel, the HPCS (High Productivity Computing Systems) language under development by Cray Inc., also here in Seattle. There will be a whole-day Chapel tutorial at SC11, which I’m going to attend.

Why Chapel? Whenever I go to a conference and hear about a new language development to support parallel programming, I immediately compare it with Chapel. Chapel does task-based parallelism better than Cilk, TBB or PPL; data-based parallelism better than AMP or ArBB; generic programming better than D (sorry, Andrei, I’m really partial to concepts) — the list goes on. It’s unfortunate that Chapel is pigeonholed as an HPC language, because it’s perfectly adequate for general purpose programming. In fact I installed it on my laptop and wrote a few programs in it.

A lot of HPC is dedicated to scientific computations, modeling of complex systems, and processing of large quantities of data. That’s where parallel and distributed programming shines. There is no doubt in my mind that the kind of computational power that’s used in scientific computations today will soon be available on game consoles, desktops, and then on tablets and smartphones, likely in concert with cloud computing. But we are not going to use our iPhones to simulate chain reactions in nuclear warheads or heat conduction in rocket engines, are we?

So what everyday tasks could benefit from this kind of power? Obviously the game industry has insatiable appetite for computing resources. Enhanced and virtual reality are peeking from around the corner. Speech recognition and natural language processing have already made inroads into smartphones. But I’m sure that, once the power is there, we will find plenty of new and unexpected applications — If you build it, they will come.

The question is: How do we write programs that can harness the power of multicores, GPUs, and distributed systems like the Cloud — possibly all three at the same time? One thing I know for sure: Not by painstakingly managing threads, locks, message passing, copying of data over the network, etc. And this is where the current C++ (C++11) is stuck, and Chapel blazes the trail.

The major advantage of Chapel, in my mind, is that it separates the logic of the algorithm from the details of its implementation on a particular system. In the ideal world you would write a program in a high-level language and the compiler plus runtime would figure out how to run in on a particular system — what can be run in parallel, which parts can be delegated to GPUs, which parts can migrate to other machines on the network, and so on. Well, we can always dream! In reality the programmer must still tell the compiler all those things. Yes, you can do this in C++ but you’ll make your code totally unreadable. The details of implementation would quickly obscure the heart of the algorithm.

In Chapel, you express parallelism in terms of tasks; not threads, thread pools, processes, or computers. You express communications in terms of shared global address space that can span whole clusters of computers. Separately, on the side, you describe the distribution of computations in terms of locales. Each node on the network is a separate locale. Each GPU is a locale (this feature is still under development). You define your data structure in global address space, but you separately describe how you would like it to be cut up and distributed between various locales.

You may see elements of this approach in other languages, libraries, and language extensions, but never in such comprehensive manner as in Chapel. Tasks, for instance, appear in Cilk, PPL (Parallel Pattern Library), and TBB (Threading Buildg Blocks), together with elements of data-driven parallelism. Intel extended its TBB library to ArBB (Array Building Blocks); Microsoft came up with a C++ extension, AMP (Accelerated Massive Parallelism); AMD put its weight behind OpenCL — everybody and his brother are trying to catch the wave of parallelism and high-throughput computing. It just so happens that the HPC crowd has been riding this wave for a long time and there’s a lot we can learn from them.

Which is why Seattle will be hot during the week of November 12-18, no matter what the weather reports predict.

Additional Links

  1. Chapel events at SC11
  2. SCC11 schedule
  3. Birds of a Feather, Chapel Lightning Talks

In this tutorial:

  • I summarize safe ways of passing arguments to threads, and their gotchas
  • Show an optimization of monitors based on epochs, together with its maintenance pitfalls
  • Debug the resulting data race


(You can also follow me on Google+, if you search for Bartosz Milewski.)


Why did I do six concurrency tutorials without mentioning mutexes? I think people resort to explicit locking much too early. In this installment I compare two implementations side by side and the results might be surprising. One is moving data between threads (the new C++11 move semantics), the other is using a shared monitor. Whatever the overheads of copying or locking are, they are drowned by the work the threads are doing; and locking is much more error-prone (especially if you try to optimize it).

(You can also follow me on Google+, if you search for Bartosz Milewski.)


I wrote a new parallel directory listing program with C++11 async tasks, but this time the number of tasks operating in parallel was bounded. The result was an algorithm that reminded me of MapReduce, so I described how MapReduce works. Here’s the video.

(You can also follow me on Google+, if you search for Bartosz Milewski.)

« Previous PageNext Page »