Chapel



How would you like a job in the supercomputing industry? Programming those powerful Ks, Jaguars, Roadrunners, Blue Genes, or gigantic clusters of computers? How inspiring would that be?

Not much, according to the luminaries of the field. I went to a panel about the future of supercomputing at SC11, and learned that the future is… Fortran, MPI, OpenMP, and CUDA. I have no reason to doubt the experts; after all, some of them have been with the industry since the days of ferrite-core memory and punch cards. But it makes me wonder whether there is a future at all for supercomputing, if things keep going in this direction.

Let me explain: Programming in Fortran, MPI (Message Passing Interface), OpenMP (a system of annotations for C or Fortran to help the compiler parallelize the program), and CUDA (Compute Unified Device Architecture for programming GPGPUs) is tedious, uninspiring, and boring.

I talked to a CS student who was demonstrating his summer work at the booth of one of the large national labs. It was a project to improve Monte Carlo simulations of some physical processes. It was done, unsurprisingly, using MPI and OpenMP. I asked him what the exciting part of the job was. It was learning the Monte Carlo method. The rest was the tedium of combining barely compatible, clunky programming paradigms into a workable program.

Why does it matter? Because a thriving industry or company must attract talent. And talent can’t be bought, at least not easily. There was once a study showing that, above a certain compensation level, talented people don’t care so much about salaries as they do about novelty, excitement, and freedom. Google knows it very well: they create exciting work environments (I call them day-care centers for programmers) and encourage their employees to spend 20% of their time pursuing their own projects. No wonder there is an underground pipeline from Microsoft to Google through which talent keeps leaking out.

By the way, I worked for Microsoft back when it was exciting. Our salaries were rather mediocre, but we felt the urge to work long hours and weekends because we felt that our contributions mattered. Unlike today, it was developers, not sales and marketing, who drove the company.

To confuse matters even more for the executives, programmers are relatively cheap. The cooling bill for a data center dwarfs the cost of software development. Let’s face it: from a distance, a programmer might look just like another commodity, like a computer rack, an air conditioner, or a router. This is even more pronounced in supercomputing, where a single rack might go for a million dollars, the equivalent of 10-20 programmer-years.

If you drain all the excitement from work, your company, or the whole industry, is bound to stagnate. Bored people don’t innovate. And we know from experience that, in high tech industries, if you don’t innovate, you die. Old programming paradigms might have worked for years, but new unmet challenges are piling up. A lot of work that required supercomputers in the past is now done on clusters of off-the-shelf components. Google owns one of the largest supercomputers in the world, and it’s all built from cheap commodity boxes. But Google lets its people innovate.

But not everything is bleak in the land of supercomputers. I have met two teams that were brimming with ideas and enthusiasm: one was Brad Chamberlain’s Cray Chapel team and the other was Hartmut Kaiser’s Louisiana State University Ste||ar team. I’m sure there were many others, but those were the ones I had the pleasure of meeting outside of the exhibition hall.

You can tell that a team is dedicated to a task if they can’t stop talking about their work even after a few beers. Young creative people are attracted like moths to interesting and challenging projects. I don’t think writing simulations using OpenMP and MPI, even if they run on a Cray X-MP, can generate this kind of enthusiasm.


I firmly believe that supercomputing of today is the mainstream computing of tomorrow. A year and a half ago I wrote a blog about the future of concurrent programming based on new developments in systems and languages in the HPC (High-Performance Computing) community. Hopefully, this year I’ll learn more at the SC11 conference that’s taking place in Seattle in November (my employer, Corensic, will have a booth there). I’m especially interested in Chapel, the HPCS (High Productivity Computing Systems) language under development by Cray Inc., also here in Seattle. There will be a whole-day Chapel tutorial at SC11, which I’m going to attend.

Why Chapel? Whenever I go to a conference and hear about a new language development to support parallel programming, I immediately compare it with Chapel. Chapel does task-based parallelism better than Cilk, TBB or PPL; data-based parallelism better than AMP or ArBB; generic programming better than D (sorry, Andrei, I’m really partial to concepts) — the list goes on. It’s unfortunate that Chapel is pigeonholed as an HPC language, because it’s perfectly adequate for general purpose programming. In fact I installed it on my laptop and wrote a few programs in it.
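
To give a taste of what those programs looked like, here is a tiny sketch of my own (not taken from the Chapel documentation) showing the two constructs I reach for first: a data-parallel forall loop and a cobegin block that runs its statements as concurrent tasks:

// a minimal sketch; builds with the standard Chapel compiler (chpl)
config const n = 8;                  // can be overridden at run time: ./a.out --n=16

// data parallelism: the iterations of a forall may execute concurrently
forall i in 1..n do
  writeln("processing element ", i);

// task parallelism: cobegin runs each of its statements as a separate task and waits for both
cobegin {
  writeln("task A");
  writeln("task B");
}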

A lot of HPC is dedicated to scientific computations, modeling of complex systems, and processing of large quantities of data. That’s where parallel and distributed programming shines. There is no doubt in my mind that the kind of computational power that’s used in scientific computations today will soon be available on game consoles, desktops, and then on tablets and smartphones, likely in concert with cloud computing. But we are not going to use our iPhones to simulate chain reactions in nuclear warheads or heat conduction in rocket engines, are we?

So what everyday tasks could benefit from this kind of power? Obviously the game industry has an insatiable appetite for computing resources. Enhanced and virtual reality are peeking from around the corner. Speech recognition and natural language processing have already made inroads into smartphones. But I’m sure that, once the power is there, we will find plenty of new and unexpected applications. If you build it, they will come.

The question is: How do we write programs that can harness the power of multicores, GPUs, and distributed systems like the Cloud — possibly all three at the same time? One thing I know for sure: Not by painstakingly managing threads, locks, message passing, copying of data over the network, etc. And this is where the current C++ (C++11) is stuck, and Chapel blazes the trail.

The major advantage of Chapel, in my mind, is that it separates the logic of the algorithm from the details of its implementation on a particular system. In the ideal world you would write a program in a high-level language and the compiler plus runtime would figure out how to run it on a particular system — what can be run in parallel, which parts can be delegated to GPUs, which parts can migrate to other machines on the network, and so on. Well, we can always dream! In reality the programmer must still tell the compiler all those things. Yes, you can do this in C++, but you’ll make your code totally unreadable. The details of implementation would quickly obscure the heart of the algorithm.

In Chapel, you express parallelism in terms of tasks, not threads, thread pools, processes, or computers. You express communication in terms of a shared global address space that can span whole clusters of computers. Separately, on the side, you describe the distribution of computations in terms of locales. Each node on the network is a separate locale. Each GPU is a locale (this feature is still under development). You define your data structures in the global address space, but you separately describe how you would like them to be cut up and distributed between the various locales.
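
Here’s a rough sketch of my own of how these pieces fit together, assuming the program is launched on several locales (e.g. with -nl 4): one task per locale, execution migrated with on, and a single variable visible from everywhere through the global address space:

var globalVal = 42;                    // allocated on locale 0, but visible from every locale

coforall loc in Locales do             // spawn one task per locale and wait for all of them
  on loc do                            // migrate the task's execution to that locale
    writeln("locale ", here.id, " of ", numLocales, " sees globalVal = ", globalVal);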

You may see elements of this approach in other languages, libraries, and language extensions, but never in such a comprehensive manner as in Chapel. Tasks, for instance, appear in Cilk, PPL (Parallel Pattern Library), and TBB (Threading Building Blocks), together with elements of data-driven parallelism. Intel extended its TBB library to ArBB (Array Building Blocks); Microsoft came up with a C++ extension, AMP (Accelerated Massive Parallelism); AMD put its weight behind OpenCL — everybody and his brother is trying to catch the wave of parallelism and high-throughput computing. It just so happens that the HPC crowd has been riding this wave for a long time and there’s a lot we can learn from them.

Which is why Seattle will be hot during the week of November 12-18, no matter what the weather reports predict.

Additional Links

  1. Chapel events at SC11
  2. SC11 schedule
  3. Birds of a Feather, Chapel Lightning Talks

Concurrent programming has been around for quite some time (almost half a century), but it was mostly accessible to the highest ranks of the programming priesthood. This changed when, in the 2000s, concurrency entered prime time, prodded by the ubiquity of multicore processors. The industry can no longer afford to pay for hand-crafted hand-debugged concurrent solutions. The search is on for programming paradigms that lead to high productivity and reliability.

Recently, DARPA created its HPCS (High Productivity Computing Systems) program (notice the stress on productivity) and is funding research that has led to, among other things, the development of new programming languages that support concurrency. I had a peek at those languages and saw some very interesting developments which I’d like to share with you. But first let me give you some perspective on the current state of concurrency.

Latency vs. Parallel Performance

We are living in the world of multicores and our biggest challenge is to make use of parallelism to speed up the execution of programs. As Herb Sutter famously observed, we can no longer count on the speed of processors increasing exponentially. If we want our programs to run faster we have to find ways to take advantage of the exponential increase in the number of available processors.

Even before multicores, concurrency had been widely used to reduce latencies and to take advantage of the parallelism of some peripherals (e.g., disks). Low latency is particularly important in server architectures, where you can’t wait for the completion of one request before you pick up the next one.

Initially, the same approach that was used to reduce latency–creating threads and using message passing to communicate between them–was also applied to improving parallel performance. It worked, but it turned out to be extremely tedious as a programming paradigm.

There will always be a niche for client/server architectures, which are traditionally based on direct use of threads and message passing. But in this post I will limit myself to strategies for maximizing program performance using parallelism. These strategies are becoming exponentially more important with time.

Threads? Who Needs Threads?

There are two extremes in multithreading. One is to make each thread execute different code–they all have different thread functions (MPMD–Multiple Programs Multiple Data, or even MPSD–Multiple Programs Single Data). This approach doesn’t scale very well with the number of cores–after all there is a fixed number of thread functions in one program no matter how many cores it’s running on.

The other extreme is to spawn many threads that essentially do the same thing (SPMD, Single Program Multiple Data). At the very extreme we have SIMD (Single Instruction Multiple Data), the approach taken by, e.g., graphics accelerators. There, multiple cores execute the same instructions in lockstep. The trick is to start them with slightly different initial states and data sets (the MD part), so they all do independently useful work. With SPMD, you don’t have to write lots of different thread functions and you have a better chance of scaling with the number of cores.

There are two kinds of SPMD. When you start with a large data set and split it into smaller chunks and create separate threads to process each chunk, it’s called data-driven parallelism. If, on the other hand, you have a loop whose iterations don’t depend on each other (no data dependencies), and you create multiple threads to process individual iterations (or groups thereof), it’s called control-driven parallelism.
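
Using Chapel (discussed above) for concreteness, the two flavors might look something like this; the sketch is my own illustration, not tied to any particular library:

const n = 1000;
var data, results: [1..n] real;

// data-driven: the data set itself is carved up between tasks
forall x in data do
  x = 2.0 * x + 1.0;

// control-driven: independent loop iterations are divided between tasks
forall i in 1..n do
  results[i] = i * i;

Note that neither loop mentions threads or core counts.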

These types of parallelism are becoming more and more important because they allow programs to be written independently of the number of available cores. And, more importantly, they can be partially automated (see my blog on semi-implicit parallelism).

In the past, to exploit parallelism, you had to create your own thread pools, assign tasks to them, balance the loads, etc., by hand. Having this capacity built into the language (or a library) takes your mind off the gory details. You gain productivity not by manipulating threads but by identifying the potential for parallelism and letting the compiler and the runtime take care of the details.

In task-driven parallelism, the programmer is free to pick an arbitrary granularity for potential parallelism, rather than being forced into the coarse grain of system threads. As always, one more degree of indirection solves the problem. The runtime may choose to multiplex tasks between threads and implement work-stealing queues, as is done in Haskell, TPL, or TBB. Programming with tasks rather than threads is also less obtrusive, especially if it has direct support in the language.
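
In Chapel, where tasks are built into the language, a rough sketch might look like this: each begin creates a lightweight task, and the enclosing sync statement waits for all tasks spawned inside it, while the runtime decides how to multiplex them onto threads:

sync {                                                   // wait for every task spawned inside this block
  for i in 1..100 do
    begin writeln("task ", i, " on locale ", here.id);   // each 'begin' creates a task, not an OS thread
}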

Shared Memory or Message Passing?

Threads share memory, so it’s natural to use shared memory for communication between threads. It’s fast but, unfortunately, plagued with data races. To avoid races, access to shared (mutable) memory must be synchronized. Synchronization is usually done using locks (critical sections, semaphores, etc.). It’s up to the programmer to come up with locking protocols that eliminate races and don’t lead to deadlocks. This, unfortunately, is hard, very hard. It’s definitely not a high-productivity paradigm.

One way to improve the situation is to hide the locking protocols inside message queues and switch to message passing (MP). The programmer no longer worries about data races or deadlocks. As a bonus, MP scales naturally to distributed, multi-computer, programming. Erlang is the prime example of a programming language that uses MP exclusively as its concurrency paradigm. So why isn’t everybody happy with just passing messages?

Unfortunately, not all problems lend themselves easily to MP solutions–in particular, data-driven parallelism is notoriously hard to express using message passing. And then there is the elephant in the room–inversion of control.

We think linearly and we write (and read) programs linearly–line by line. The more non-linear a program gets, the harder it is to design, code, and maintain (gotos are notorious for delinearizing programs). We can pretty easily deal with some of the simpler MP protocols–like the synchronous one. You send a message and wait for the response. In fact, object-oriented programming is based on such a protocol–in Smalltalk parlance you don’t “call” a method, you “send a message” to an object. Things get a little harder when you send an asynchronous message, then do some other work, and finally “force” the answer (that’s how futures work); although that’s still manageable. But if you send a message to one thread and set up a handler for the result in another, or have one big receive or select statement at the top to process heterogeneous messages from different sources, you are heading towards the land of spaghetti. If you’re not careful, your program turns into a collection of handlers that keep firing at random times. You are no longer controlling the flow of execution; it’s the flow of messages that’s controlling you. Again, programmer productivity suffers. (Some research shows that the total effort to write an MPI application is significantly higher than that required to write a shared-memory version of it.)

Back to shared memory, this time without locks, but with transactional support. STM, or Software Transactional Memory, is a relatively new paradigm that’s been successfully implemented in Haskell, where the type system makes it virtually fool-proof. So far, implementations of STM in other languages haven’t been as successful, mostly because of problems with performance, isolation, and I/O. But that might change in the future–at least that’s the hope.

What is great about STM is its ease of use and reliability. You access shared memory, either for reading or writing, within atomic blocks. All code inside an atomic block executes as if there were no other threads, so there’s no need for locking. And, unlike lock-based programming, STM scales well.

There is a classic STM example in which, within a single atomic block, money is transferred between two bank accounts that are concurrently accessed by other threads. This is very hard to do with traditional locks, since you must lock both accounts before you do the transfer. That forces you not only to expose those locks but also puts you at risk of deadlock.

As far as programmer productivity goes, STM is a keeper.

Three HPCS Languages

Yesterday’s supercomputers are today’s desktops and tomorrow’s phones. So it makes sense to look at the leading edge of supercomputer research for hints about the future. As I mentioned in the introduction, there is a well-funded DARPA program, HPCS, to develop concurrent systems. There were three companies in stage 2 of this program, and each of them decided to develop a new language:

  1. Cray: Chapel
  2. IBM: X10
  3. Sun: Fortress

(Sun didn’t make it to the third stage, but Fortress is still under development.)

I looked at all three languages and was surprised at how similar they were. Three independent teams came up with very similar solutions–that can’t be a coincidence. For instance, all three adopted the shared address space abstraction. Mind you, those languages are supposed to cover a large area: from single-computer multicore programming (in some cases even on an SIMD graphics processor) to distributed programming over huge server farms. You’d think they’d be using message passing, which scales reasonably well between the two extremes. And indeed they do, but without exposing it to the programmer. Message passing is a hidden implementation detail. It’s considered too low-level to bother the programmer with.

Running on PGAS

All three HPCS languages support the shared address space abstraction through some form of PGAS, Partitioned Global Address Space. PGAS provides a unified way of addressing memory across machines in a cluster of computers. The global address space is partitioned into various locales (places in X10 and regions in Fortress) corresponding to local address spaces of processes and computers. If a thread tries to access a memory location within its own locale, the access is direct and shared between threads of the same locale. If, on the other hand, it tries to access a location in a different locale, messages are exchanged behind the scenes. Those could be inter-process messages on the same machine, or network messages between computers. Obviously, there are big differences in performance between local and remote accesses. Still, they may look the same from the programmer’s perspective. It’s just one happy address space.
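
Here is a small sketch of my own, in Chapel, of what “one happy address space” means in practice: the same variable is incremented from a remote locale and then read locally, with identical syntax in both places (run with more than one locale, e.g. -nl 2, for the first access to be genuinely remote):

var counter = 0;                         // allocated on locale 0, where the program starts

on Locales[numLocales-1] do              // execute the statement on the last locale
  counter += 1;                          // from a remote locale this turns into messages under the hood

writeln("counter = ", counter);          // plain local access back on locale 0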

By now you must be thinking: “What? I have no control over performance?” That would be a bad idea indeed. Don’t worry, the control is there, either explicit (locales are addressable) or in the form of locality awareness (affinity of code and data) or through distributing your data in data-driven parallelism.

Let’s talk about data parallelism. Suppose you have to process a big 2-D array and have 8 machines in your cluster. You would probably split the array into 8 chunks and spread them between the 8 locales. Each locale would take care of its chunk (and possibly some marginal areas of overlap with other chunks). If you’re expecting your program to also run in other cluster configurations, you’d need more intelligent partitioning logic. In Chapel, there is a whole embedded language for partitioning domains (index sets) and specifying their distribution among locales.

To make things more concrete, let’s say you want to distribute a (not so big) 4×8 matrix among currently available locales by splitting it into blocks and mapping each block to a locale. First you want to define a distribution–the prescription of how to distribute a block of data between locales. Here’s the relevant code in Chapel:

use BlockDist;  // the standard Block distribution lives in this module

const Dist = new dmap(new Block(boundingBox=[1..4, 1..8]));

A Block distribution is created with a bounding rectangle of dimension 4×8. This block is passed as an argument to the constructor of a domain map. If, for instance, the program is run on 8 locales, the block will be mapped into 8 2×2 regions, each assigned to a different locale. Libraries are supposed to provide many different distributions–block distribution being the simplest and the most useful of them.

When you apply the above map to a domain, you get a mapped domain:

var Dom: domain(2) dmapped Dist = [1..4, 1..8];

Here the variable Dom is a 2-D domain (a set of indices) mapped using the distribution Dist that was defined above. Compare this with a regular local domain–a set of indices (i, j), where i is between 1 and 4 (inclusive) and j is between 1 and 8.

var Dom: domain(2) = [1..4, 1..8];

Domains are used, for instance, for iteration. When you iterate over an unmapped domain, all calculations are done within the current locale (possibly in parallel, depending on the type of iteration). But if you do the same over a mapped domain, the calculations will automatically migrate to different locales.
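
For example (a sketch of mine building on the Dist and Dom definitions above), declaring an array over the mapped domain gives you a block-distributed array, and a forall over that domain runs each iteration on the locale that owns the corresponding index:

var A: [Dom] real;                       // a 4x8 array distributed according to Dist

forall (i, j) in Dom do
  A[i, j] = i + j / 100.0;               // computed on the locale that owns (i, j)

writeln(A);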

This model of programming is characterized by a very important property: separation of algorithm from implementation. You separately describe implementation details, such as the distribution of your data structure between threads and machines; but the actual calculation is coded as if it were local and performed over monolithic data structures. That’s a tremendous simplification and a clear productivity gain.

Task-Driven Parallelism

The processing of very large (potentially multi-dimensional) arrays is very useful, especially in scientific modeling and graphics. But there are also many opportunities for parallelism in control-driven programs. All three HPCS languages chose fine-grained tasks (activities in X10, implicit threads in Fortress), not threads, as their unit of parallel execution. A task may be mapped to a thread by the runtime system but, in general, this is not a strict requirement. Bunches of small tasks may be bundled for execution within a single system thread. Fortress went the furthest in making parallelism implicit–even the for loop is by default parallel.

From the programmer’s perspective, task-driven parallelism doesn’t expose threads (there is no need for a fork statement or other ways of spawning threads). You simply start a potentially parallel computation. In Fortress you use a for loop or put separate computations in a tuple (tuples are evaluated in parallel, by default). In Chapel, you use the forall statement for loops or begin to start a task. In X10 you use async to mark parallelizable code.

What that means is that you don’t have to worry about how many threads to spawn, how to manage a thread pool, or how to balance the load. The system will spawn the threads for you, depending on the number of available cores, and it will distribute tasks between them and take care of load balancing, etc. In many cases it will do better than a hand-crafted thread-based solution. And in all cases the code will be simpler and easier to maintain.

Global View vs. Fragmented View

If you were to implement parallel computation using traditional methods, for instance MPI (Message Passing Interface), instead of allocating a single array you’d allocate multiple chunks. Instead of writing an algorithm that operates on this array, you’d write an algorithm that operates on chunks, with a lot of code managing boundary cases and communication. Similarly, to parallelize a loop you’d have to partially unroll it and, again, take care of such details as the uneven tail, etc. These approaches result in a fragmented view of the problem.

What HPCS languages offer is global-view programming. You write your program in terms of data structures and control flows that are not chopped up into pieces according to where they will be executed. The global-view approach results in clearer programs that are easier to write and maintain.

Synchronization

No Synchronization?

A lot of data-driven algorithms don’t require much synchronization. Many scientific simulations use read-only arrays for input, and make only localized writes to output arrays.

Consider for instance how you’d implement the famous Game of Life. You’d probably use a read-only array for the previous snapshot and a writable array for the currently evaluated state. Both arrays would be partitioned in the same way between locales. The main loop would go over all array elements and concurrently calculate each one’s new state based on the previous state of its nearest neighbors. Notice that while the neighbors are sometimes read concurrently by multiple threads, the output is always stored once. The only synchronization needed is a thread barrier at the end of each cycle.
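
A sketch of one generation in Chapel might look as follows; this is my own illustration, boundary handling is glossed over, and the grid here is a plain local domain (to distribute it you would dmap it exactly like the matrix example earlier):

const World = [1..100, 1..100];           // the full grid (2011-era domain-literal syntax)
const Interior = World.expand(-1);        // skip the border cells for simplicity
var prev, next: [World] int;              // previous (read-only) and current snapshots

forall (i, j) in Interior {
  // count live neighbors by reading the previous, effectively read-only, snapshot
  var alive = 0;
  for (di, dj) in [-1..1, -1..1] do
    if di != 0 || dj != 0 then
      alive += prev[i+di, j+dj];
  // each (i, j) is written exactly once, so no locking is needed
  next[i, j] = if alive == 3 || (alive == 2 && prev[i, j] == 1) then 1 else 0;
}
// the end of the forall is the only synchronization point: the per-generation barrier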

The current approach to synchronization in HPCS languages is the biggest disappointment to me. Data races are still possible and, since parallelism is often implicit, harder to spot.

The biggest positive surprise was that all three endorsed transactional memory, at least syntactically, through the use of atomic statements. They didn’t solve the subtleties of isolation, so safety is not guaranteed (if you promise not to access the same data outside of atomic transactions, the transactions are isolated from each other, but that’s all).

The combination of STM and PGAS in Chapel necessitates the use of distributed STM, an area of active research (see, for instance, Software Transactional Memory for Large Scale Clusters).

In Chapel, you not only have access to atomic statements (which are still in the implementation phase) and barriers, but also to low level synchronization primitives such as sync and single variables–somewhat similar to locks and condition variables. The reasoning is that Chapel wants to provide multi-resolution concurrency support. The low level primitives let you implement concurrency in the traditional style, which might come in handy, for instance, in MPMD situations. The high level primitives enable global view programming that boosts programmer productivity.
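
To illustrate the low-level end of that spectrum, here is a tiny sketch of a sync variable used as a one-shot handshake between two tasks (the $ suffix is just a naming convention for sync variables):

var result$: sync int;                    // a sync variable starts out "empty"

begin result$ = 6 * 7;                    // the write fills it

// reading an empty sync variable blocks until some task fills it (and empties it again)
writeln("got ", result$);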

However, no matter what synchronization mechanisms are used (including STM), if the language doesn’t enforce their use, programmers end up with data races–the bane of concurrent programming. The time spent debugging racy programs may significantly cut into, or even nullify, potential productivity gains. Fortress is the only language that attempted to keep track of which data is shared (and, therefore, requires synchronization) and which is local. None of the HPCS languages tried to tie sharing and synchronization to the type system in the way it is done, for instance, in the D programming language (see also my posts about race-free multithreading).

Conclusions

Here’s the tongue-in-cheek summary of the trends which, if you believe that the HPCS effort provides a glimpse of the future, will soon be entering the mainstream:

  1. Threads are out (demoted to latency controlling status), tasks (and semi-implicit parallelism) are in.
  2. Message passing is out (demoted to implementation detail), shared address space is in.
  3. Locks are out (demoted to low-level status), transactional memory is in.

I think we’ve been seeing the twilight of thread-based parallelism for some time (see my previous post on Parallel Programming with Hints). It’s just not the way to fully exploit hardware concurrency. Traditionally, if you wanted to increase the performance of your program on multicore machines, you had to go into the low-level business of managing thread pools, splitting your work between processors, etc. This is now officially considered the assembly language of concurrency and has no place in high-level programming languages.

Message passing’s major flaw is the inversion of control–it is the moral equivalent of goto in unstructured programming (it’s about time somebody said that message passing is considered harmful). MP still has its applications and, used in moderation, can be quite handy; but PGAS offers a much more straightforward programming model–its essence being the separation of implementation from algorithm. The Platonic ideal would be for the language to figure out the best parallel implementation for a particular algorithm. Since this is still a dream, the next best thing is getting rid of the interleaving of the two in the same piece of code.

Software transactional memory has been around for more than a decade now and, despite some negative experiences, is still alive and kicking. It’s by far the best paradigm for synchronizing shared memory access. It’s unobtrusive, deadlock-free, and scalable. If we could only get better performance out of it, it would be ideal. The hope, though, is in moving some of the burden of TM to hardware.

Sadly, the ease of writing parallel programs using those new paradigms does not go hand in hand with improving program correctness. My worry is that chasing concurrency bugs will eat into productivity gains.

What are the chances that we’ll be writing programs for desktops in Chapel, X10, or Fortress? Probably slim. Chances are, though, that the paradigms used in those languages will keep showing up in existing and future languages. You may have already seen task-driven libraries sneaking into the mainstream (e.g., the .NET TPL). There is a PGAS extension of C called UPC (Unified Parallel C), and there are dialects of several languages, like Java, C, C#, and Scala, that experiment with STM. Expect more in the future.

Acknowledgments

Special thanks go to Brad Chamberlain of the Cray Chapel team for his help and comments. As usual, a heated exchange with the Seattle Gang, especially Andrei Alexandrescu, led to numerous improvements in this post.

Bibliography

  1. DARPA’s High Productivity Computing Systems
  2. Cray Chapel
  3. IBM X10
  4. Sun Fortress
  5. Message Passing Interface specification
  6. Unified Parallel C, PGAS in C
  7. Microsoft Task Parallel Library
  8. Intel Threading Building Blocks
  9. HPC Programmer Productivity: A Case Study of Novice HPC Programmers
  10. Software Transactional Memory for Large Scale Clusters
  11. Scalable Software Transactional Memory for Global Address Space Architectures
  12. A Brief Retrospective on Transactional Memory
  13. Designing an Effective Hybrid Transactional Memory System
  14. Concurrency in the D Programming Language