Concurrent programming has been around for quite some time (almost half a century), but it was mostly accessible to the highest ranks of the programming priesthood. This changed in the 2000s, when concurrency entered prime time, prodded by the ubiquity of multicore processors. The industry can no longer afford to pay for hand-crafted, hand-debugged concurrent solutions. The search is on for programming paradigms that lead to high productivity and reliability.
Recently DARPA created its HPCS (High Productivity Computing Systems) program–notice the stress on productivity–and is funding research that has led to, among other things, the development of new programming languages that support concurrency. I had a peek at those languages and saw some very interesting developments which I’d like to share with you. But first let me give you some perspective on the current state of concurrency.
Latency vs. Parallel Performance
We are living in the world of multicores and our biggest challenge is to make use of parallelism to speed up the execution of programs. As Herb Sutter famously observed, we can no longer count on the speed of processors increasing exponentially. If we want our programs to run faster we have to find ways to take advantage of the exponential increase in the number of available processors.
Even before multicores, concurrency had been widely used to reduce latencies and to take advantage of the parallelism of some peripherals (e.g., disks). Low latency is particularly important in server architectures, where you can’t wait for the completion of one request before you pick up the next one.
Initially, the same approach that was used to reduce latency–creating threads and using message passing to communicate between them–was applied to improving parallel performance. It worked, but it turned out to be extremely tedious as a programming paradigm.
There will always be a niche for client/server architectures, which are traditionally based on direct use of threads and message passing. But in this post I will limit myself to strategies for maximizing program performance using parallelism. These strategies are becoming exponentially more important with time.
Threads? Who Needs Threads?
There are two extremes in multithreading. One is to make each thread execute different code–they all have different thread functions (MPMD–Multiple Programs Multiple Data, or even MPSD–Multiple Programs Single Data). This approach doesn’t scale very well with the number of cores–after all, there is a fixed number of thread functions in one program no matter how many cores it’s running on.
The other extreme is to spawn many threads that essentially do the same thing (SPMD, Single Program Multiple Data). At the very extreme we have SIMD (Single Instruction Multiple Data), the approach taken, e.g., by graphics accelerators. There, multiple cores execute the same instructions in lockstep. The trick is to start them with slightly different initial states and data sets (the MD part), so they all do independently useful work. With SPMD, you don’t have to write lots of different thread functions and you have a better chance of scaling with the number of cores.
There are two kinds of SPMD. When you start with a large data set and split it into smaller chunks and create separate threads to process each chunk, it’s called data-driven parallelism. If, on the other hand, you have a loop whose iterations don’t depend on each other (no data dependencies), and you create multiple threads to process individual iterations (or groups thereof), it’s called control-driven parallelism.
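Neither flavor requires an HPCS language to picture. Here is a rough Python sketch using a thread pool (the function names are invented for the illustration; a real data-parallel runtime would map the work to processes or hardware lanes rather than GIL-bound threads):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each chunk is processed independently of the others.
    return sum(x * x for x in chunk)

def data_driven(data, n_workers=4):
    # Data-driven: split one big data set into chunks, one task per chunk.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

def control_driven(n, n_workers=4):
    # Control-driven: independent loop iterations become tasks.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda i: i * i, range(n)))
```

In both cases the code is written against the whole data set or the whole loop; only the splitting logic knows about workers.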
These types of parallelism are becoming more and more important because they allow programs to be written independently of the number of available cores. And, more importantly, they can be partially automated (see my blog on semi-implicit parallelism).
In the past, to exploit parallelism, you had to create your own thread pools, assign tasks to them, balance the loads, etc., by hand. Having this capability built into the language (or a library) takes your mind off the gory details. You gain productivity not by manipulating threads but by identifying the potential for parallelism and letting the compiler and the runtime take care of the details.
In task-driven parallelism, the programmer is free to pick arbitrary granularity for potential parallelizability, rather than being forced into the large grain of system threads. As always, one more degree of indirection solves the problem. The runtime may choose to multiplex tasks between threads and implement work-stealing queues, as is done in Haskell, TPL, or TBB. Programming with tasks rather than threads is also less obtrusive, especially if it has direct support in the language.
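As a sketch of the idea (plain Python, not any of the runtimes mentioned above): thousands of tiny tasks can be handed to a pool that multiplexes them onto a handful of worker threads. The programmer picks the granularity; the runtime picks the mapping.

```python
from concurrent.futures import ThreadPoolExecutor

def tiny_task(i):
    # Far too fine-grained to deserve its own system thread.
    return i * i

# 10,000 tasks, multiplexed onto just 4 worker threads by the pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tiny_task, range(10_000)))
```

Nothing in the task code says how many threads exist; bump `max_workers` and the same tasks spread over more cores.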
Shared Memory or Message Passing?
Threads share memory, so it’s natural to use shared memory for communication between threads. It’s fast but, unfortunately, plagued with data races. To avoid races, access to shared (mutable) memory must be synchronized. Synchronization is usually done using locks (critical sections, semaphores, etc.). It’s up to the programmer to come up with locking protocols that eliminate races and don’t lead to deadlocks. This, unfortunately, is hard, very hard. It’s definitely not a high-productivity paradigm.
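A minimal Python sketch of the lock-based discipline shows the core weakness: the protocol–take the lock around every access to the shared variable–is enforced only by programmer convention.

```python
import threading

counter = 0                # shared mutable state
lock = threading.Lock()    # the protocol: take this lock first

def increment(n):
    global counter
    for _ in range(n):
        # Nothing but convention forces a thread to take the lock
        # before touching the counter; forget it once and you have
        # a data race.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000
```

With more than one lock, the programmer must also invent an acquisition order to avoid deadlocks–by hand.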
One way to improve the situation is to hide the locking protocols inside message queues and switch to message passing (MP). The programmer no longer worries about data races or deadlocks. As a bonus, MP scales naturally to distributed, multi-computer, programming. Erlang is the prime example of a programming language that uses MP exclusively as its concurrency paradigm. So why isn’t everybody happy with just passing messages?
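Here is the queue-based alternative, sketched in Python: `queue.Queue` hides all the locking, and the worker sees only messages (the `None` sentinel used for shutdown is just one common convention):

```python
import queue
import threading

mailbox = queue.Queue()   # all locking is hidden inside the queue
results = []

def worker():
    while True:
        msg = mailbox.get()
        if msg is None:           # sentinel: time to shut down
            break
        results.append(msg * 2)   # process the message

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    mailbox.put(i)                # "send a message"
mailbox.put(None)
t.join()
# results == [0, 2, 4, 6, 8]
```

The sender and the receiver never share mutable state directly, so there is no locking protocol to get wrong.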
Unfortunately, not all problems lend themselves easily to MP solutions–in particular, data-driven parallelism is notoriously hard to express using message passing. And then there is the elephant in the room–inversion of control.
We think linearly and we write (and read) programs linearly–line by line. The more non-linear the program gets, the harder it is to design, code, and maintain it (gotos are notorious for delinearizing programs). We can pretty easily deal with some of the simpler MP protocols–like the synchronous one. You send a message and wait for the response. In fact, object-oriented programming is based on such a protocol–in Smalltalk parlance you don’t “call” a method, you “send a message” to an object. Things get a little harder when you send an asynchronous message, then do some other work, and finally “force” the answer (that’s how futures work); although that’s still manageable. But if you send a message to one thread and set up a handler for the result in another, or have one big receive or select statement at the top to process heterogeneous messages from different sources, you are heading towards the land of spaghetti. If you’re not careful, your program turns into a collection of handlers that keep firing at random times. You are no longer controlling the flow of execution; it’s the flow of messages that’s controlling you. Again, programmer productivity suffers. (Some research shows that the total effort to write an MPI application is significantly higher than that required to write a shared-memory version of it.)
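The future-based protocol–send, keep working, then force–looks like this in Python (`slow_square` is a made-up stand-in for a remote computation):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    time.sleep(0.01)   # stand-in for a long-running remote computation
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_square, 6)   # send the asynchronous message...
    other = 1 + 1                       # ...do some other work...
    answer = fut.result()               # ...then "force" the answer
# answer == 36
```

Control flow is still linear here; it is only when replies are handled in a different place than the requests were sent that the spaghetti starts.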
Back to shared memory, this time without locks, but with transactional support. STM, or Software Transactional Memory, is a relatively new paradigm that’s been successfully implemented in Haskell, where the type system makes it virtually fool-proof. So far, implementations of STM in other languages haven’t been as successful, mostly because of problems with performance, isolation, and I/O. But that might change in the future–at least that’s the hope.
What is great about STM is its ease of use and reliability. You access shared memory, either for reading or writing, within atomic blocks. All code inside an atomic block executes as if there were no other threads, so there’s no need for locking. And, unlike lock-based programming, STM scales well.
There is a classic STM example in which, within a single atomic block, money is transferred between two bank accounts that are concurrently accessed by other threads. This is very hard to do with traditional locks, since you must lock both accounts before you do the transfer. That forces you not only to expose those locks but also puts you at risk of deadlock.
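Here is a Python sketch of the lock-based version of that transfer: both account locks are exposed and must be taken in a globally agreed order (here, by account id) to dodge deadlock–exactly the bookkeeping an atomic block would make unnecessary. The `Account` class is invented for the illustration.

```python
import threading

class Account:
    _next_id = 0

    def __init__(self, balance):
        self.id = Account._next_id
        Account._next_id += 1
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Both locks must be taken, and always in the same global order
    # (by account id); otherwise two opposite transfers can deadlock.
    first, second = sorted((src, dst), key=lambda a: a.id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
threads = [threading.Thread(target=transfer, args=(a, b, 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=(b, a, 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# total money is conserved: a.balance + b.balance == 200
```

With STM, the body of `transfer` would simply sit inside an atomic block, and no locks or lock ordering would appear in user code.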
As far as programmer productivity goes, STM is a keeper.
Three HPCS Languages
Yesterday’s supercomputers are today’s desktops and tomorrow’s phones. So it makes sense to look at the leading edge in supercomputer research for hints about the future. As I mentioned in the introduction, there is a well-funded DARPA program, HPCS, to develop concurrent systems. There were three companies in stage 2 of this program, and each of them decided to develop a new language:
- Cray: Chapel
- IBM: X10
- Sun: Fortress
(Sun didn’t get to the third stage, but Fortress is still under development.)
I looked at all three languages and was surprised at how similar they were. Three independent teams came up with very similar solutions–that can’t be a coincidence. For instance, all three adopted the shared address space abstraction. Mind you, those languages are supposed to cover a large area: from single-computer multicore programming (in some cases even on an SIMD graphics processor) to distributed programming over huge server farms. You’d think they’d be using message passing, which scales reasonably well between the two extremes. And indeed they do, but without exposing it to the programmer. Message passing is a hidden implementation detail. It’s considered too low-level to bother the programmer with.
Running on PGAS
All three HPCS languages support the shared address space abstraction through some form of PGAS, Partitioned Global Address Space. PGAS provides a unified way of addressing memory across machines in a cluster of computers. The global address space is partitioned into various locales (places in X10 and regions in Fortress) corresponding to local address spaces of processes and computers. If a thread tries to access a memory location within its own locale, the access is direct and shared between threads of the same locale. If, on the other hand, it tries to access a location in a different locale, messages are exchanged behind the scenes. Those could be inter-process messages on the same machine, or network messages between computers. Obviously, there are big differences in performance between local and remote accesses. Still, they may look the same from the programmer’s perspective. It’s just one happy address space.
By now you must be thinking: “What? I have no control over performance?” That would be a bad idea indeed. Don’t worry, the control is there, either explicit (locales are addressable) or in the form of locality awareness (affinity of code and data) or through distributing your data in data-driven parallelism.
Let’s talk about data parallelism. Suppose you have to process a big 2-D array and have 8 machines in your cluster. You would probably split the array into 8 chunks and spread them between the 8 locales. Each locale would take care of its chunk (and possibly some marginal areas of overlap with other chunks). If you’re expecting your program to also run in other cluster configurations, you’d need more intelligent partitioning logic. In Chapel, there is a whole embedded language for partitioning domains (index sets) and specifying their distribution among locales.
To make things more concrete, let’s say you want to distribute a (not so big) 4×8 matrix among currently available locales by splitting it into blocks and mapping each block to a locale. First you want to define a distribution–the prescription of how to distribute a block of data between locales. Here’s the relevant code in Chapel:
const Dist = new dmap(new Block(boundingBox=[1..4, 1..8]));
A Block distribution is created with a bounding rectangle of dimensions 4×8. This block is passed as an argument to the constructor of a domain map. If, for instance, the program is run on 8 locales, the block will be mapped into 8 2×2 regions, each assigned to a different locale. Libraries are supposed to provide many different distributions–block distribution being the simplest and the most useful of them.
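The mapping itself is easy to mimic. Here is a toy Python sketch (`block_partition` is an invented name, not Chapel’s API) that splits the 4×8 index space over a 2×4 grid of locales, giving each locale one 2×2 block:

```python
def block_partition(rows, cols, locale_grid):
    # Split a rows x cols index space into one rectangular block per
    # locale, mimicking the effect of Chapel's Block distribution.
    lr, lc = locale_grid                  # locale grid shape, e.g. 2 x 4
    br, bc = rows // lr, cols // lc       # block dimensions per locale
    blocks = {}
    for li in range(lr):
        for lj in range(lc):
            blocks[(li, lj)] = [(i, j)
                                for i in range(li * br, (li + 1) * br)
                                for j in range(lj * bc, (lj + 1) * bc)]
    return blocks

# 8 locales arranged 2 x 4; every locale owns a 2 x 2 block of indices.
blocks = block_partition(4, 8, (2, 4))
```

In Chapel this bookkeeping lives in the distribution library; user code never enumerates indices per locale by hand.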
When you apply the above map to a domain, you get a mapped domain:
var Dom: domain(2) dmapped Dist = [1..4, 1..8];
Here the variable Dom is a 2-D domain (a set of indices) mapped using the distribution Dist that was defined above. Compare this with a regular local domain–a set of indices (i, j), where i is between 1 and 4 (inclusive) and j is between 1 and 8.
var Dom: domain(2) = [1..4, 1..8];
Domains are used, for instance, for iteration. When you iterate over an unmapped domain, all calculations are done within the current locale (possibly in parallel, depending on the type of iteration). But if you do the same over a mapped domain, the calculations will automatically migrate to different locales.
This model of programming is characterized by a very important property: separation of algorithm from implementation. You separately describe implementation details, such as the distribution of your data structure between threads and machines; but the actual calculation is coded as if it were local and performed over monolithic data structures. That’s a tremendous simplification and a clear productivity gain.
Task-Driven Parallelism
The processing of very large (potentially multi-dimensional) arrays is very useful, especially in scientific modeling and graphics. But there are also many opportunities for parallelism in control-driven programs. All three HPCS languages chose fine-grained tasks (activities in X10, implicit threads in Fortress), not threads, as their unit of parallel execution. A task may be mapped to a thread by the runtime system but, in general, this is not a strict requirement. Bunches of small tasks may be bundled for execution within a single system thread. Fortress went the furthest in making parallelism implicit–even the for loop is parallel by default.
From the programmer’s perspective, task-driven parallelism doesn’t expose threads (there is no need for a fork statement or other ways of spawning threads). You simply start a potentially parallel computation. In Fortress you use a for loop or put separate computations in a tuple (tuples are evaluated in parallel by default). In Chapel, you use the forall statement for loops or begin to start a task. In X10 you use async to mark parallelizable code.
What that means is that you don’t have to worry about how many threads to spawn, how to manage a thread pool, or how to balance the load. The system will spawn the threads for you, depending on the number of available cores, and it will distribute tasks between them and take care of load balancing, etc. In many cases it will do better than a hand-crafted thread-based solution. And in all cases the code will be simpler and easier to maintain.
Global View vs. Fragmented View
If you were to implement parallel computation using traditional methods, for instance MPI (Message Passing Interface), instead of allocating a single array you’d allocate multiple chunks. Instead of writing an algorithm to operate on this array, you’d write an algorithm that operates on chunks, with a lot of code managing boundary cases and communication. Similarly, to parallelize a loop you’d have to partially unroll it and, again, take care of such details as the uneven tail, etc. These approaches result in a fragmented view of the problem.
What HPCS languages offer is global-view programming. You write your program in terms of data structures and control flows that are not chopped up into pieces according to where they will be executed. The global-view approach results in clearer programs that are easier to write and maintain.
Synchronization
No Synchronization?
A lot of data-driven algorithms don’t require much synchronization. Many scientific simulations use read-only arrays for input, and make only localized writes to output arrays.
Consider for instance how you’d implement the famous Game of Life. You’d probably use a read-only array for the previous snapshot and a writable array for the currently evaluated state. Both arrays would be partitioned in the same way between locales. The main loop would go over all array elements and concurrently calculate each one’s new state based on the previous state of its nearest neighbors. Notice that while the neighbors are sometimes read concurrently by multiple threads, the output is always stored once. The only synchronization needed is a thread barrier at the end of each cycle.
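A double-buffered Game of Life step can be sketched in a few lines of Python (on a toroidal grid, for simplicity). Note that the new grid is built purely from reads of the old one, so each output cell could be computed by a different task with no locking at all:

```python
def life_step(grid):
    # Double-buffered update: read only the previous snapshot,
    # write each output cell exactly once.
    rows, cols = len(grid), len(grid[0])

    def neighbors(i, j):
        # Count live neighbors with wrap-around (toroidal) edges.
        return sum(grid[(i + di) % rows][(j + dj) % cols]
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0))

    return [[1 if (n := neighbors(i, j)) == 3 or (n == 2 and grid[i][j])
             else 0
             for j in range(cols)]
            for i in range(rows)]

# A "blinker" oscillates with period 2.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]
```

In a PGAS language the two grids would be distributed across locales with matching domain maps, and the same per-cell expression would run in a forall over the mapped domain, with a barrier between generations.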
The current approach to synchronization in HPCS languages is the biggest disappointment to me. Data races are still possible and, since parallelism is often implicit, harder to spot.
The biggest positive surprise was that all three endorsed transactional memory, at least syntactically, through the use of atomic statements. They didn’t solve the subtleties of isolation, so safety is not guaranteed (if you promise not to access the same data outside of atomic transactions, the transactions are isolated from each other, but that’s all).
The combination of STM and PGAS in Chapel necessitates the use of distributed STM, an area of active research (see, for instance, Software Transactional Memory for Large Scale Clusters).
In Chapel, you not only have access to atomic statements (which are still in the implementation phase) and barriers, but also to low-level synchronization primitives such as sync and single variables–somewhat similar to locks and condition variables. The reasoning is that Chapel wants to provide multi-resolution concurrency support. The low-level primitives let you implement concurrency in the traditional style, which might come in handy, for instance, in MPMD situations. The high-level primitives enable global-view programming that boosts programmer productivity.
However, no matter what synchronization mechanisms are used (including STM), if the language doesn’t enforce their use, programmers end up with data races–the bane of concurrent programming. The time spent debugging racy programs may significantly cut into, or even nullify, potential productivity gains. Fortress is the only language that attempted to keep track of which data is shared (and, therefore, requires synchronization) and which is local. None of the HPCS languages tried to tie sharing and synchronization to the type system in the way it is done, for instance, in the D programming language (see also my posts about race-free multithreading).
Conclusions
Here’s the tongue-in-cheek summary of the trends which, if you believe that the HPCS effort provides a glimpse of the future, will soon be entering the mainstream:
- Threads are out (demoted to latency controlling status), tasks (and semi-implicit parallelism) are in.
- Message passing is out (demoted to implementation detail), shared address space is in.
- Locks are out (demoted to low-level status), transactional memory is in.
I think we’ve been seeing the twilight of thread-based parallelism for some time (see my previous post on Parallel Programming with Hints). It’s just not the way to fully exploit hardware concurrency. Traditionally, if you wanted to increase the performance of your program on multicore machines, you had to go into the low-level business of managing thread pools, splitting your work between processors, etc. This is now officially considered the assembly language of concurrency and has no place in high-level programming languages.
Message passing’s major flaw is the inversion of control–it is the moral equivalent of goto in unstructured programming (it’s about time somebody said that message passing is considered harmful). MP still has its applications and, used in moderation, can be quite handy; but PGAS offers a much more straightforward programming model–its essence being the separation of implementation from algorithm. The Platonic ideal would be for the language to figure out the best parallel implementation for a particular algorithm. Since this is still a dream, the next best thing is to get rid of the interleaving of the two in the same piece of code.
Software transactional memory has been around for more than a decade now and, despite some negative experiences, is still alive and kicking. It’s by far the best paradigm for synchronizing shared memory access. It’s unobtrusive, deadlock-free, and scalable. If we could only get better performance out of it, it would be ideal. The hope though is in moving some of the burden of TM to hardware.
Sadly, the ease of writing parallel programs using those new paradigms does not go hand in hand with improving program correctness. My worry is that chasing concurrency bugs will eat into productivity gains.
What are the chances that we’ll be writing programs for desktops in Chapel, X10, or Fortress? Probably slim. Chances are good, though, that the paradigms used in those languages will keep showing up in existing and future languages. You may have already seen task-driven libraries sneaking into the mainstream (e.g., the .NET TPL). There is a PGAS extension of C called UPC (Unified Parallel C), and there are dialects of several languages (Java, C, C#, Scala, etc.) that experiment with STM. Expect more in the future.
Acknowledgments
Special thanks go to Brad Chamberlain of the Cray Chapel team for his help and comments. As usual, a heated exchange with the Seattle Gang, especially Andrei Alexandrescu, led to numerous improvements in this post.
Bibliography
- DARPA’s High Productivity Computing Systems
- Cray Chapel
- IBM X10
- Sun Fortress
- Message Passing Interface specification
- Unified Parallel C, PGAS in C
- Microsoft Task Parallel Library
- Intel Thread Building Blocks
- HPC Programmer Productivity: A Case Study of Novice HPC Programmers
- Software Transactional Memory for Large Scale Clusters
- Scalable Software Transactional Memory for Global Address Space Architectures
- A Brief Retrospective on Transactional Memory
- Designing an Effective Hybrid Transactional Memory System
- Concurrency in the D Programming Language
August 2, 2010 at 2:18 pm
thanks. very interesting post. what are your thoughts on the concurrency in opencl?
August 2, 2010 at 2:55 pm
To Per Arneng: I checked it out on Wikipedia. It’s a good idea but C is a very low-level language. It would be interesting to compare it with CUDA.
August 2, 2010 at 2:57 pm
While at university I looked into STM on the cell microprocessor, with very interesting results. I found that I could achieve similar performance with atomic transactions in an hour of work, as I could with a fine grained lock solution in many hours of instrumentation and testing.
I firmly believe locking as a paradigm has seen its last days!
August 2, 2010 at 3:53 pm
Excellent, well-researched post. Thank you!
One thing left out: As hinted at in other comments, SIMD (or SIMT) programming styles are one area that I think will be important. CUDA-C can’t be the last word there. I hope.
August 2, 2010 at 4:04 pm
Good article, but I believe Cliff Click poses quite an intelligent argument that STM has as many issues as locks; the challenge with both is: at which granularity do you wrap your atomic (or lock) keyword?
You can enforce just as much safety with locks as with STM by wrapping everything in a synchronized block (or having a global mutex); locking only gets hard when you need performance. If you aren’t fine-grained enough, you have contention issues; too fine-grained, and you have race conditions.
Same with STM in the sense that you wrap too much with your Atomic blocks, and suddenly you have live-locks/infinite retries…
At least, that’s my humble paraphrasing of his argument (that convinced me).
http://www.azulsystems.com/blog/cliff-click/2008-05-27-clojure-stms-vs-locks
August 2, 2010 at 4:13 pm
Greg: You are right. I didn’t mention it but the Chapel team is working on supporting SIMD processors (they will be hiring a specialist on GPU and other exotic architectures–anyone interested?).
August 2, 2010 at 4:18 pm
Excellent post, just sad to see that we’re going through yet-another-round of high-level languages rather than integrating parallelism at a more fundamental level.
Those of us working on more traditional operating systems still have to work around the fact that parallelism is at odds with the o/s’es concept of multi-tasking and therefore software had to converse with software, in a cacophony of machine instructions, in order to go parallel.
OpenMP and Intel’s TBB, etc, often recommend that your workload comprise at least a few thousand machine instructions per worker before you begin using their constructs.
Which invariably means consuming additional machine instructions to ask the question, “Do I fit through this door?”
So I would disagree with the conclusion that parallelism has hit prime time. It’s on cable and everyone is starting to watch it so they can discuss at the water cooler, but the skits are often not funny or downright bad.
August 2, 2010 at 4:39 pm
To kfsone: I discussed the problem of granularity. You’re right, being forced to fit your program into thread-sized chunks, fitting them through the door, imposes too much burden on the programmer. Hence task-based parallelism. Tasks have arbitrary granularity. The runtime fits them through the door by bunching them together into fewer threads if needed.
August 2, 2010 at 5:54 pm
Chapel doing SIMD/SIMT? Interesting. I don’t know Chapel, but PGAS doesn’t seem to fit the style Nvidia proposes, anyway.
And before I forget again: There are also all the issues I brought up a while ago in my blog, about the fact that there are already *lots* of parallel programming languages, and very little use of any of them:
http://perilsofparallel.blogspot.com/2008/09/101-parallel-languages-part-1.html
(or if that messes up with line breaks, http://bit.ly/a0xwUR )
Greg
August 2, 2010 at 8:48 pm
To Vincent: I followed your link. I must say I agree with Rich Hickey in that discussion. Conceptually, STM works as if there were a single global lock but it gives you performance that is not that far from that of fine-grained locking solutions. A lot of people would gladly sacrifice some performance for safety.
August 2, 2010 at 8:59 pm
Very thoughtful article!
I wanted to introduce you to a new programming paradigm called Cubicon. The website will provide context and I can send you a more focused paper that addresses parallel programming in Cubicon.
Sandy
August 2, 2010 at 10:20 pm
Message passing doesn’t have to mean spaghetti code. If your language supports tasks or coroutines, you can appear to have linear control flow, while keeping good performance by never actually blocking the thread–you switch to another task or coroutine while waiting for the message to arrive.
August 2, 2010 at 11:08 pm
There is parallel computing, and then there is distributed computing. To quote from another site: “parallel processing is a subset of distributed processing. The difference between the two occurs when you look at how and when the processing occurs. In other words, with parallel, the processing is done in parallel. Distributed processing can take place in parallel, but may not be in parallel, therefore it has a much greater scope”.
All of these enhancements are for the parallel world, but the majority of large-scale complex systems (outside the modelling world) are in the distributed world. For the distributed world, message-based threads are perfect. A thread can operate at 100% of core processing if there are enough messages on the queue, and message-based threading has no lock contention at all as resources are “thread-contained”. Furthermore, message-based threading is more scalable because of the loosely-coupled architecture. Additionally, because of the coarse-grained nature of a message-based threading model organised around “services” rather than algorithms, it is actually a lot easier to understand, design, and implement, and requires no additional language features if the framework is already there.
Granted, there is absolutely no new technology required, but let’s not start with a solution and look for a problem.
August 2, 2010 at 11:30 pm
I’d also like to add that with a well-designed message-based threading framework the code is actually simpler and a developer does not even need to understand threads at all. I have spent more than 2 years development time (over the last 6-7 years) on building a distributed framework that allows me to not just split functionality across threads, but also across processes with the same thread-messaging interface. Additionally, by moving the connections between the threaded “components” from code to configuration files I have broken any system down into a set of “Lego building blocks” that can be assembled in different configurations for different systems if so desired. Thread pooling of any stateless thread is achieved by altering a thread count configuration variable–no code changes are necessary.
Perhaps the reason that some note the limitations and complexity of message-based threading is because they have never seen a framework which includes these features in the framework rather than expecting the application developer to handle them. And the really nice thing about this is the concept is not technology dependent. I have implementations in Delphi and C#, but in reality any language/OS that can access threads in code could be used to build an identical system. While I’m sure that parallel language features have their place, IMHO the biggest impediment to building efficient, manageable, large-scale multi-threaded systems is our own inability to think of and build solutions over time, not the limits of technology. I don’t think in terms of language constructs, or patterns, I think in terms of high-level architectural concepts that might help resolve real problems, and then I wonder how to implement them.
August 3, 2010 at 3:40 am
The use of multi-core concurrency won’t become mainstream until it is implemented in a completely invisible way – so perhaps it may never happen.
August 3, 2010 at 9:32 am
What about Google’s Go language? How does it fit into this discussion?
August 3, 2010 at 2:34 pm
Dear Bartosz, I thought you might enjoy seeing the example message-send and receive code in the paper on Interprocess Communication at http://www.civilized.com/files/msgfuncs.txt (This is a plain ascii file.)
August 3, 2010 at 3:04 pm
To Tim Maxwell: From what I know, coroutines don’t actually run in parallel, so they are outside of the scope of my post.
August 3, 2010 at 3:14 pm
To Greg Pfister: I read your blog. I think that a language that has a DARPA seal of approval has a better chance of being treated seriously by the HPC community. My main point though was that the ideas developed for HPC may influence the mainstream programming community.
August 3, 2010 at 3:27 pm
Well, several of the 101 had DARPA support. Possibly the ones that came out of the DARPA HPCS program (the three you cite above) will have follow-up support to build their ecosystems to a point where they thrive.
DARPA seems to have moved on to Exascale, though. Perhaps they’ll find support there, or adapt to the issues there (like fault-tolerance, ungodly levels of parallelism).
August 3, 2010 at 4:58 pm
To Greg: The HPCS program’s goal is to create a high productivity system. They don’t mention languages–it just so happened that all three companies decided that they needed a new language as part of the system. So a good development environment and infrastructure are as important as a well-designed language. I just concentrated on languages because that’s my main interest.
You make a very good point about fault-tolerance. Should fault-tolerance be part of the language, runtime, or the underlying distributed system?
August 3, 2010 at 5:06 pm
[…] Beyond Locks and Messages: The Future of Concurrent Programming Concurrent programming has been around for quite some time (almost half a century), but it was mostly accessible to the […] […]
August 3, 2010 at 5:28 pm
Bartosz, your field might be languages, my field is building systems for corporates (e.g. information-related systems, alarm management systems, etc). I would argue, as in the above comment, that fault tolerance (and other features) should be built into the distributed system for two main reasons: 1) you have to cater for failure at the largest level (machines dying, networks going down, etc), and no language feature can ever achieve this, and 2) by building it into the distributed system you are essentially making it technology agnostic, i.e. you are not relying on the latest technology to make it work.
We really don’t make the most of the technology we have before focusing on the next best thing. The solutions are already available, just waiting for the right dedicated people to deliver them. As an individual I have managed to put together a solution that has reduced all of my multi-threaded distributed application development issues down to the “manufacturing” of single-threaded message-based components that can be plugged together via configuration files (and these can be connected across processes, e.g. using the internet as a comms backbone, in exactly the same way they can be connected across threads). If I can do this as an individual with 2-3 years of effort, imagine what could be achieved by a dedicated team of quality software engineers!
This might not address the issue of parallelism as it occurs in mathematical modelling, however, I wonder how much of the software landscape this comprises in comparison to distributed information systems, which are used by every corporation and individual on the planet.
August 3, 2010 at 8:51 pm
Coroutines aren’t concurrent, but they solve the inversion-of-control issues associated with message passing. Each thing that your program needs to be doing concurrently runs in a different coroutine. When you need to perform IO or wait for a message from some other thread, you transfer control from your coroutine to the main event loop. When the main event loop detects that your IO is done or a message has arrived, it passes control back to the coroutine that was waiting for it.
If your language supports tasks (“implicit threads”, “activities”, or whatever else you want to call them), then you don’t need to use coroutines because the language’s runtime is doing the same thing under the surface.
There are a number of different ways to map coroutines to threads. For example:
You can use a thread to manage a shared resource by having it act as a “server” for the other threads; each thread passes a message to the server thread when it needs the resource, and the server thread starts a coroutine to handle the job. When the job is done, the server thread passes a message back with the reply. The coroutine that sent the request ‘sleeps’ until the reply comes back, and meanwhile other coroutines on the requesting thread can continue to work.
A server application that communicated with clients over a network, such as an IRC server, could start as many threads as there were cores and then distribute incoming connections evenly among the threads. Whenever the main thread received a new connection, it would pick a thread randomly and then send a message with the connection to the chosen thread. The thread would start a new coroutine for each incoming connection, and pass control to the coroutine whenever a request came in from the client or a message was ready to be sent out to the client.
August 4, 2010 at 5:01 am
It is quite clear that the major hurdle in designing any non-trivial system of concurrency, whether shared memory, message passing, multicore or distributed is that of visualising the concurrency. The only solution that will address this problem is a graphical language. Once the concurrency has been designed then all the arguments about implementation become just that – implementation details. May I suggest this series of articles as good reading?
http://www.drdobbs.com/212903146
August 4, 2010 at 8:04 am
Adrian, an interesting series of articles, and I’d have to agree because I have a framework that implements 80-90% of what was described. In essence you need to reduce any complex system down to a set of discrete application components, and then “plug” them together to build a full working system. In my framework each application component runs in its own thread, so if you put a system together it is inherently multi-threaded.
There are a couple of things I do differently from the Blueprint system that potentially make it even more useful for real-world applications (or at least the ones I have built):
1) Choose a coarse-grained breakdown of the system into components, more towards a service level than an algorithmic level. This is easier to achieve and tends to give better performance as the message passing/routing overhead is a smaller fraction of the message processing effort.
2) Make all interfaces of all components identical. In essence I have an interface to every component that consists of two methods: one just asynchronously adds a message to the destination component’s queue, the other synchronously adds a message to the destination component’s queue and waits for a message in response. In this way every component can be connected to every other component. The communication channels have the same interface too, so whether the message is processed locally by a thread or remotely by another process makes no difference.
3) Explicitly move all component configuration and connectivity into configuration files rather than code. In my framework these are represented as XML files but a graphical representation of these files is easily doable and would be far more useful to the system architect.
The end result is a set of pre-fabricated application components that run as threads handling inputs and generating outputs, without knowing where the inputs come from, or where the outputs are going. The system is then implemented by configuring the type and number of components, their interconnections, and how they are “housed” in processes and where these processes are located. To develop and implement this system requires no knowledge of concurrency, threading, or synchronisation primitives, because the framework manages all this at a lower level. The developer is free to develop single-threaded code as per normal, but when integrated with other single-threaded components the result is a fully multi-threaded distributed system, without the aggravation usually associated with such development.
August 4, 2010 at 11:32 am
To Adrian and Misha: It’s a very good point. Message-passing systems are best described visually in 2D or even 3D (or more?). There is no easy way to linearize them into 1D source files, and that’s what we currently have to do. Hence my analogy with goto. It’s no accident that we call programs with gotos “spaghetti code”. Spaghetti is 3D. If you try to linearize it (laying strands end-to-end), you get a lot of discontinuities. Structured programs without gotos are easier to linearize. For instance, the if/else construct is a linearization of a fork. You first list one branch of the fork, then the second branch after “else”. Now imagine connecting each message send with the corresponding message receive (and the appropriate handler) with a strand of spaghetti. You get the picture.
It’s hard to predict, though, whether visual programming will ever catch on. It comes with its own problems.
August 4, 2010 at 3:56 pm
Reading Misha’s last explanation of what happens in his framework tends to send me to a mental image of neurons. This would be great, just because it was nature’s way of getting it done and we have broken the smallest elements in terms of functionality.
What still troubles us is the fact that when we set the “magnification” to something more “macro” we lose the plot.
Having a neuron-like framework looks good in theory, since mimicking nature is one good route, but our minds are still quite unable to get the abstraction of so many things happening at once. Or the fact that the combinations of such a design are very high and cannot be explored in a lifetime.
I’ve probably ranted a tad, but my point is: we have understood the neuron quite easily, but we still struggle to figure out a larger number of neurons together…
August 4, 2010 at 9:53 pm
Gustavo, I agree with the neuron analogy to a point, but this also illustrates why I go with a higher-level approach. Too many moving components does my head in so, rather than try to combine hundreds if not thousands of smaller level components, which is way too hard for my brain, I try to combine 10-20 higher level components, which I can handle. A rough guide to the granularity of components that I use is that each component takes anywhere from around a day to a few weeks to design, implement, and test. Components tend to have a dedicated purpose, such as encapsulating file access via a few read, write, delete, and list primitives, or clock synchronisation, or application status management, etc. Thinking in terms of high-level functionality is much easier because the components tend to model real-world functionality rather than abstract algorithms.
August 5, 2010 at 9:59 am
Again, how does Google’s Go language fit into this discussion?
August 5, 2010 at 11:17 am
To Devon: Go uses standard CSP (Communicating Sequential Processes) with channels. That’s a variation of message passing.
Concurrency in Go is task-based (they call tasks goroutines, because they resemble coroutines) which, as I explained, allows finer granularity parallelism than thread-based concurrency.
AFAIK, there is no direct support for distributed systems in Go.
August 5, 2010 at 12:29 pm
Wow, do I feel old. This was new back in the 80s. Digital had coprocessing and messaging built into the OS as well as languages long before the first mention of threads in the Unix world. So it is interesting that all that is now coming back in vogue.
August 5, 2010 at 5:44 pm
QNX has inter-process communication via messaging built into the micro-kernel – it is one of the few core parts of the OS. Since it is built into the OS, IPC in QNX is probably as quick as inter-thread messaging in most other OSes. In fact, there is really no significant distinction between threads and processes in QNX.
August 6, 2010 at 11:29 am
To Misha: The Mach kernel is also message based. But there, message queues are just used as a barrier between user and kernel. The model is very simple: you call an API, it results in a message being posted to the kernel. Then, unless it’s an asynchronous API, you block waiting for the response.
August 9, 2010 at 5:07 pm
[…] Beyond Locks and Messages: The Future of Concurrent Programming « Bartosz Milewski’s Programm… (tags: concurrency programming messaging languages) […]
August 10, 2010 at 11:13 pm
Adrian: I agree with all you say (as you know) and would also agree that Visual Programming Languages (VPLs) may have had a bad press in the past because they have been used for tasks that they don’t do well (used for the sake-of-it in many cases).
I’d argue that humans are extremely adept at managing the concurrent world that they live in, and a game of football is a good example. It’s effortless to ‘see’ what’s happening to the ball, the 22 players and the referee, but a pain to listen to on the radio, or read about in a paper.
I’m told (but have no citation to hand), that the visual cortex and the linguistic centres in the brain are very different things, and that the latter is essentially sequential (only my wife can have more than one conversation at a time).
So when we consider branching on data values, it’s OK to reason about it using ‘if-then-else’, ‘while’ and so on. It’s much harder to reason about temporal control flow in the same way, and so the logical branching necessary to ensure dependency and arbitration is a very different ‘kettle of fish’.
The Blueprint project aims to present humans with two entirely separable views of the world. Scheduling logic is represented using visual branching and merging semantics (a ‘connectivity’ view of events), whilst the algorithmic logic remains sequential (plain vanilla C++ at this time).
This is analogous to electronic engineering where circuits with pulses racing around them can be ‘drawn’ (and be understood by most humans with relatively little effort) whilst the component logic is probably best considered textually using something like VHDL. The important point is that they are separable concerns (scheduling and calculation) and benefit from specific treatments.
I also agree with the findings of the research link, and in my experience message passing schemes are human intensive (and if they gratuitously move data, aren’t ideal for modern chips).
Our VPL has no concept of a message as such, and so you can think about it as events travelling around a single process (and execute it as such by the way). Like the languages discussed above, if accreted to a distributed memory platform (like a network of machines) data is actually moved behind the scenes, but only when two components find themselves accreted to separate memory spaces at runtime. All ‘connections’ are prototyped, so you can make sure that data types and dimensionality are consistent at compile time.
Thanks for sending me this link Adrian, this is a very interesting discussion.
August 11, 2010 at 3:41 am
Nice survey of the parallel programming landscape.
Allow me to add to your list the Ateji PX language that we just released a month ago. It features data and task parallelism plus message passing, while remaining compatible with existing code and development tools.
The whitepaper is here: http://www.ateji.com/px/whitepapers/Ateji%20PX%20for%20Java%20v1.0.pdf
Applications based on message passing do not need to be messy. Just like sequential languages provide higher abstractions than goto, MP-based abstractions such as the Actor model or MapReduce lead to clean, understandable and maintainable code.
Look also at Erlang (not mentioned in your post) for successful examples of large critical systems based on message passing.
September 27, 2010 at 12:04 pm
[…] my previous post I gave a preview of how concurrent programming might evolve in the future, and the topic of […]
November 3, 2010 at 11:15 am
[…] MapReduce has competitors, one of them being PGAS (Partitioned Global Address Space; see my post on The Future of Concurrent Programming). Let’s briefly compare the two […]
December 26, 2010 at 7:34 pm
Bartosz, what are your recommendations (books) for learning more about concurrency?
I am trying to use concurrency in D, but I am not comfortable with my understanding of the concepts.
December 26, 2010 at 7:35 pm
Also, let me know when you plan to complete the book you are writing on multithreading.
December 27, 2010 at 12:54 pm
Did you notice how concurrency is usually relegated to the last chapter of a programming book? In Andrei’s “The D Programming Language” it’s chapter 13. That’s because concurrency is hard. There are a few hits on Google for “concurrency for dummies”, but I wouldn’t get excited about them.
As for “The Book,” has Andrei leaked the news? I haven’t told anybody else about it 😉 Mostly because I haven’t even started, and the more I think about it the scarier it looks.
December 27, 2010 at 7:17 pm
Thanks for the update.
I do have the TDPL book (and I am still on chapter 8 of the book, so no concurrency yet). Also I wanted to learn concurrency from a more generic perspective (not just D specific). I am a newbie in D, but have been using C++ for a while.
As far as information regarding your writing a book is concerned, I saw it on your twitter profile. Now that the cat is out, you might want to accelerate the process. 🙂
January 2, 2011 at 12:25 pm
[…] The busiest day of the year was August 2nd with 10,171 views. The most popular post that day was Beyond Locks and Messages: The Future of Concurrent Programming. […]
January 11, 2011 at 7:53 pm
I have to disagree with the “message passing out, STM in” conclusion. Programming on multicore/manycore at high core-counts requires deterministic code descriptions that can be used with formal methods and smart compilers, and computing across multiple platforms (cloud to mobile) excludes using shared memory.
Developing highly parallel code is pretty much the same exercise as designing parallel hardware. An FSM based methodology with asynchronous communication is probably the best strategy.
[more on my website]
January 12, 2011 at 3:22 pm
@Kevin Cameron: I have to disagree with: “computing across multiple platforms (cloud to mobile) excludes using shared memory”. That’s exactly what PGAS is about–shared memory in a distributed system. There are formal methods describing STM and it’s been included in Haskell, which has very high standards as far as theoretical models of computation go.
January 12, 2011 at 4:37 pm
Shared memory is a bad abstraction in itself, that leads to all sorts of bad bugs (with pointers etc), building another abstraction on top like STM is just perverse. Stretching it across a distributed environment (virtually shared) is just asking for more trouble.
It’s fine to use those things in an implementation layer but not at the programming level.
Shared memory does not work well as a paradigm for programming FPGAs or GP-GPUs, Message passing works fine for those.
January 20, 2011 at 4:32 am
Hi Bartosz, That was a very good and detailed post. I am going through it still. In http://www.cs.brown.edu/ipp/symposia/ipp41/ibm.pdf they talk about STM interface and X10 extensions. Has STM been really implemented in X10?
January 20, 2011 at 5:44 pm
From what I’ve heard, the X10 atomic constructs, despite what the spec suggests, are not really STM. It seems like IBM researchers have some reservations wrt STM. Have a look at this paper: http://queue.acm.org/detail.cfm?id=1454466.
February 4, 2011 at 8:44 pm
This is a good place for STM and your video is also adding to it. Thank you Bartosz.
May 22, 2011 at 7:32 am
[…] · Milewsk, Bartosz; “Beyond Locks and Messages: The Future of Concurrent Programming”, https://bartoszmilewski.wordpress.com/2010/08/02/beyond-locks-and-messages-the-future-of-concurrent-p… […]
June 1, 2011 at 9:25 am
[…] like Cray’s Chapel provide clean ways of separating algorithms from data partitioning (see my blog about HPCS). I talked to several speakers at HotPar, and surprisingly few of them were aware of the Chapel […]
June 14, 2011 at 9:28 pm
[…] another bet: shared memory. (I can’t help but to point to one of my earlier blog posts about the future of concurrent programming, in which I made a similar bet). The idea is that all cores, the CPUs and the GPUs will share the […]
October 3, 2011 at 11:15 am
[…] that std::async was a gateway to task-based parallelism. I blogged about task-based concurrency in The Future of Concurrent Programming and, in the context of Haskell, in Parallel Programming with Hints. And of course there is the […]
November 7, 2011 at 10:31 am
[…] of today is the mainstream computing of tomorrow. A year and a half ago I wrote a blog about the future of concurrent programming based on new developments in systems and languages in the HPC (High-Performance Computing) […]
April 27, 2012 at 1:12 pm
[…] Beyond Locks and Messages: The Future of Concurrent Programming https://bartoszmilewski.com/2010/08/02/beyond-locks-and-messages-the-future-of-concurrent-programming… […]
July 25, 2012 at 9:28 am
Fortress is dead
November 12, 2012 at 5:47 am
The main drawback is that you have to retry failed transactions (and they could, of course, fail repeatedly). There is also some significant overhead involved with the transaction system itself, as well as a need for additional memory to store the data you’re trying to write until it’s decided which process will succeed. Ideally, there should be hardware support for transactional memory just as there typically is support for virtual memory.
The STM approach seems more manageable to programmers than the use of locks, and it may be a good way to take advantage of concurrency as long as transactions don’t have to be restarted too often due to contention. We still consider this approach to be at its core a variant of shared memory with locks, and one that may be more help on an operating system level than on an application programming level; but it’s currently a lively research topic, and things may turn out differently.
February 24, 2015 at 3:14 am
Fortress is not dead! It was part of an ongoing evolution.