Multithreading



Hans Boehm gave a keynote address about C++11’s support for concurrency. It was a nice overview of the major features and, of course, the most interesting topic, atomics and weak atomics. The official story is that if you use locks and strong atomics, you get the DRF guarantee: if the program has no data races, it will behave in a sequentially consistent manner. How do you prove that you have no data races? You enumerate all possible interleavings, and if you can’t find one where two conflicting memory accesses happen next to each other, you’re golden. That’s more or less what the Java memory model guarantees (and what POSIX tried to standardize). However, C++ offers the programmer a way to relax the sequential-consistency constraints without introducing data races. Now, if you spin it this way, it sounds like a really cool thing. Hey, look, my program is data-race free! And, get this, I don’t have to suffer sequential consistency! The natural question is: what does it buy me that the C++ Standard doesn’t treat “memory_order_relaxed” accesses as data races? I would like to hear that programs with weak atomics have well-defined semantics, even if the semantics are so complex that proofs of correctness of even the simplest algorithms are non-existent. But as far as I know this is not “really” true (maybe “sort of” true?). I tried to get straight answers from Hans, but he chooses his words very carefully, like a UN diplomat. I’ll see him again at HotPar and I’ll press him some more.

Hans’s talk was followed by Tony Van Eerd’s presentation on lock-free programming. I liked Tony’s attitude, which was “Use Locks!” Indeed, you should look at lock-free algorithms as a last resort. He showed a few examples that were hair-raising. Even the simplest lock-free linked list is a challenge. It’s really hard to spot the danger areas, like the ABA problem, where the node you’re pointing at gets deallocated and reallocated while you’re not looking. Your CAS succeeds, because the addresses match, but your update ends up in the great bucket in the sky. The lock-free circular queue of integers, with only one thread pushing and one thread popping, turned out to be a minefield. Tony claimed that it should work with weak, relaxed-memory-order atomics. But, of course, no formal proof is on the horizon. I stared at the code for minutes and it sort of made sense to me, but who knows? Hans stared at it some more and tentatively murmured that it’s probably okay. The bottom line: this is some really scary stuff.

Then I spent half a day with Hartmut and Joel: me trying to understand Proto and them trying to understand monads. I think we’ve learned a lot from each other, and the new formulation of Proto using monads is getting closer and closer. We have sort of nailed the definition of a monadic “function” in C++. I think we should call these things “hybrid” monads, because they blend compile-time and runtime aspects of C++. Fascinating stuff!


Aspen, Monday, May 15: These are just quick notes, and I apologize if they are not very coherent. I have little time to jot them down because the day was filled with talks and discussions. There are two main topics people are excited about: concurrency and template metaprogramming.

In the morning I went to Christopher Kohlhoff’s talk about Boost.Asio. Asio stands for Asynchronous IO. The problem is how to structure your program to deal with a lot of asynchronous calls. There are two main difficulties: resource management and inversion of control. An async-driven program is essentially a collection of disjoint callbacks. When control jumps from one callback to another, you can easily lose track of who is the owner of which resources, how to get hold of them, and how to reason about the whole system. Chris showed how to deal with resources, essentially by using shared, reference-counted pointers. When you’re making an asynchronous call, you usually have to make sure that the completion handler shares some resources with the caller. In Chris’s approach these resources are encapsulated in a Connection object, and the completion handlers are methods of that object. So when a handler is called, it has access to the “this” pointer and can share data with the caller. A handler must take part in the reference counting of the Connection object, so that the object doesn’t suddenly disappear. Considering that a lot of event-driven code I’ve seen over the years used global shared data to store state, this is progress. The inversion of control is still a problem, though.

Hartmut Kaiser talked about the Phoenix library, which essentially implements C++ inside C++. It is built on top of Proto (which is a big hit at the conference–more about it later). From what I gathered, Phoenix is mostly a better lambda. You can write anonymous functions in Phoenix using a very stylized C++: for instance, instead of braces you use brackets, instead of semicolons, commas, etc. One advantage of Phoenix functions is that they can be statically polymorphic. Unfortunately the main example didn’t require polymorphism, and in fact would be much easier on the eyes if it were written using C++ lambdas. The other, more important advantage of Phoenix is that it’s highly customizable. Hartmut showed an example of how to extend C++ syntax to support parallelism. Behind the scenes, Phoenix took advantage of OpenMP, a system of ugly pragmas, supported by many compilers, for creating parallel loops and other concurrent constructs.

And then there was a Proto marathon by Joel Falcou, after which I had a long discussion with him over dinner. Proto is a metaprogramming tour de force. If I described it as a library for constructing embedded domain-specific languages in C++, I wouldn’t do it justice. It’s a system that tricks the C++ compiler into parsing expressions into full-blown compile-time abstract syntax trees, which are at the same time function objects that can be executed at runtime. If this is not impressive enough, Proto provides multiple customization mechanisms that allow you to plug in new constructs, give them specific semantics, and even rewrite the ASTs. Joel gave an example of an EDSL for expressing analytical functions, which could analytically calculate derivatives of functions at compile time. Joel is coming to my talk tomorrow and I hope he will be able to explain to me what I’m doing. My talk is essentially about how to get to grips with Proto by using Haskell monads. We’ll see how it goes.


I am forking out my concurrency blogging to a new site, where I am actually paid to do it (who would have thought!). I promise to keep the same quality of posts as my readers came to expect from me. My first article there is about benign data races, an interesting and controversial topic.

I will still post programming-language and functional programming articles here. The next installment of the monad cycle is in the works. I’ll be talking about Haskell, C++ metaprogramming, and monads at this year’s Boostcon (May 15-20).


I’ve been planning on writing about Google’s MapReduce algorithm for some time, but I couldn’t find a good practical example. Then we had a Northwest C++ Users Group presentation by Steve Yegge, and a follow-up discussion and beers, and I had a little epiphany. Steve was talking about, among other things, the build process. And that’s just a bunch of algorithms that are perfect for explaining MapReduce.

MapReduce is usually introduced in the context of distributed systems. It’s a system that spreads processing among multiple machines, often organized in huge server farms. But it can also be used on a single machine, to work with processes; or even in a single process, with threads. The beauty of it is that, if you write a build engine that uses MapReduce, you may scale it from a few threads to thousands of servers. It’s like writing a program that, rather than reading individual disk sectors, uses a file system. Such a program will magically work on a USB stick as well as on a distributed file system.

The other hot trend, besides scalability, is data mining. Huge software projects are like the Internet. Developers spend a lot of time browsing source files. We need tools that are more than just search engines; we need smart tools that understand computer languages. And data mining fits very well into the MapReduce paradigm. There is a fan-out of small independent tasks operating on single files, followed by a fan-in, during which data is combined. A perfect example of such an algorithm is mapping inter-file dependencies (mediated by include statements), something every build does to minimize rebuilding after a change.

I organized this blog to follow the old parable of three blind men and an elephant. The elephant in this case is the build. The first man touches the compile-link part of the build process and exclaims: “It’s MapReduce.” He has a well-developed sense of scale. The second one notices that not all files are rebuilt after a change and shouts: “It’s Whole Program Analysis.” He has an excellent data-mining ear. The third man, with a sequential nose, observes that not everything can be done in parallel and cries: “It’s a Topological Sort.”

So what is a build?

It’s MapReduce

Obviously, you can build a project on a single machine by performing a sequence of actions. I choose to treat this as a trivial case of a distributed process. Let’s start with the simplest thing: You have a bunch of C++ files and you want to compile them and link the resulting object files into an executable.

The separate compilation model tells us that the compilation of any source file (a Translation Unit, if you wish) is independent of the compilation of any other source file. Therefore, the compilation can (and I’d say, should) be run in parallel. You will not only make more cores busy, but also overlap I/O-intensive parts of compilation.

So the build engine would fan out a number of compilation tasks and, when they are all finished, combine the outputs (object files) and pass them to the linker. Let’s call the first part “mapping” and the second part “reducing.” That’s your MapReduce in a nutshell.

To be more specific, MapReduce abstracts this process so it may be reused in many different contexts. It’s a framework into which the client plugs in the parts that define what may be done in parallel and how to combine the results.

Here are the general steps:

  1. The client specifies input data. In our example this would be a list of files to be rebuilt.
  2. MapReduce partitions this list among multiple worker threads, processes, or machines.
  3. Each worker executes client-provided code–a function called map. In our example, we’d write map to run the compiler on a single file to produce the object file. This function “maps” a source file into an object file.
  4. MapReduce then waits for all workers to finish (possibly re-distributing the work if a server crashes). This barrier is the main synchronization point in the algorithm.
  5. MapReduce reshuffles the results and, again, distributes them to different workers for the final phase of the algorithm.
  6. Each worker executes client-provided code–a function called reduce. In our example we’d write reduce to execute a linker on a list of object files to produce the executable.

For all this to work, the client must organize data in a particular form that is palatable to MapReduce. The input must be a list of (key, data) pairs.

Since keys and data may be of arbitrary type, why this separation? MapReduce needs keys to be able to partition the work. Each key uniquely identifies a task that may be done in parallel with other tasks. In our case the key would be a source file name (or, in general, a path). The keys are sorted by MapReduce and partitioned equitably among workers (often the client has the option to override the default partitioning algorithm or the compare function).
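To make this concrete, here is a minimal sketch of what a default partitioner might look like. This is purely my illustration (no particular MapReduce implementation works exactly this way): sort the keys, then deal them out to workers in contiguous, roughly equal ranges.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sort the keys with the default compare (the client could override it),
// then carve the sorted list into nWorkers contiguous buckets.
std::vector<std::vector<std::string>>
partition(std::vector<std::string> keys, std::size_t nWorkers)
{
    std::sort(keys.begin(), keys.end());
    std::vector<std::vector<std::string>> buckets(nWorkers);
    std::size_t chunk = (keys.size() + nWorkers - 1) / nWorkers;
    for (std::size_t i = 0; i < keys.size(); ++i)
        buckets[i / chunk].push_back(keys[i]);
    return buckets;
}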

The map function, written by the client, must conform to a particular (generic) interface. It takes a key and the associated data. In our case map would take a file name (key) and some data to be defined later. It would compile the file to an object. Notice that map is myopic by nature. It concerns itself only with a small fraction of the whole process. Because it doesn’t have to know about the big picture (here, the build), its implementation is relatively easy (here, it just calls the compiler).

The output of map has to conform to certain standards too, so it can be shuffled by MapReduce and passed to the second client-defined function, reduce. The map function must emit new pairs of (key, data). The new keys and the new data may be totally unrelated to the original ones, both in meaning and type.

Back to our example. To make reduce nontrivial, let’s shoot for a slightly more general scenario: we want to build multiple targets at once. In Visual Studio, for instance, building the solution might involve building several projects, each producing a separate executable, library, or DLL. We’d make map emit pairs (target name, object name). (The target name will have to be passed to map as input data.)

Here’s my simplified implementation of map (a more complete toy example follows at the end of this post):

void map(std::string const & file, std::string const & target)
{
    std::string cmd = "cl /c ";
    cmd += file;
    execute(cmd);
    // output new (key, value) pair
    emit(target, ObjFileName(file));
}

When all workers are done, MapReduce gathers the emitted pairs and sorts them by the key so that it can accumulate data for each key into a separate list. It then redistributes the results among workers. As before, the keys define units of concurrency. Here we assume that each target may be linked independently (this is not always true: see “It’s a Topological Sort”).

The second-phase workers execute the client-defined reduce. The first argument to reduce is the new key, and the second is a list of data for this key.

In our example MapReduce would combine the outputs of all maps to build a list of object files corresponding to each build target. And that’s exactly what the linker needs. Here’s my toy implementation of reduce:

void reduce(std::string const & target, 
    std::list<std::string> const & objFiles)
{
    std::string cmd = "link /out ";
    cmd += ExeNames[target];
    std::for_each(objFiles.begin(), objFiles.end(), 
        [&](std::string const & obj) {
            cmd += " ";
            cmd += obj;
        });
    execute(cmd);
}

(I explain my use of lambdas in the Appendix.)

Of course this is a very simplified picture of just one part of the build process. Still, I suspect many a build environment uses the same overall idea, and maybe some of them even use MapReduce to implement it.

There are many distributed build environments in use today. The difference is that they either put distribution on top of non-distributed builds or are written as one monolithic application. That restricts their reusability and leads to duplication of functionality. Here, on the other hand, distribution is a layer below the build engine. The MapReduce implementation, whether on a single machine or over a server farm, knows nothing about building software projects, so it doesn’t duplicate any functionality and is perfectly reusable. It’s a matter of whether the build engine uses, say, a message-passing API or the more abstract MapReduce API.

It’s Whole Program Analysis

The separate compilation model means that the compiler has access to only one source file at a time. The build environment, on the other hand, has access to the whole program. So if there is any need for whole program analysis, that’s the natural place to put it.

Figuring out dependencies between program files is an example of whole program analysis. The build needs it because knowing the dependency graph allows it to recompile only the minimum number of files after one of them (for instance, a header file) changes. There is a program, makedepend, that finds such dependencies by looking at the #include statements.

Inverted Index

Finding dependencies is a lot like creation of an inverted index — a classic example of how to use MapReduce. The map function scans a given document and emits a stream of pairs, (word, document). The reduce function gets a word as the key and a list of documents in which this word occurs (possibly with some positional information). The bulk of the work is done inside MapReduce by sorting the results of map by the word.

As an exercise, let’s see how a simple makedepend could be implemented using MapReduce. Suppose that we have a dedicated parser (or a smart preprocessor) that takes a source file and finds all the includes on which it depends. (In C++, the parser must do it recursively, going into includes of includes, etc. See item 7 in Walter Bright’s post.)

What the build process really needs is the answer to a different question: Which files must be recompiled if a given include file changes?

Here’s how we may answer this question using MapReduce. Our map function should take a source file name as key and run our parser over this file. (We also have to know what constants are predefined for a given build target, because of conditional compilation. We could pass them as data to map.)

What map would emit is a stream of pairs, (include file, current source file). Notice the inversion: For map, the name of the source file was the key; here the name of the include file is the key. Because of that trick, MapReduce will do the bulk of work for us. It will sort and group data by the new key. Thus the input to our reduce function will be the name of an include file and a list of all source files that depend on it. These are the files that will have to be recompiled if that header file changes.
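Using the same toy framework as in the build example, the whole thing fits in a few lines. This is just a sketch: findIncludes stands for the hypothetical recursive include scanner described above, and MustRecompile is a global map I introduce to hold the result.

StringList findIncludes(std::string const & srcFile); // hypothetical parser

// For each include file, the list of source files to recompile
std::map<std::string, StringList> MustRecompile;

void map(std::string const & srcFile, std::string const & predefines)
{
    // predefines would be passed to the parser in a real implementation
    StringList incs = findIncludes(srcFile);
    std::for_each(incs.begin(), incs.end(), 
        [&](std::string const & inc) {
            // Invert the pair: the include file becomes the key
            emit(inc, srcFile);
        });
}

void reduce(std::string const & incFile, StringList const & dependents)
{
    // Everything in dependents must be recompiled when incFile changes
    MustRecompile[incFile] = dependents;
}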

Here again, a higher level of abstraction–writing the dependency engine on top of MapReduce–provides better scalability and avoids code duplication. It also opens up new possibilities. Consider the building of a dependency graph as an example of data mining. In general, a build engine that works on top of MapReduce could easily dispatch all kinds of bots that extract local data from every file and combine them into a global index or database.

In particular, many IDEs attempt, with varying degrees of success, to build call graphs, definition/use graphs, inheritance graphs, etc.

Also, there are programs that can infer const-ness or, in Java, non-null-ness. In dynamic languages, type inference is very useful, and it requires whole-program analysis. Not to mention programs that can find potential data races or deadlocks. They all follow the same pattern and can easily fit into a MapReduce build engine.

It’s a Topological Sort

The build process must also deal with other types of dependencies: when the result of one operation is a prerequisite for another operation. If you’ve ever built a compiler using flex and bison, you know that before you can run flex you need to run bison to produce the appropriate include files that contain, among other things, definitions of tokens. These kinds of dependencies are usually explicitly written into makefiles, as in this example:

lang.tab.hpp lang.tab.cpp: lang.ypp
    bison lang.ypp
lex.yy.c: lang.lex lang.tab.hpp
    flex lang.lex

When you have results of some operations being prerequisites for other operations, you use the good old topological sort to organize your work. It so happens that topological sort may also be done using MapReduce (see, for instance, Ricky Ho’s blog).

I’ll explain it using a classic example: getting dressed in the morning. Our set consists of objects of clothing:

{jeans, lsock, rsock, lshoe, rshoe}

You can’t get dressed by putting on those objects in random order. Traditionally, you put on a sock before you put on a shoe. A sock must be on the list of prerequisites for a shoe.

In the preliminary stage of topological sort, we prepare a list of pairs, (object, prerequisite). In our case, the list consists of:

(lshoe, lsock)
(lshoe, jeans)
(rshoe, rsock)
(rshoe, jeans) 

Our map1 iterates over the list of prerequisites for a given object and emits inverted pairs: the prerequisite becomes the key, and the object becomes the data. In other words, each emitted pair is (object, dependent), where the key is an object of clothing and the data is one of the objects that depend on it.

void map1(std::string const & object, std::string const & prereq)
{
    emit(prereq, object);
}

MapReduce sorts and reshuffles those pairs and then calls reduce1.

void reduce1(std::string const & object, StringList const & dependents)
{
    TheDependentsOf[object] = dependents;
}

Now we have, for each object, both the original list of prerequisites and the newly created list of dependents. For the sake of this exposition, I’ll store these lists in two global maps:

// Maps an object name to a list of object names
typedef std::map<std::string, StringList> StringToList;

StringToList ThePrerequisitesOf;
StringToList TheDependentsOf;

In our case, the resulting dependents map will contain:

TheDependentsOf[jeans] = {lshoe, rshoe};
TheDependentsOf[lsock] = {lshoe};
TheDependentsOf[rsock] = {rshoe};

The second part of the algorithm is a loop in which MapReduce is called repeatedly until all objects are consumed (I mean, put on).

The map2 function takes an object of clothing as key and a pair of lists, its prerequisites and its dependents, as data. If the list of prerequisites is not empty, it does nothing–we can’t put on that object yet.

Otherwise it performs the action: “Put the object on.” In our case, jeans and socks have no prerequisites, so they are put on immediately. Once the object is put on, our map2 removes that object from the list of objects. It also iterates over the list of its dependents and emits inverted pairs, (dependent, current object). For instance, once the jeans are put on, two pairs are emitted:

(lshoe, jeans),
(rshoe, jeans)

because both shoes depended on the jeans.

Here’s the code:

// The data for map2: (prerequisites of the object, dependents of the object)
typedef std::pair<StringList, StringList> TwoLists;

// The global list of objects not yet put on
StringList TheObjects;

void map2(std::string const & object, TwoLists const & preqsAndDepds)
{
    if (preqsAndDepds.first.empty())
    {
        // This object has no prerequisites. Put it on.
        std::string cmd = "put on ";
        execute(cmd + object);
        // Remove it from the global list
        TheObjects.erase(
            std::find(TheObjects.begin(), TheObjects.end(), object));
        // For each dependent, emit (dependent, completed prerequisite)
        StringList const & depds = preqsAndDepds.second;
        std::for_each(depds.begin(), depds.end(),
            [&](std::string const & dep) {
                emit(dep, object);
            });
    }
}

The reduce2 function is then called with an object and the list of its completed prerequisites. In our case, the arguments will be:

lshoe, {jeans, lsock}
rshoe, {jeans, rsock}

The reduce2 function removes the completed prerequisites from the list of all prerequisites for a given object.

In our case, reduce2 will remove both the jeans and the left sock from ThePrerequisitesOf[lshoe].

It will also remove the jeans and the right sock from ThePrerequisitesOf[rshoe].

void reduce2(std::string const & object, StringList const & complPrereqs)
{
    std::for_each(complPrereqs.begin(), complPrereqs.end(),
        [&](std::string const & completed) {
            StringList & lst = ThePrerequisitesOf[object];
            lst.erase(std::find(lst.begin(), lst.end(), completed));
        });
}

As a result, the shoes end up with no prerequisites, so they will be put on in the next (and last) iteration of MapReduce.

Limitations of MapReduce

MapReduce is one particular approach to data-driven parallelism. In general it’s not a good fit for problems that exhibit task-driven parallelism. But even within the data-driven domain, MapReduce has competitors, one of them being PGAS (Partitioned Global Address Space; see my post on The Future of Concurrent Programming). Let’s briefly compare the two approaches.

MapReduce is closer to the functional-programming, message-passing, data-copying paradigm. PGAS is a generalization of the shared-memory paradigm: it creates the illusion of a global address space spanning multiple processes or machines. Consequently, MapReduce doesn’t require explicit synchronization, while PGAS often does. On the other hand, MapReduce incurs more data copying, even if it runs on threads of the same process (I’ll try to quantify this in my next post).

MapReduce requires a particular form of input–a set of (key, data) pairs. The distribution of work is encoded in the input and is driven by the choice of keys–each key defining a minimum unit of concurrency. PGAS works on shared data structures, typically multi-dimensional arrays. Distribution of work is abstracted from data and from the algorithm–it is defined by distribution maps over data structures (this is also explained in my previous post). A distribution may be modified without changing any other part of the algorithm. With MapReduce, that would require either a change of keys or a client-defined partitioning function.

A lot of MapReduce applications end up sharing data one way or another. Even my little example assumes that files, in particular header files, are accessible from every location. This is usually made possible through some kind of distributed file system. Google, the owner of the patent on MapReduce (I just hope they won’t “do evil” by suing other companies or individuals who use it or, gasp!, blog about it), has its own GFS (Google File System) that’s used in concert with MapReduce.

Finally, it’s not clear what fraction of problems lend themselves to MapReduce. Scientific simulations, for instance, are usually easier to describe in terms of large matrices, and therefore fit the PGAS mold more naturally. I plan to write more on this topic in my next post.

Appendix: Bob the Builder

Just for fun, below is a short C++ program I wrote to illustrate the use of MapReduce in the build process. Since MapReduce has its roots in functional programming, I felt free to use the C++0x lambdas at my convenience. For instance, in this snippet:

[&](Pair const & p) { targetFiles[p.first].push_back(p.second); }

I define an anonymous function that takes a Pair (by const reference) and adds its second component to the list targetFiles[p.first].

Notice that targetFiles is defined outside of the scope of the lambda–it is captured from the environment. A function that captures the environment is called a closure. Here I told the compiler to capture the whole environment, by reference, using the notation [&].

// A toy implementation of MapReduce. Sorry for using globals.

#include <iostream>
#include <string>
#include <list>
#include <map>
#include <algorithm>

std::list<std::pair<std::string, std::string>> Emitted;

void emit(std::string const & key, std::string const & data)
{
   std::cout << "  emit: (" << key.c_str() << ", " 
      << data.c_str() << ")\n";
   Emitted.push_back(std::make_pair(key, data));
}

void execute(std::string const & cmd)
{
   std::cout << cmd.c_str() << std::endl;
}

typedef std::pair<std::string, std::string> Pair;
typedef std::list<Pair> InputList;
typedef std::list<std::string> StringList;

void MapReduce(InputList const & input, 
   void (*map)(std::string const & key, std::string const & data),
   void (*reduce)(std::string const & key, StringList const & lst))
{
   // Distribute the map part to workers (here: do them sequentially)
   std::for_each(input.begin(), input.end(), [&](Pair const & p) {
      map(p.first, p.second);
   });
   // (--Wait for all workers to finish--)
   // Reshuffle emitted key/value pairs
   std::map<std::string, StringList> targetFiles;
   std::for_each(Emitted.begin(), Emitted.end(), 
      [&](Pair const & p) {
         targetFiles[p.first].push_back(p.second);
      });
   // Distribute the reduce part to workers (here: do them sequentially)
   std::for_each(targetFiles.begin(), targetFiles.end(), 
      [&](std::pair<std::string, StringList> const & p) {
         reduce(p.first, p.second);
      });
}

std::string ObjFileName(std::string const & srcFileName)
{
   std::string oFile = srcFileName.substr(0, srcFileName.length() - 3);
   return oFile + "obj";
}

// Maps target names to executable names
std::map<std::string, std::string> ExeNames;

void map(std::string const & file, std::string const & target)
{
   std::string cmd = "cl /c ";
   cmd += file;
   execute(cmd);
   emit(target, ObjFileName(file));
}

void reduce(std::string const & target, StringList const & objFiles)
{
   std::string cmd = "link /out ";
   cmd += ExeNames[target];
   std::for_each(objFiles.begin(), objFiles.end(), 
      [&](std::string const & obj) {
         cmd += " ";
         cmd += obj;
      });
   execute(cmd);
}

int main()
{
   // There are two targets
   ExeNames["MyApp"] = "MyApp.exe";
   ExeNames["YourApp"] = "YourApp.exe";
   // The input is a list of key/value pairs
   // Key: source file name
   // Data: target name
   InputList input;
   input.push_back(std::make_pair("foo.cpp", "MyApp"));
   input.push_back(std::make_pair("bar.cpp", "MyApp"));
   input.push_back(std::make_pair("baz.cpp", "YourApp"));

   MapReduce(input, &map, &reduce);
}

Here’s the output:

cl /c foo.cpp
  emit: (MyApp, foo.obj)
cl /c bar.cpp
  emit: (MyApp, bar.obj)
cl /c baz.cpp
  emit: (YourApp, baz.obj)
link /out MyApp.exe foo.obj bar.obj
link /out YourApp.exe baz.obj

The traditional lock-based approach to concurrent programming is error prone, hard to debug, and almost impossible to scale.

In my previous post I gave a preview of how concurrent programming might evolve in the future, and the topic of Software Transactional Memory (STM) came up quite naturally. STM has the potential to replace locking with an easy-to-use, virtually fool-proof, scalable paradigm for concurrent access.

It’s true that STM has its problems; performance and integration with existing languages being the serious ones. The hope is that performance can be improved, especially if there’s some support in hardware, and that a new batch of concurrent languages will incorporate STM more smoothly.

In this post I won’t be elaborating on the superiority of STM over other paradigms but rather concentrate on how it works. I hope my explanation will be more approachable than academic papers.

STM in a Nutshell

Let’s analyze a very simple example, a singleton pattern in C++ (I chose C++ because it exposes the use of pointers):

Foo * getFoo() {
    static Foo * pFoo = 0;
    if (pFoo == 0) // <- read
        pFoo = new Foo(); // <- write
    return pFoo;
}

As you might know, this code doesn’t work correctly in a multithreaded environment. Let’s have a look at why, but from a slightly unusual angle. The statement:

if (pFoo == 0)

executes a memory read of the variable pFoo. The value that’s been read is usually stored in a register–in other words, is cached–at least until it’s compared with zero. Suppose that we read zero (a null pointer). We then proceed with the allocation and construction of the Foo object. We update the variable pFoo (a memory write) and return it to the caller.

The important thing is that the write to pFoo is only correct if the condition pFoo == 0 is maintained up to the moment when the new value is stored (we don’t even think of such subtleties in sequential programming).

But that condition may be broken by another thread storing its own creation in pFoo. It’s a classic data race, and its consequences may vary from a simple memory leak to a deadly memory fault (that might happen on some processors if the object Foo manages its own internal pointers). In general, it’s the gaps between logically related reads and writes that leave room for other threads to break our assumptions. If we could only concentrate those reads and writes into one atomic action, we’d be safe.

Let me propose a crazy solution: Before performing the write to pFoo, why don’t we re-check, or validate, our original read (assume that it’s still in the register). If the value in memory is still equal to the value in the register, we proceed with the write; otherwise we back off and try again.

Wait a moment! Aren’t we just postponing the problem? What about another thread messing with pFoo right between the read validation and the write? True, that is a problem, and we have to figure out a way to make the sequence read-validation/final-write atomic. But the beauty of this crazy approach is that it can be made universal. The code that does generic read verification may be compiled into the language runtime. It may be implemented once and for all, thoroughly tested, and used whenever concurrent access is needed.
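In fact, for a single memory word, this read-validate-write sequence is exactly what the hardware compare-and-swap (CAS) instruction gives us. Here is a sketch of the singleton written with C++11 atomics (my illustration of the idea, assuming the same Foo class as above):

#include <atomic>

std::atomic<Foo *> pFoo(0); // the pointer is now atomic

Foo * getFoo() {
    Foo * p = pFoo.load(); // read
    if (p == 0) {
        Foo * tmp = new Foo();
        // Re-validate the read and perform the write in one atomic step
        if (pFoo.compare_exchange_strong(p, tmp))
            p = tmp;    // our write won
        else
            delete tmp; // we lost the race; p now holds the other Foo
    }
    return p;
}

STM generalizes this validate-then-write step from a single word to an arbitrary set of reads and writes.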

All the programmer has to do is to inform the compiler that a logically related group of reads and writes must be done atomically; for instance, by enclosing the relevant code in the atomic block (not actual C++!):

Foo * getFoo() {
   static Foo * pFoo = 0;
   atomic {
      if (pFoo == 0) // read
           pFoo = new Foo(); // write(s), possibly reads
   }
   return pFoo;
}

Here’s what happens inside the atomic block (there are many variations of the algorithm but the idea is the same).

  • First, a transaction is started–the runtime allocates a data structure containing two logs.
  • Every memory read inside the atomic block is logged in a log called the read set.
  • Every memory write, instead of going directly to memory, is written into the write set (there are also versions of STM that let the writes go through, to be undone if the transaction aborts).
  • At the end of the atomic block the system attempts to commit the transaction. It does that by first verifying the read log and then performing the writes. (There is a very fine-grained system of locking in play, which I’ll describe shortly.)
  • If the read verification discovers a mismatch, the transaction is aborted and repeated from scratch, until it succeeds.
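Schematically, the instrumented code might desugar into a retry loop like the one below. The Transaction type and the stm* calls are hypothetical stand-ins for the runtime, not a real API; a real implementation would also instrument the reads and writes inside the Foo constructor.

Foo * getFoo() {
    static Foo * pFoo = 0;
    for (;;) {                        // retry until the commit succeeds
        Transaction tx = stmBegin();  // fresh read set and write set
        Foo * p = stmRead(tx, &pFoo); // logged in the read set
        if (p == 0) {
            p = new Foo();
            stmWrite(tx, &pFoo, p);   // buffered in the write set
        }
        if (stmCommit(tx))            // validate reads, publish writes
            return p;
        // commit failed: the logs are discarded and we go around again
    }
}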

Some Details

There is one particular implementation of STM that’s been most widely accepted: Transactional Locking II (TL2), which I will summarize here.

Memory Layout

Imagine that for every single word of main memory you also have a separate data structure that stores the version number and a write lock for that word. How this can be done efficiently is an important implementation detail that I will explain later. For each memory word, its version number is initially set to zero and its lock is open.

Transaction

Now let’s start a transaction. We allocate a thread-local transaction object that has a read set and a write set, initially empty. We assign a version number to the transaction by (atomically) sampling the global “version clock” (it’s called a clock because it can only move forward). This will be our transaction’s read version number.

The compiler has conveniently instrumented all reads and writes inside the scope of the transaction (that is, in the atomic block). Each read adds an entry to our read set, storing the address of the word being read. Each write adds an entry to our write set, storing the address and the value to be written. The value in main memory is not changed at this point–the write is virtual.

Actually, before every read, we have to access our write set in case we have already written to that location. Moreover, we need to check that the location is not locked by another committing transaction, and that its version is less than or equal to our transaction’s read version. This is why we need all those per-location locks and version numbers. If the location is locked, or if its version number is greater than our read version, we abort the transaction and repeat the whole process again.

These checks are costly, and they are optimized to just a few instructions in the common case. So, for instance, the write-set check is implemented using a Bloom filter. Before every read we sample the corresponding lock and the version number (in one load, since it’s one word) and keep it for later. Then we check our write set using the Bloom filter. If we don’t find the corresponding entry, we just read the value directly from memory, otherwise we read it from our write log. Then we store the read address in our read set, and finally check the version number/lock that we have previously sampled. (On some processors, read barriers are required when accessing the locks.)
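In pseudo-real C++, an instrumented read might look roughly like this. Everything here (Txn, VersionedLock, the helper functions) is a hypothetical stand-in following the steps just described, not actual TL2 code.

intptr_t stmRead(Txn & tx, intptr_t const * addr)
{
    VersionedLock sample = sampleLock(addr); // lock + version in one load
    intptr_t value;
    if (tx.writeFilter.mightContain(addr)    // Bloom filter: maybe ours?
        && tx.writeSet.lookup(addr, value))  // confirmed: our own write
    {
        // value was filled in from our write log
    }
    else
    {
        value = *addr;                       // read main memory directly
    }
    tx.readSet.push_back(addr);              // log the address
    if (sample.locked ||                     // a committing writer?
        sample.version > tx.readVersion)     // newer than our snapshot?
        abortAndRetry(tx);                   // restart the transaction
    return value;
}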

Going back to our singleton example: in Fig 1 you see the state of the transaction right before committing. We have read the value of pFoo (directly from memory, since there was no entry in our write set yet) and logged the address of pFoo in our read set. We have checked that the version of pFoo, 6, was less than the version of our transaction, 8. We have also allocated a new Foo in memory. During the construction of Foo, two of its fields were (virtually) written, and they have been saved in the write set together with the new value of pFoo (the pointer to the new Foo).


Fig 1. An STM transaction before commit. Every location in main memory has a corresponding version/lock entry. Both the read set and the write set have an entry for pFoo. The write set also stores the new value of pFoo, which is a pointer to the newly allocated Foo.

At commit time, we lock all our write locations (there are three of those) using the per-word locks. Since we might be racing with other transactions, we should be careful not to cause deadlocks. The simplest precaution is to use bounded spinlocks. If we fail to acquire any lock within a set timeout, we abort the transaction.

Committing the Transaction

Sequencing and Validation

During read set validation, we check the lock and the version number of each location that’s been read by our transaction. The problem is that this operation is not atomic (we don’t lock read locations). While we are validating, other transactions might modify the already validated locations. I’ll show you that it really doesn’t matter in the big scheme of things.

In the big scheme of things, there is a sequence of transactions, each making an atomic update. So our program’s memory switches from one consistent state to another. When, precisely, this jump occurs is called the “sequence point.” The sequence point for each transaction is reached after the write set is locked but before the read set is validated (more about this below).

Now imagine that we have two read locations, r1 and r2. We have validated r1 and, while we’re validating r2, a stray transaction commits a write to r1. What can be said about the sequence points of the two transactions? Our sequence point was set before we started read validation. Since we have successfully validated r1, it hadn’t been locked by the stray transaction. It means that the other transaction’s sequence point must have been set after ours; and we don’t care about “future” transactions. All we are concerned about at this point is whether an earlier transaction, with an earlier sequence point (which is, however, greater than our read version), hasn’t overwritten those locations.

At this point we atomically increment and fetch the global version clock–this will be our transaction’s write version, not to be confused with the earlier read version. We haven’t committed the transaction yet but, if we commit it, this will be its official commit point, a.k.a. sequence point.

Now we can validate the read set at our leisure. We make sure that the read locations are not locked by another thread, and that their version numbers are still less than or equal to our transaction’s read version. This is a slightly tricky point–see the sidebar.
Sidebar:

  1. “Before every read we have to access our write set…” This is in case we have already written to this location during the current transaction. The write hasn’t been reflected in shared memory yet — it’s still sitting in our write set. We want to read this new value, not the original. So it’s not a conflict, but rather a consistency requirement.
  2. “The lock and version number are sampled in one load.” The version number is combined with the lock bit in a single memory word, so it can be loaded in one instruction. Now we have a copy of the version/lock locally and we can check it. If the test fails, we abort. If it succeeds, the read is considered valid.

In our example, the locks for pFoo and for the two fields inside the newly allocated Foo will be taken, the clock incremented to 9, and the read set validated.

The read set is small–just the location of the variable pFoo. It is not locked by another thread (although it’s locked by us, because it is also present in our write set), and the version number is still less than 8.

The final step is to commit the write set. Each location is updated and then unlocked. We also set the version number for each modified location to the write version of our transaction (the value of the version clock at our sequence point). In our case, we change the version number of pFoo from 6 to 9.
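Put together, the commit might look schematically like this, continuing the hypothetical stm* helpers from the earlier sketch. The version arithmetic follows the narrative above (the “even version numbers” trick comes later in the text), and the wv != rv + 1 test anticipates the optimization described under Optimizations below.

bool stmCommit(Txn & tx)
{
    if (!lockWriteSet(tx))               // bounded spinlocks
        return false;                    // timed out: abort and retry
    long wv = incrementAndFetch(GlobalVersionClock); // our sequence point
    if (wv != tx.readVersion + 1         // someone else committed, so...
        && !validateReadSet(tx))         // ...re-check locks and versions
    {
        unlockWriteSet(tx);
        return false;                    // a read was invalidated
    }
    std::for_each(tx.writeSet.begin(), tx.writeSet.end(), 
        [&](WriteEntry const & w) {
            *w.addr = w.value;           // publish the buffered write
            setVersionAndUnlock(w.addr, wv); // stamp with write version
        });
    return true;
}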

You might be wondering why we verify each read at least twice–once while executing the instrumented client code, and once during validation. After all, if we read the wrong version of the variable, the transaction will fail anyway during verification. The problem is that, after reading a bunch of inconsistent values, the client code might get into all kinds of trouble, including infinite loops or access violations. This used to be a major concern before TL2.

Reasons to Abort

Let’s consider what can go wrong in our example that would force us to abort (and retry).

  1. During the first read of pFoo:
    • We find it locked. That could happen if another transaction is just committing its write to pFoo.
    • Its version number is greater than our transaction’s read version number. That could happen if another transaction had just committed its own write.
  2. During pre-commit locking, we find pFoo locked by another transaction. (We let that transaction proceed.)
  3. During read validation of pFoo:
    • The location is locked by another transaction. (This can’t happen in our case because we have locked it ourselves.)
    • Its version number is greater than our read version number. This could happen if another transaction committed its write to pFoo after we have read it last.

You can easily convince yourself that in each case it really makes sense to abort the transaction. Moreover, when we retry the transaction, pFoo will have already been initialized by another thread. In that case the transaction will just breeze through, almost as fast as the lock-based implementation.

Optimizations

Note that during our transaction other non-conflicting transactions could have committed successfully. Each such transaction increments the version clock. This is why our write version could be arbitrarily larger than our read version. On the other hand, if the write version is just one unit larger than the read version, we know that we were alone, and we can skip the read validation altogether. In fact this is what would have happened in our example, where the read version was 8 and the write version was 9.

Another great optimization is read-only transactions. Transactions that don’t perform any writes don’t have to log the reads. It’s enough that they perform the lock/version check at the point of each read. If all reads are successful, the transaction is done. It has seen a consistent state of memory.

It’s worth noting that atomic operations in TL2 are non-blocking. They are usually implemented using the CAS instruction (Compare and Swap). The locking of individual memory words is done using bounded spinlocks.

Conserving Memory

Let’s see how we can implement all the locks and version numbers without doubling or tripling our computer’s memory. To begin with, we can combine the lock with the version number in one word–a versioned lock. We only need one bit for a spinlock and we can stick it at the bottom of the version number, as long as we only use even version numbers.

The second trick is to use a reasonably sized table of versioned locks and hash all memory locations into that table. Since there will be many locations that map into the same lock, spurious aborts may happen. Notice, however, that spurious aborts have no impact on the semantics of the program. Aborted transactions will be retried until they commit. In practice, such spurious conflicts are reasonably rare.
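Here is a sketch of how such a table might be laid out; the table size and the hash are my arbitrary choices, not TL2’s.

#include <atomic>
#include <cstdint>

// One word per entry: the low bit is the spinlock, the remaining bits
// hold the version number (kept even, so it never collides with the
// lock bit).
typedef std::atomic<std::uintptr_t> VersionedLock;

const std::size_t TableSize = 1 << 20; // an arbitrary power of two
VersionedLock LockTable[TableSize];

// Hash every memory address into the table. Distinct locations may share
// an entry; that can cause spurious aborts, but never wrong results.
VersionedLock & lockFor(void const * addr)
{
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(addr);
    return LockTable[(a >> 4) & (TableSize - 1)];
}

inline bool isLocked(std::uintptr_t entry) { return (entry & 1) != 0; }
inline std::uintptr_t versionOf(std::uintptr_t entry) { return entry & ~std::uintptr_t(1); }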

Performance

With the usual caveat that there are lies, damned lies, and benchmarks: the performance of STM clocks in at about half the performance of hand-crafted locks (this is more or less the consensus number–see the interview with Peyton-Jones and Harris for confirmation). Which is not bad, considering all the instrumentation that goes into it. But if you have a program that spends 90% of its time concurrently inserting and removing items from a red-black tree, you might consider hiring a team of highly experienced programmers and testers to create a faster, lock-based, or even lock-free, solution. But take into account an important fact: STM is very fine-grained. For instance, when you’re inserting an item into a tree, the STM transaction will only lock the nodes that you are actually modifying. STM will easily beat a solution that uses one global lock for the whole tree. Per-node manual locking is hard to implement correctly because of the risk of deadlocks.

By its optimistic nature, STM performs best under low contention scenarios where the vast majority of transactions commit on the first try.

What’s Next?

Will performance problems interfere with a wider acceptance of STM? There is an interesting analogy between garbage collection (GC) and STM put forward by Dan Grossman. Historically, performance concerns were a major obstacle in the acceptance of GC until a general-purpose language, Java, adopted it as the memory management tool. If the analogy holds, the same will happen to STM and shared-memory concurrency.

STM has a permanent spot, and probably the best implementation, in Haskell (doesn’t it remind you of the relationship between Lisp and GC?). There’s been some STM activity in C++, Java, and other languages (see Bibliography) but, without language support, STM doesn’t offer enough bang for the buck in terms of safety and ease of use. That’s why I’m really excited about the use of STM in a batch of new high-performance languages, in particular in Chapel. Will Chapel become to STM what Java became to GC? Only time will tell.

I’ve been following the progress of Chapel’s implementation of STM and talked to some of the developers. In distributed environments, STM can perform outstandingly well. That’s because the overhead of STM read/write logging (which is done locally) is dwarfed by the usual distributed communication costs. The trick is to piggyback STM communications on top of remote reads and writes, which have to happen anyway. Chapel also has a chance to implement a type system that makes STM safer to use. I hope to discuss those options in a future post.

Bibliography

  1. Herlihy and Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures
  2. Dice, Shalev, Shavit, Transactional Locking II
  3. Interview with Simon Peyton-Jones and Tim Harris about STM
  4. Harris, Fraser, Language Support for Lightweight Transactions–STM in Java
  5. TinySTM, a public domain implementation of STM for C and C++
  6. Deuce, a Java implementation of STM
  7. Bronson, Chafi, Olukotun, CCSTM: A Library-Based STM for Scala
  8. Harris, Marlow, Peyton Jones, Herlihy, Composable Memory Transactions–STM in Haskell.
  9. Sridharan, Vetter, Kogge, Scalable Software Transactional Memory for Global Address Space Architectures–STM in Chapel

Concurrent programming has been around for quite some time (almost half a century), but it was mostly accessible to the highest ranks of the programming priesthood. This changed when, in the 2000s, concurrency entered prime time, prodded by the ubiquity of multicore processors. The industry can no longer afford to pay for hand-crafted hand-debugged concurrent solutions. The search is on for programming paradigms that lead to high productivity and reliability.

Recently DARPA created its HPCS (High Productivity Computing Systems) program–notice the stress on productivity–and is funding research that has led to, among other things, the development of new programming languages that support concurrency. I had a peek at those languages and saw some very interesting developments, which I’d like to share with you. But first let me give you some perspective on the current state of concurrency.

Latency vs. Parallel Performance

We are living in the world of multicores and our biggest challenge is to make use of parallelism to speed up the execution of programs. As Herb Sutter famously observed, we can no longer count on the speed of processors increasing exponentially. If we want our programs to run faster we have to find ways to take advantage of the exponential increase in the number of available processors.

Even before multicores, concurrency had been widely used to reduce latencies and to take advantage of the parallelism of some peripherals (e.g., disks). Low latency is particularly important in server architectures, where you can’t wait for the completion of one request before you pick up the next one.

Initially, the same approach that was used to reduce latency–creating threads and using message-passing to communicate between them–has been used to improve parallel performance. It worked, but it turned out to be extremely tedious as a programming paradigm.

There will always be a niche for client/server architectures, which are traditionally based on direct use of threads and message passing. But in this post I will limit myself to strategies for maximizing program performance using parallelism. These strategies are becoming exponentially more important with time.

Threads? Who Needs Threads?

There are two extremes in multithreading. One is to make each thread execute different code–they all have different thread functions (MPMD–Multiple Programs Multiple Data, or even MPSD–Multiple Programs Single Data). This approach doesn’t scale very well with the number of cores–after all there is a fixed number of thread functions in one program no matter how many cores it’s running on.

The other extreme is to spawn many threads that essentially do the same thing (SPMD, Single Program Multiple Data). At the very extreme we have SIMD (Single Instruction Multiple Data), the approach taken by, e.g., graphics accelerators. There, multiple cores execute the same instructions in lockstep. The trick is to start them with slightly different initial states and data sets (the MD part), so they all do independently useful work. With SPMD, you don’t have to write lots of different thread functions, and you have a better chance at scaling with the number of cores.

There are two kinds of SPMD. When you start with a large data set and split it into smaller chunks and create separate threads to process each chunk, it’s called data-driven parallelism. If, on the other hand, you have a loop whose iterations don’t depend on each other (no data dependencies), and you create multiple threads to process individual iterations (or groups thereof), it’s called control-driven parallelism.
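As a trivial illustration of data-driven parallelism done by hand, here is a sketch in bare C++11: split an array into two chunks and sum them in parallel. This is exactly the kind of boilerplate that the approaches discussed below try to hide.

#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

long parallelSum(std::vector<long> const & v)
{
    std::size_t mid = v.size() / 2;
    long lo = 0, hi = 0;
    // The first chunk goes to a helper thread
    std::thread t([&] {
        lo = std::accumulate(v.begin(), v.begin() + mid, 0L);
    });
    // The second chunk is processed by the current thread
    hi = std::accumulate(v.begin() + mid, v.end(), 0L);
    t.join(); // the barrier: wait before combining the results
    return lo + hi;
}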

These types of parallelism are becoming more and more important, because they allow programs to be written independently of the number of available cores. And, more importantly, they can be partially automated (see my blog on semi-implicit parallelism).

In the past, to exploit parallelism, you had to create your own thread pools, assign tasks to them, balance the loads, and so on, by hand. Having this capacity built into the language (or a library) takes your mind off the gory details. You gain productivity not by manipulating threads but by identifying the potential for parallelism and letting the compiler and the runtime take care of the details.

In task-driven parallelism, the programmer is free to pick an arbitrary granularity for potential parallelism, rather than being forced into the large grain of system threads. As always, one more degree of indirection solves the problem. The runtime may choose to multiplex tasks between threads and implement work-stealing queues, as is done in Haskell, the TPL, or TBB. Programming with tasks rather than threads is also less obtrusive, especially if it has direct support in the language.

Shared Memory or Message Passing?

Threads share memory, so it’s natural to use shared memory for communication between threads. It’s fast but, unfortunately, plagued with data races. To avoid races, access to shared (mutable) memory must be synchronized. Synchronization is usually done using locks (critical sections, semaphores, etc.). It’s up to the programmer to come up with locking protocols that eliminate races and don’t lead to deadlocks. This, unfortunately, is hard, very hard. It’s definitely not a high-productivity paradigm.

One way to improve the situation is to hide the locking protocols inside message queues and switch to message passing (MP). The programmer no longer worries about data races or deadlocks. As a bonus, MP scales naturally to distributed, multi-computer, programming. Erlang is the prime example of a programming language that uses MP exclusively as its concurrency paradigm. So why isn’t everybody happy with just passing messages?

Unfortunately, not all problems lend themselves easily to MP solutions–in particular, data-driven parallelism is notoriously hard to express using message passing. And then there is the elephant in the room–inversion of control.

We think linearly and we write (and read) programs linearly–line by line. The more non-linear the program gets, the harder it is to design, code, and maintain it (gotos are notorious for delinearizing programs). We can pretty easily deal with some of the simpler MP protocols–like the synchronous one. You send a message and wait for the response. In fact, object-oriented programming is based on such a protocol–in Smalltalk parlance you don’t “call” a method, you “send a message” to an object. Things get a little harder when you send an asynchronous message, then do some other work, and finally “force” the answer (that’s how futures work); although that’s still manageable. But if you send a message to one thread and set up a handler for the result in another, or have one big receive or select statement at the top to process heterogeneous messages from different sources, you are heading towards the land of spaghetti. If you’re not careful, your program turns into a collection of handlers that keep firing at random times. You are no longer controlling the flow of execution; it’s the flow of messages that’s controlling you. Again, programmer productivity suffers. (Some research shows that the total effort to write an MPI application is significantly higher than that required to write a shared-memory version of it.)

Back to shared memory, this time without locks, but with transactional support. STM, or Software Transactional Memory, is a relatively new paradigm that’s been successfully implemented in Haskell, where the type system makes it virtually fool-proof. So far, implementations of STM in other languages haven’t been as successful, mostly because of problems with performance, isolation, and I/O. But that might change in the future–at least that’s the hope.

What is great about STM is the ease of use and reliability. You access shared memory, either for reading or writing, within atomic blocks. All the code inside an atomic block executes as if there were no other threads, so there’s no need for locking. And, unlike lock-based programming, STM scales well.

There is a classic STM example in which, within a single atomic block, money is transferred between two bank accounts that are concurrently accessed by other threads. This is very hard to do with traditional locks, since you must lock both accounts before you do the transfer. That forces you not only to expose those locks but also puts you at risk of deadlock.
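In code, the example looks like this. The atomically helper below is a toy stand-in I provide so that the snippet compiles; a real STM runtime would log the accesses and retry the body on conflict.

// Toy stand-in for an STM runtime entry point (my illustration)
template <class F> void atomically(F body) { body(); }

struct Account { int balance; };

void transfer(Account & from, Account & to, int amount)
{
    atomically([&] {
        from.balance -= amount; // both updates become visible together...
        to.balance   += amount; // ...or not at all, under real STM
    });
}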

As far as programmer productivity goes, STM is a keeper.

Three HPCS Languages

Yesterday’s supercomputers are today’s desktops and tomorrow’s phones. So it makes sense to look at the leading edge in supercomputer research for hints about the future. As I mentioned in the introduction, there is a well-funded DARPA program, HPCS, to develop concurrent systems. There were three companies in stage 2 of this program, and each of them decided to develop a new language:

  • Cray: Chapel
  • IBM: X10
  • Sun: Fortress

(Sun didn’t get to the third stage, but Fortress is still under development.)

I looked at all three languages and was surprised at how similar they were. Three independent teams came up with very similar solutions–that can’t be a coincidence. For instance, all three adopted the shared-address-space abstraction. Mind you, those languages are supposed to cover a large area: from single-computer multicore programming (in some cases even on a SIMD graphics processor) to distributed programming over huge server farms. You’d think they’d be using message passing, which scales reasonably well between the two extremes. And indeed they do, but without exposing it to the programmer. Message passing is a hidden implementation detail; it’s considered too low-level to bother the programmer with.

Running on PGAS

All three HPCS languages support the shared address space abstraction through some form of PGAS, Partitioned Global Address Space. PGAS provides a unified way of addressing memory across machines in a cluster of computers. The global address space is partitioned into various locales (places in X10 and regions in Fortress) corresponding to local address spaces of processes and computers. If a thread tries to access a memory location within its own locale, the access is direct and shared between threads of the same locale. If, on the other hand, it tries to access a location in a different locale, messages are exchanged behind the scenes. Those could be inter-process messages on the same machine, or network messages between computers. Obviously, there are big differences in performance between local and remote accesses. Still, they may look the same from the programmer’s perspective. It’s just one happy address space.

By now you must be thinking: “What? I have no control over performance?” That would be a bad idea indeed. Don’t worry, the control is there, either explicit (locales are addressable) or in the form of locality awareness (affinity of code and data) or through distributing your data in data-driven parallelism.

Let’s talk about data parallelism. Suppose you have to process a big 2-D array and have 8 machines in your cluster. You would probably split the array into 8 chunks and spread them between the 8 locales. Each locale would take care of its chunk (and possibly some marginal areas of overlap with other chunks). If you’re expecting your program to also run in other cluster configurations, you’d need more intelligent partitioning logic. In Chapel, there is a whole embedded language for partitioning domains (index sets) and specifying their distribution among locales.

To make things more concrete, let’s say you want to distribute a (not so big) 4×8 matrix among currently available locales by splitting it into blocks and mapping each block to a locale. First you want to define a distribution–the prescription of how to distribute a block of data between locales. Here’s the relevant code in Chapel:

const Dist = new dmap(new Block(boundingBox=[1..4, 1..8]));

A Block distribution is created with a bounding rectangle of dimensions 4×8. This distribution is passed as an argument to the constructor of a domain map. If, for instance, the program is run on 8 locales, the bounding box will be split into 8 regions of 2×2 elements, each assigned to a different locale. Libraries are supposed to provide many different distributions–block distribution being the simplest and the most useful of them.

When you apply the above map to a domain, you get a mapped domain:

var Dom: domain(2) dmapped Dist = [1..4, 1..8];

Here the variable Dom is a 2-D domain (a set of indices) mapped using the distribution Dist that was defined above. Compare this with a regular local domain–a set of indices (i, j), where i is between 1 and 4 (inclusive) and j is between 1 and 8.

var Dom: domain(2) = [1..4, 1..8];

Domains are used, for instance, for iteration. When you iterate over an unmapped domain, all calculations are done within the current locale (possibly in parallel, depending on the type of iteration). But if you do the same over a mapped domain, the calculations will automatically migrate to different locales.
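
Here’s what that looks like–a sketch of my own, continuing the snippets above and assuming the same vintage of Chapel syntax: declare an array over the mapped domain Dom and iterate over it with forall.

// The array A inherits Dom's distribution: each element is
// stored on the locale that owns its index.
var A: [Dom] real;

// Iterations execute in parallel and run on the locale that
// owns A[i, j], so the computation follows the data.
forall (i, j) in Dom do
  A[i, j] = i * 10.0 + j;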

This model of programming is characterized by a very important property: separation of algorithm from implementation. You separately describe implementation details, such as the distribution of your data structure between threads and machines; but the actual calculation is coded as if it were local and performed over monolithic data structures. That’s a tremendous simplification and a clear productivity gain.

Task-Driven Parallelism

The processing of very large (potentially multi-dimensional) arrays is very useful, especially in scientific modeling and graphics. But there are also many opportunities for parallelism in control-driven programs. All three HPCS languages chose fine-grained tasks (activities in X10, implicit threads in Fortress), not threads, as their unit of parallel execution. A task may be mapped to a thread by the runtime system but, in general, this is not a strict requirement. Bunches of small tasks may be bundled for execution within a single system thread. Fortress went the furthest in making parallelism implicit–even the for loop is by default parallel.

From the programmer’s perspective, task-driven parallelism doesn’t expose threads (there is no need for a fork statement or other ways of spawning threads). You simply start a potentially parallel computation. In Fortress you use a for loop or put separate computations in a tuple (tuples are evaluated in parallel, by default). In Chapel, you use the forall statement for loops or begin to start a task. In X10 you use async to mark parallelizable code.

What that means is that you don’t have to worry about how many threads to spawn, how to manage a thread pool, or how to balance the load. The system will spawn the threads for you, depending on the number of available cores, and it will distribute tasks between them and take care of load balancing, etc. In many cases it will do better than a hand-crafted thread-based solution. And in all cases the code will be simpler and easier to maintain.

Global View vs. Fragmented View

If you were to implement parallel computation using traditional methods, for instance MPI (Message Passing Interface), instead of allocating a single array you’d allocate multiple chunks. Instead of writing an algorithm to operate on this array you’d write an algorithm that operates on chunks, with a lot of code managing boundary cases and communication. Similarly, to parallelize a loop you’d have to partially unroll it and, again, take care of such details as the uneven tail, etc. This approach results in a fragmented view of the problem.

What HPCS languages offer is global view programming. You write your program in terms of data structures and control flows that are not chopped up into pieces according to where they will be executed. The global view approach results in clearer programs that are easier to write and maintain.

Synchronization

No Synchronization?

A lot of data-driven algorithms don’t require much synchronization. Many scientific simulations use read-only arrays for input, and make only localized writes to output arrays.

Consider for instance how you’d implement the famous Game of Life. You’d probably use a read-only array for the previous snapshot and a writable array for the currently evaluated state. Both arrays would be partitioned in the same way between locales. The main loop would go over all array elements and concurrently calculate each one’s new state based on the previous state of its nearest neighbors. Notice that while the neighbors are sometimes read concurrently by multiple threads, the output is always stored once. The only synchronization needed is a thread barrier at the end of each cycle.

The current approach to synchronization in HPCS languages is the biggest disappointment to me. Data races are still possible and, since parallelism is often implicit, harder to spot.

The biggest positive surprise was that all three endorsed transactional memory, at least syntactically, through the use of atomic statements. They didn’t solve the subtleties of isolation, so safety is not guaranteed (if you promise not to access the same data outside of atomic transactions, the transactions are isolated from each other, but that’s all).

The combination of STM and PGAS in Chapel necessitates the use of distributed STM, an area of active research (see, for instance, Software Transactional Memory for Large Scale Clusters).

In Chapel, you not only have access to atomic statements (which are still in the implementation phase) and barriers, but also to low-level synchronization primitives such as sync and single variables–somewhat similar to locks and condition variables. The reasoning is that Chapel wants to provide multi-resolution concurrency support. The low-level primitives let you implement concurrency in the traditional style, which might come in handy, for instance, in MPMD situations. The high-level primitives enable global view programming that boosts programmer productivity.

However, no matter what synchronization mechanisms are used (including STM), if the language doesn’t enforce their use, programmers end up with data races–the bane of concurrent programming. The time spent debugging racy programs may significantly cut into, or even nullify, potential productivity gains. Fortress is the only language that attempted to keep track of which data is shared (and, therefore, requires synchronization) and which is local. None of the HPCS languages tried to tie sharing and synchronization to the type system in the way it is done, for instance, in the D programming language (see also my posts about race-free multithreading).

Conclusions

Here’s the tongue-in-cheek summary of the trends which, if you believe that the HPCS effort provides a glimpse of the future, will soon be entering the mainstream:

  1. Threads are out (demoted to latency controlling status), tasks (and semi-implicit parallelism) are in.
  2. Message passing is out (demoted to implementation detail), shared address space is in.
  3. Locks are out (demoted to low-level status), transactional memory is in.

I think we’ve been seeing the twilight of thread-based parallelism for some time (see my previous post on Parallel Programming with Hints). It’s just not the way to fully exploit hardware concurrency. Traditionally, if you wanted to increase the performance of your program on multicore machines, you had to go into the low-level business of managing thread pools, splitting your work between processors, etc. This is now officially considered the assembly language of concurrency and has no place in high-level programming languages.

Message passing’s major flaw is the inversion of control–it is the moral equivalent of goto in unstructured programming (it’s about time somebody said that message passing is considered harmful). MP still has its applications and, used in moderation, can be quite handy; but PGAS offers a much more straightforward programming model–its essence being the separation of implementation from algorithm. The Platonic ideal would be for the language to figure out the best parallel implementation for a particular algorithm. Since this is still a dream, the next best thing is getting rid of the interleaving of the two in the same piece of code.

Software transactional memory has been around for more than a decade now and, despite some negative experiences, is still alive and kicking. It’s by far the best paradigm for synchronizing shared memory access. It’s unobtrusive, deadlock-free, and scalable. If we could only get better performance out of it, it would be ideal. The hope, though, is in moving some of the burden of TM to hardware.

Sadly, the ease of writing parallel programs using those new paradigms does not go hand in hand with improving program correctness. My worry is that chasing concurrency bugs will eat into productivity gains.

What are the chances that we’ll be writing programs for desktops in Chapel, X10, or Fortress? Probably slim. Chances are good, though, that the paradigms used in those languages will continue showing up in existing and future languages. You may have already seen task-driven libraries sneaking into the mainstream (e.g., the .NET TPL). There is a PGAS extension of C called UPC (Unified Parallel C), and there are dialects of several languages (Java, C, C#, Scala, etc.) that experiment with STM. Expect more in the future.

Acknowledgments

Special thanks go to Brad Chamberlain of the Cray Chapel team for his help and comments. As usual, a heated exchange with the Seattle Gang, especially Andrei Alexandrescu, led to numerous improvements in this post.

Bibliography

  1. DARPA’s High Productivity Computing Systems
  2. Cray Chapel
  3. IBM X10
  4. Sun Fortress
  5. Message Passing Interface specification
  6. Unified Parallel C, PGAS in C
  7. Microsoft Task Parallel Library
  8. Intel Thread Building Blocks
  9. HPC Programmer Productivity: A Case Study of Novice HPC Programmers
  10. Software Transactional Memory for Large Scale Clusters
  11. Scalable Software Transactional Memory for Global Address Space Architectures
  12. A Brief Retrospective on Transactional Memory
  13. Designing an Effective Hybrid Transactional Memory System
  14. Concurrency in the D Programming Language

I recently gave a presentation at the Northwest C++ Users Group on message passing. The video and the slides are available here.

Here’s my premise: I needed a flexible system of primitives for passing messages between threads in C++. After studying existing solutions in C++ (Boost, MPI) and in other languages (Haskell MVars and Caml channels and events) I created my own C++ mini-library. It’s general (no restrictions on message types or message storage), composable (allows waiting on multiple polymorphic channels), type-safe, and first-class (you may pass channels inside messages). Considering all this, the library turned out to be surprisingly simple. One thing I gave up on, and I think it was the right decision, is location transparency. Message passing between threads is fundamentally different from message passing between processes or network nodes. Performance and first-class-ness require substantially different interfaces, not to mention different implementations.


Wouldn’t it be nice to be able to write sequential programs and let the compiler or the runtime automatically find opportunities for parallel execution? Something like this is already being done on a micro scale inside processors. As much as possible, they try to execute individual instructions in parallel. Of course they have to figure out data dependencies and occasionally stall the pipeline or idle while waiting for a memory fetch. More sophisticated processors are even able to speculate: they guess a value that hasn’t been calculated or fetched yet and execute ahead speculatively. When the value is finally available, they compare it with their guess and either discard or commit the speculative execution.

If processors can do it, why can’t a language runtime do the same on a larger scale? It would solve the problem of effectively using all those cores that keep multiplying like rabbits on a chip.

The truth is, we haven’t figured out yet how to do it. Automatic parallelization is, in general, too complex. But if the programmer is willing to provide just enough hints to the compiler, the runtime might figure things out. Such a programming model is called semi-implicit parallelism and has been implemented in two very different environments: Haskell and .NET. The two relevant papers I’m going to discuss are Runtime Support for Multicore Haskell and The Design of a Task Parallel Library.

In both cases the idea is to tell the compiler that certain calculations may be done in parallel. It doesn’t necessarily mean that the code will be executed in multiple threads–the runtime makes this decision depending on the number of cores and their availability. The important thing is that, other than providing those hints, the programmer doesn’t have to deal with threads or, at least in Haskell, with synchronization. I will start with Haskell but, if you’re not into functional programming, you may skip to .NET and the Task Parallel Library (and hopefully come back to Haskell later).

Multicore Haskell

In Haskell, you hint at parallel execution using the par combinator, which you insert between two expressions, like this: e1 `par` e2. The runtime then creates a spark for the left-hand-side expression (here, e1). A spark is a deferred calculation that may be executed in parallel (in the .NET implementation a spark is called a task). Notice that, in Haskell, which is a lazy programming language, all calculations are, by design, deferred until their results are needed, at which point their evaluation is forced. The same mechanism kicks in when the result of a spark is needed–and it hasn’t been calculated in parallel yet. In such a case the spark is immediately evaluated in the current thread (thus forfeiting the chance for parallel execution). The hope, though, is that enough sparks will be ready before their results are needed, leading to an overall speedup.

To further control when the sparks are evaluated (whether in parallel or not), Haskell provides another combinator, pseq, which enforces sequencing. You insert it between two expressions, e1 `pseq` e2, to make sure that the left hand side, e1, is evaluated before the evaluation of e2 is started.
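
To see the two combinators working together before we get to map, here’s the canonical parallel Fibonacci–my sketch, assuming par and pseq from the parallel package’s Control.Parallel module:

import Control.Parallel (par, pseq)

-- Spark the first recursive call, evaluate the second one in the
-- current thread, and only then add the two results.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = a `par` (b `pseq` a + b)
  where
    a = pfib (n - 1)
    b = pfib (n - 2)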

I’ll show you how to parallelize the standard map function (in C++ it would be called std::transform). Map applies a function, passed to it as the first argument, to each element of a list, which is passed to it as the second argument. As a Haskell refresher, let me walk you through the implementation of map.

map f []     = []
map f (x:xs) = y:ys
    where y  = f x
          ys = map f xs

Map is implemented recursively, so its definition is split into the base case and the recursive case. The base case just states that map applied to an empty list, [], returns an empty list (it ignores the first argument, f).

If, on the other hand, the list is non-empty, it can be split into its head and tail. This is done through pattern matching–the pattern being (x:xs), where x matches the head element and xs the (possibly empty) tail of the list.

In that case, map is defined to return a new list, (y:ys), whose head is y and whose tail is ys. The where clause defines those two: y is the result of the application of the function f to x, and ys is the result of the recursive application of map to the tail of the list, xs.

The parallel version does the same (it is semantically equivalent to the sequential version), but it gives the runtime the opportunity to perform function applications in parallel. It also waits for the evaluation to finish.

parMap f []     = []
parMap f (x:xs) = y `par` (ys `pseq` y:ys)
    where y  = f x
          ys = parMap f xs

The important changes are: y, the new head, may be evaluated in parallel with the tail (the use of the par combinator). The result, y:ys, is returned only when the tail part, ys, has been evaluated (the use of the pseq combinator).

The tail calculation is also split into parallel computations through recursive calls to parMap. The net result is that all applications of f to elements of the list are potentially done in parallel. Because of the use of pseq, all the elements (except for the very first one) are guaranteed to have been evaluated before parMap returns.

It’s instructive to walk through the execution of parMap step-by-step. For simplicity, let’s perform parMap on a two-element list, [a, b].

First we pattern-match this list to x = a and xs = [b]. We create the first spark for the evaluation of (y = f a) and then proceed with the evaluation of the right hand side of par, (ys `pseq` y:ys). Here ys = parMap f [b].

Because of the `pseq`, we must evaluate ys next. To do that, we call (parMap f [b]). Now the list [b] is split into the head, b, and the empty tail, []. We create a spark to evaluate y' = f b and proceed with the right-hand side, (ys' `pseq` y':ys').

Again, the `pseq` waits for the evaluation of ys' = parMap f []. But this one is easy: we apply the base definition of parMap, which returns an empty list.

Now we are ready to retrace our steps. The right-hand side of the last `pseq` re-forms the list y':[]. But that’s the ys the previous `pseq` was waiting for. It can now proceed, producing y:(y':[]), which is the same as [y, y'], or [f a, f b]–just what we expected.

Notice the complete absence of explicit synchronization in this code. This is due to the functional nature of Haskell: there’s no shared mutable state, so no locking or atomic operations are needed. (More explicit concurrency models are also available in Haskell, using MVars or transactional memory.)
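
If you want to experiment with this, here’s a self-contained version. The fib workload and the compilation recipe are my own suggestions: you need GHC’s threaded runtime (compile with something like ghc -O2 -threaded, run with +RTS -N) for the sparks to be picked up by multiple cores.

import Control.Parallel (par, pseq)

-- The parallel map from above, verbatim.
parMap :: (a -> b) -> [a] -> [b]
parMap f []     = []
parMap f (x:xs) = y `par` (ys `pseq` y : ys)
  where y  = f x
        ys = parMap f xs

-- Deliberately expensive work, so the sparks aren't trivial.
fib :: Integer -> Integer
fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (sum (parMap fib [24 .. 32]))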

Task Parallel Library in .NET

It’s no coincidence that many ideas from Haskell end up in Microsoft languages. Many Haskell programmers work for Microsoft Research, including the ultimate guru, Simon Peyton Jones. The Microsoft Task Parallel Library (TPL) translates the ideas from Multicore Haskell to .NET. One of its authors, Daan Leijen, is a Haskell programmer who, at some point, collaborated with Simon Peyton Jones. Of course, a .NET language like C# presents a different set of obstacles to parallel programming. It operates on mutable state, which needs protection from concurrent access. This protection (which, incidentally, is the hardest part of multithreaded programming) is left to the programmer.

Here’s an example of an algorithm in C# with hidden opportunities for parallel implementation. MatrixMult multiplies two matrices. It iterates over columns and rows of the result matrix. The value that goes at their intersection is calculated by the innermost loop.

void MatrixMult(int size, double[,] m1,double[,] m2, double[,] result)
{
   for(int i = 0; i < size; i++){
      // calculate the i'th column
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   }
}

Each column of the result could potentially be evaluated in parallel. The problem is, the size of the array and the number of processor cores might be unknown until the program is run. Creating a large number of threads when there are only a few cores may lead to a considerable slowdown, which is the opposite of what we want. So the goal of TPL is to let the programmer express the potential for parallel execution but leave it to the runtime to create an optimal number of threads.

The programmer splits the calculation into tasks (the equivalent of Haskell sparks) by making appropriate library calls; and the runtime maps those tasks into OS threads–many-to-one, if necessary.

Here’s how the same function looks with parallelization hooks.

void ParMatrixMult(int size, double[,] m1,double[,] m2, double[,] result)
{
   Parallel.For(0, size, delegate(int i)
   {
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   });
}

Because of clever formatting, this version looks very similar to the original. The outer loop is replaced by the call to Parallel.For, which is one of the parallelizing TPL functions. The inner loops are packed into a delegate.

This delegate is assigned to a task (the analog of a Haskell spark) that is potentially run in a separate thread. Here the delegate is actually a closure–it captures the local variables size, m1, m2, and result. The latter is actually modified inside the delegate. This is how shared mutable state sneaks into potentially multi-threaded execution. Luckily, in this case, such sharing doesn’t cause races. Consider however what would happen if we changed the types of the matrices from double[,] to char[,]. Parallel updates to neighboring byte-sized array elements may inadvertently encroach on each other and lead to incorrect results. Programmer beware! (This is not a problem in Haskell because of the absence of mutable state.)

But even if the programmer is aware of potential sharing and protects shared variables with locks, it’s not the end of the story. Consider this example:

int sum = 0;
Parallel.For(0, 10000, delegate(int i)
{
   if(isPrime(i)){
      lock(this) { sum += i; }
   }
});

The captured variable, sum, is protected by the lock, so data races are avoided. This lock, however, becomes a performance bottleneck–it is taken for every prime number in the range.

Now consider the fact that, on a 4-core machine, we’ll be running 10000 tasks distributed between about 4 threads. It would be much more efficient to accumulate the sum in four local variables–no locking necessary–and add them together only at the end of the calculation. This recipe can be expressed abstractly as a map/reduce type of algorithm (a generalization of the C++ std::accumulate). The tasks are mapped into separate threads, which work in parallel, and the results are then reduced into the final answer.

Here’s how map/reduce is expressed in TPL:

int sum = Parallel.Aggregate(
  0, 10000, // domain
  0, // initial value
  delegate(int i){ return (isPrime(i) ? i : 0); },
  delegate(int x, int y){ return x+y; }
);

The first delegate, which is run by 10000 tasks, does not modify any shared state–it just returns its result, which is internally accumulated in some hidden local variable. The second delegate–the “reduce” part of the algorithm–is called when there’s a need to combine results from two different tasks.

The Place for Functional Programming

Notice that the last example was written in a very functional style. In particular, you don’t see any mutable state. The delegates are pure functions. This is no coincidence: functional programming has many advantages in parallel programming.

I’ve been doing a lot of multi-threaded programming in C++ lately and I’ve noticed how my style is gradually shifting from object-oriented towards functional. This process is accelerating as functional features keep seeping into the C++ standard. Obviously, lambdas are very useful, but so is move semantics, made possible by rvalue references, especially in passing data between threads. It’s becoming more and more obvious that, in order to be a good C++ programmer, one needs to study other languages. I recommend Haskell and Scala in particular. I’ll be blogging about them in the future.

Bibliography

  1. Simon Marlow, Simon Peyton Jones, and Satnam Singh, Runtime Support for Multicore Haskell
  2. Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt, The Design of a Task Parallel Library
  3. A video introduction to Haskell by Simon Peyton Jones: Part I, Part II, together with the slides.

What’s the use of theory if you can’t apply it in practice? I’ve been blogging about concurrency, starting with the intricacies of multicore memory models, all the way through to experimental thread-safe type systems and obscure languages. In real life, however, I do most of my development in C++, which offers precious little support for multithreading. And yet I find that my coding is very strongly influenced by theoretical work.

C++ doesn’t support a type system based on ownership, which is a way to write provably race-free code; in fact, it has no notion of immutability and only very limited support for uniqueness. But that doesn’t stop me from considering ownership properties in my design and implementation. A lot of things become much clearer and less error-prone with a better understanding of high-level paradigms.

I own a tiny software company, Reliable Software, which makes a distributed version control system, Code Co-op, written entirely in C++. Recently I had to increase the amount of concurrency in one of its components. I reviewed existing code, written some years ago, and realized how unnecessarily complex it was. No, it didn’t have data races or deadlocks, but it wasn’t clear how it would fare if more concurrency were added. I decided to rewrite it using the new understanding.

As it turned out, I was able to make it much more robust and maintainable. I found many examples of patterns based on ownership. I was able to better separate shared state from the non-shared, immutable from mutable, values from references. I was able to put synchronization exactly where it was needed and remove it from where it wasn’t. I’ve found reasonable trade-offs between message passing and data sharing.

It was a pretty amazing trip. I described it in my other blog, which deals more with actual programming experience. I thought it might be a nice break from all the theory you normally find in this blog.


I’ve blogged before about the C++ unique_ptr not being unique and how true uniqueness can be implemented in an ownership-based type system. But I’ve been just scratching the surface.

The new push toward uniqueness is largely motivated by the demands of multithreaded programming. Unique objects are alias free and, in particular, cannot be accessed from more than one thread at a time. Because of that, they never require locking. They can also be safely passed between threads without the need for deep copying. In other words, they are a perfect vehicle for safe and efficient message passing. But there’s a rub…

The problem is this: How do you create and modify unique objects that have internal pointers? A classic example is a doubly linked list. Consider this Java code:

public class Node {
    public Node _next;
    public Node _prev;
}
public class LinkedList {
    private Node _head;
    public void insert(Node n) {
        n._next = _head;
        if (_head != null)
            _head._prev = n;
        _head = n;
    }
}

Suppose that you have a unique instance of an empty LinkedList and you want to insert a new link into it without compromising its uniqueness.

The first danger is that there might be external aliases to the node you are inserting–the node is not unique, it is shared. In that case, after the node is absorbed:

_head = n;

_head would be pointing to an alias-contaminated object. The list would “catch” aliases and that would break the uniqueness property.

The remedy is to require that the inserted node be unique too, and the ownership of it be transferred from the caller to the insert method. (Notice however that, in the process of being inserted, the node loses its uniqueness, since there are potentially two aliases pointing to it from inside the list–one is _head and the other is _head._prev. Objects inside the list don’t have to be unique–they may be cross-linked.)

The second danger is that the method insert might “leak” aliases. The tricky part is when we let the external node, n, store the reference to our internal _head:

n._next = _head;

We know that this is safe here because the node started unique and it will be absorbed into the list, so this alias will become an internal alias, which is okay. But how do we convince the compiler to certify this code as safe and reject code that isn’t? Type system to the rescue!

Types for Uniqueness

There have been several approaches to uniqueness using the type system. To my knowledge, the most compact and comprehensive one was presented by Haller and Odersky in the paper, Capabilities for External Uniqueness, which I will discuss in this post. The authors not only presented the theory but also implemented the prototype of the system as an extension of Scala. Since not many people are fluent in Scala, I’ll translate their examples into pseudo-Java, hopefully not missing much.

Both in Scala and Java one can use annotations to extend the type system. Uniqueness introduces three such annotations, @unique, @transient, and @exposed; and two additional keywords, expose and localize.

-Objects that are @unique

In the first approximation you may think of a @unique object as a leak-proof version of the C++ unique_ptr. Such an object is guaranteed to be “tracked” by at most one reference–no aliases are allowed. Also, no external references are allowed to point to the object’s internals and, conversely, the object’s internals may not reference any external objects. However, and this is a very important point, the insides of a @unique object may freely alias each other. Such a closed, cross-linked mess is called a cluster.

Consider, for instance, a (non-empty) @unique linked list. Its cluster consists of a cross-linked set of nodes. It’s relatively easy for the compiler to guarantee that no external aliases are created to a @unique list–the tricky part is to allow the manipulation of list internals without breaking its uniqueness (Fig 1 shows our starting point).

Fig 1. The linked list and the node form separate clusters

Look at the definition of insert. Without additional annotations we would be able to call it with a node that is shared between several external aliases. After the node is included in the list, those aliases would be pointing to the internals of the list thus breaking its uniqueness. Because of that, the uniqueness-savvy compiler will flag a call to such un-annotated insert on a @unique list as an error. So how can we annotate insert so that it guarantees the preservation of uniqueness?

-Exposing and Localizing

Here’s the modified definition of insert:

public void insert(@unique Node n) @transient {
    expose (this) { list =>
        var node = localize (n, list);
        node._next = list._head;
        if (list._head != null)
            list._head._prev = node;
        list._head = node;
    }
}

Don’t worry, most of the added code can be inferred by the compiler, but I make it explicit here for the sake of presentation. Let’s go over some of the details.

The node, n, passed to insert is declared as @unique. This guarantees that it forms its own cluster and that n is the only reference to it. Also, @unique parameters to a method are consumed by that method. The caller thus loses her reference to it (the compiler invalidates it), as demonstrated in this example:

@unique LinkedList lst = new @unique LinkedList();
@unique Node nd = new @unique Node();
lst.insert(nd);
nd._next; // error: nd has been consumed!

The method itself is annotated as @transient. It means that the this object is @unique, but not consumed by the method. In general, the @transient annotation may be applied to any parameter, not only this. You might be familiar with a different name for transient–borrowed.

Inside insert, the this parameter is explicitly exposed (actually, since the method is @transient, the compiler would expose this implicitly).

expose (this) { list => ... }

The new name for the exposed this is list.

Once a cluster is exposed, some re-linking of its constituents is allowed. The trick is not to allow any re-linking that would lead to the leakage of aliases. Here’s how it’s done: to guarantee no leakage, the compiler assigns the exposed object a special type–its original type tagged by a unique identifier. This identifier is created for the scope of each expose statement. All members of the exposed cluster are also tagged by the same tag. Since the compiler type-checks every assignment, it automatically makes sure that both sides have the same tag.

Now we need one more ingredient–bringing the @unique node into the cluster. This is done by localizing the parameter n to the same cluster as list.

var node = localize (n, list);

The localize statement does two things. It consumes n and it returns a reference to it that is tagged by the same tag as its second parameter. From that point on, node has the same tagged type as all the exposed nodes inside the list, and all assignments type-check.

Fig 2. The list has been exposed: all references are tagged. The node has been localized (given the same tag as the list). Re-linking is now possible without violating the type system.

Note that, in my pseudo-Java, I didn’t specify the type of node returned by localize. That’s because tagged types are never explicitly mentioned in the program. They are the realm of the compiler.

Functional Decomposition

The last example was somewhat trivial in that the code that worked on exposed objects fit neatly into one method. But a viable type system cannot impose restrictions on structuring the code. The basic requirement for any programming language is to allow functional decomposition–delegating work to separate subroutines, which can be re-used in other contexts. That’s why we have to be able to define functions that operate on exposed and/or localized objects.

Here’s an example from Haller/Odersky that uses recursion within the expose statement. append is a method of a singly-linked list:

void append(@unique SinglyLinkedList other) @transient
{
    expose(this) { list =>
        if (list._next == null)
            list._next = other; // localize and consume
        else
            list._next.append(other);
    }
}

In the first branch of the if statement, the @unique parameter, other, is (implicitly) localized and consumed. In the second branch, it is recursively passed to append. Notice an important detail: the subject of append, list._next, is not @unique–it is exposed. Its type has been tagged by a unique tag. But the append method is declared as @transient. It turns out that both unique and exposed arguments may be safely accepted as transient parameters (including the implicit this parameter).

Because of this rule, it’s perfectly safe to forgo the explicit expose inside a transient method. The append method may be thus simplified to:

void append(@unique SinglyLinkedList other) @transient
{
    // 'this' is implicitly exposed
    if (_next == null)
        _next = other; // localize and consume
    else
        _next.append(other);
}

Things get a little more interesting when you try to reuse append inside another method. Consider the implementation of insert:

void insert(@unique SinglyLinkedList other) @transient
{
    var locOther = localize(other, this);
    if (locOther != null)
    {
        locOther.append(_next);
        _next = locOther;
    }
}

The insert method is transient–it works on unique or exposed lists. It accepts a unique list, other, which is consumed by the localize statement. The this reference is implicitly exposed with the same tag as locOther, so the last statement _next=locOther type-checks. The only thing that doesn’t type-check is the argument to append, which is supposed to be unique, but here it’s exposed instead.

This time there is no safe conversion to help us, so if we want to be able to reuse append, we have to modify its definition. First of all, we’ll mark its parameter as @exposed. An exposed parameter is tagged by the caller. In order for append to work, the this reference must also be tagged by the caller–with the same tag. Otherwise the assignment, _next=other, inside append, wouldn’t type-check. It follows that the append method must also be marked as @exposed (when there is more than one exposed parameter, they all have to be tagged with the same tag).

Here’s the new version of append:

void append(@exposed SinglyLinkedList other) @exposed
{
    if (_next == null)
        _next = other; // both exposed under the same tag
    else
        _next.append(other); // both exposed under the same tag
}

Something interesting happened to append. Since it now operates on exposed objects, it’s the caller’s responsibility to expose and localize unique objects (that’s exactly what we did in insert). Interestingly, append will now also operate on non-annotated types. You may, for instance, append one non-unique list to another non-unique list and it will type-check! That’s because non-annotated types are equivalent to exposed types with a null tag–they form a global cluster of their own.

This kind of polymorphism (non-annotated/annotated) means that in many cases you don’t have to define separate classes for use with unique objects. What Haller and Odersky found is that almost all class methods in Scala’s collection library required only the simplest @exposed annotations, without changes to their implementation. That’s why they proposed applying the @exposed annotation to whole classes.

Conclusion

Every time I read a paper about Scala I’m impressed. It’s a language that has very solid theoretical foundations and yet is very practical–on a par with Java, whose runtime it uses. I like Scala’s approach towards concurrency, with strong emphasis on safe and flexible message passing. Like functional languages, Scala supports immutable messages. With the addition of uniqueness, it will also support safe mutable messages. Neither kind requires synchronization (outside of that provided by the message queue), or deep copying.

There is still a gap in Scala’s concurrency model–it’s possible to share objects between threads without any protection. It’s up to the programmer to declare shared methods as synchronized–just like in Java; but there is no overall guarantee of data-race freedom. So far, only ownership systems have been able to deliver that guarantee, but I wouldn’t be surprised if Martin Odersky had something else up his sleeve for Scala.

I’d like to thank Philip Haller for reading the draft of this post and providing valuable comments. Philip told me that a new version of the prototype is in the works, which will simplify the system further, both for the programmer and the implementer.
