Software Transactional Memory



I’ve been dealing with concurrency for many years in the context of C++ and D (see my presentation about Software Transactional Memory in D). I worked for a startup, Corensic, that made an ingenious tool called Jinx for detecting data races using a lightweight hypervisor. I recorded a series of screencasts teaching concurrency to C++ programmers. If you follow these screencasts you might realize that I was strongly favoring functional approach to concurrency, promoting immutability and pure functions. I even showed how non-functional looking code leads to data races that could be detected by (now defunct) Jinx. Essentially I was teaching C++ programmers how to imitate Haskell.

Except that C++ could only support a small subset of Haskell functionality and provide no guarantees against even the most common concurrency errors.

It’s unfortunate that most programmers haven’t seen what Haskell can do with concurrency. There is a natural barrier to learning a new language, especially one that has the reputation of being mind boggling. A lot of people are turned off simply by unfamiliar syntax. They can’t get over the fact that in Haskell a function call uses no parentheses or commas. What in C++ looks like

f(x, y)

in Haskell is written as:

f x y

Other people dread the word “monad,” which in Haskell is essentially a tool for embedding imperative code inside functional code.

Why is Haskell syntax the way it is? Couldn’t it have been more C- or Java- like? Well, look no farther than Scala. Because of Java-like syntax the most basic functional trick, currying a function, requires a Scala programmer to anticipate this possibility in the function definition (using special syntax: multiple pairs of parentheses). If you have no access to the source code for a function, you can’t curry it (at least not without even more syntactic gymnastics).

If you managed to wade through this post so far, you are probably not faint of heart and you will accept the challenge of reading an excellent article by Simon Peyton Jones, Beautiful Concurrency that does a great job of explaining to the uninitiated Haskell’s beautiful approach to concurrency. It’s not a new article, but it has been adapted to the new format of the School of Haskell by Yours Truly, which means you’ll be able to run code examples embedded in it. If you feel adventurous, you might even edit them and see how the results change.


It was my first experience working with the C++ Standardization Committee in a subgroup dedicated to concurrency and parallelism. I won’t bore you with details — they will be available at the committee web site. I’ll share my overall impressions and then focus on specific areas where I have strong opinions.

Being an outsider I considered the C++ Standard the ultimate word. If I had problems interpreting the letter of the Standard I would ask one of the committee members for interpretation and assume that I would get the same answer from any of them. Reality turned out to be more complex than that. C++ Standard is full of controversial topics. Some of those controversies could not be resolved in time, so often the wording of the Standard is intentionally vague. Some features were not ready for inclusion so little stubs were inserted into the document that sometimes don’t make much sense in isolation.

One such example is the intentional vagueness and the lack of definition of thread of execution. Not only is a thread undefined, some of the semantics are expressed using the “as if” language. In particular the thingie started by std::async is supposed to behave “as if” it were run in a separate thread of execution (whatever that means). At some point I had a long email exchange about it with Anthony Williams and Hans Boehm that resulted in a blog post. I thought the things were settled until I was alerted to the fact that Microsoft’s interpretation of the Standard was slightly different, and their “as if” didn’t include thread_local variables, at least not in the beta of the new Visual C++.

Here’s the problem: std::async was introduced in the Standard as a compromise between the idea that it’s just syntactic sugar over std::thread creation, and the idea that it’s an opening for task-based parallelism. In fact when I first tried std::async using Anthony Williams’ Just Thread library I expected it to run on a thread pool complete with work stealing and thread reuse. Not so, argued Anthony and Hans, pointing among others things to the problem of managing thread-local variables — are they supposed to be local with respect to the underlying OS thread, or to a smaller units of execution, the tasks?. If multiple tasks are reusing the same thread should they see fresh versions of thread_local variables? When should thread-local variables be destroyed if the lifetime of pool threads is theoretically infinite?

Now, Microsoft has its implementation of task-based concurrency in the form of PPL (Parallel Pattern Library). Intel has TBB (Threading Building Blocks), which is a superset of PPL and it also runs on Linux. I can understand the eagerness of those companies to bend the (intentionally vague) rules and make these libraries accessible through std::async, especially if they can dramatically improve performance.

I’d be the first to vote for this proposal, except for a few unsolved problems.

First of all, Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion. Pure async case should be indistinguishable from direct creation of std::thread. Any attempt at using a thread pool behind the scenes could result in deadlocks. Essentially, the programmer must have a guarantee that all the tasks will be allowed to run in parallel no matter how many there are. Just imagine a bunch of tasks trying to communicate with each other back and forth. If thread creation is throttled down after N of them start and possibly block waiting for responses from the rest of them, they might block forever. Thread pools usually have the ability to create new threads on demand, but it’s never obvious when a new thread must be created. Even if the pool could detect all the threads that are blocked, it couldn’t detect those that are busy-spinning. This is why std::async with launch_policy::async must always create, or at least immediately steal, a thread.

The situation is different with std::async called with the default launch policy (the bitwise OR of launch_policy::async and launch_policy::deferred). In that case the runtime does not guarantee that all tasks will be able to run in parallel. In fact the programmer must be prepared for the possibility that all tasks run serially in the context of the parent thread (more specifically, in the context of the thread that calls future::get). Here the problem with using a thread pool is different. It has to do with the lifetimes of thread_local variables that I mentioned before. This is a serious problem and the semantics defined by the current Standard are far from natural. As it stands, a task created using the default launch policy must either run on a completely new thread, in which case that thread defines the lifetimes of thread_local variables; or it must be deferred, in which case it shares thread_local variables with its parent (again, strictly speaking, with the caller of future::get — if the future is passed to a different thread). This behavior might seem confusing, but at least it’s well defined.

Here’s how Herb Sutter proposed to solve the problem of making tasks run in a thread pool: Disallow non-POD thread_locals altogether. The argument was that nobody has implemented non-POD thread locals anyway, so nobody will suffer. Anthony Williams’ and Boost implementations were dismissed as library-based.

This seems to me like a violation of the spirit of C++, but there is a precedent for it: atomic variables. You can declare a POD (Plain Old Data, including simple structs) as atomic and, if it fits inside a hardware supported atomic word, it will become a lock-free atomic; otherwise a lock will be provided free of charge (well, you’ll pay for it with performance, but that’s a different story). But you can’t define a non-POD as atomic!

A quick straw poll showed that the subcommittee was equally split between those who were willing to discuss this change and those who weren’t. It seems though that Microsoft will go ahead with its PPL implementation ignoring the problems with thread_local (and also with DLL_THREAD_DETACH handlers I mentioned in my blog). So you might want to restrict the use of non-POD thread-local variables for the time being.

This discussion had a larger context: The proposal to introduce thread pools into the language/library as first class objects. Google’s Jeffrey Yaskin described their Executor library, which combines thread pools with work-stealing queues and schedulers. PPL has a similar construct called task group. In this new context, std::async would only provide an interface to a global default thread-pool/executor/task-group. The introduction of first-class thread pools would take away the pressure to modify the semantics of std::async. If you cared about the way your tasks are scheduled, you could spawn them using a dedicated thread-pool object. Having an explicit object representing a set of tasks would also allow collective operations such as wait-for-all or cancel.

Which brings me to another topic: composable futures. I wrote a blog post some time ago, Broken Promises: C++0x Futures, in which I lamented the lack of composability of futures. I followed it with another blog, Futures Done Right, proposing a solution. So I was very happy to learn about a new proposal to fix C++ futures. The proposal came from an unexpected source — C#.

The newest addition to C# is support for asynchronous interfaces (somewhat similar to Boost::ASIO). This is a hot topic at Microsoft because the new Windows 8 runtime is based on asynchronous API — any call that might take more than 50ms is implemented as an asynchronous API. Of course you can program to asynchronous API by writing completion handlers, but it’s a very tedious and error-prone method. Microsoft’s Mads Torgersen described how C# offers several layers of support for asynchronous programming.

But what caught my interest was how C# deals with composition of futures (they call them task objects). They have the analog of an aggregate join called WhenAll and an equivalent of “select” called WhenAny. However these combinators do not block; instead they return new futures. There is another important combinator, ContinueWith. You give it a function (usually a lambda) that will be called when the task completes. And again, ContinueWith doesn’t block — it returns another future, which may be composed with other futures, and so on. This is exactly what makes C# futures composable and, hopefully, C++ will adopt a similar approach.

Of course there is much more to the async proposal, and I wish I had more time to talk about it; but the composable integration of asynchronicity with task-based concurrency is in my eyes a perfect example of thoughtful design.

I noticed that there seems to be a problem with C++’s aversion to generalizations (I might be slightly biased having studied Haskell with its love for generalizations). Problems are often treated in separation, and specific solutions are provided for each, sometimes without a serious attempt at generalization. Case in point: cancellation of tasks. A very specialized solution involving cancellation tokens was proposed. You get opaque tokens from a factory, you pass them to tasks (either explicitly or by lambda capture), and the tasks are responsible for polling the tokens and performing appropriate cancellation actions. But this is an example of an asynchronous Boolean channel. Instead of defining channels, C++ is considering a special-purpose one-shot solution (unless there is a volunteer willing who will write a channels proposal). By the way, futures can be also viewed as channels, so this generalization might go a long way.

Another candidate for generalization was the Intel vectorization proposal presented by Robert Geva. Of course it would be great to support the use of vector processors in C++. But you have to see it in the larger context of data-driven parallelism. It doesn’t make sense to have separate solutions for vector processors, multicores running in SIMD mode, and GPGPUs. What’s needed is general support for data parallelism that allows multiple hardware-specific specializations. Hopefully a more general proposal will materialize.

The C++ Standards Committee is doing a great job, considering all the limitations it’s facing. The committee will not add anything to the language unless there are volunteers who will write proposals and demonstrate working implementations. Remember, you too can contribute to the future of C++.


Traditional lock-based approach to concurrent programming is error prone, hard to debug, and almost impossible to scale.

In my previous post I gave a preview of how concurrent programming might evolve in the future, and the topic of Software Transactional Memory (STM) came up quite naturally. STM has the potential to replace locking with an easy to use, virtually fool-proof, scalable paradigm for concurrent access.

It’s true that STM has its problems; performance and integration with existing languages being the serious ones. The hope is that performance can be improved, especially if there’s some support in hardware, and that a new batch of concurrent languages will incorporate STM more smoothly.

In this post I won’t be elaborating on the superiority of STM over other paradigms but rather concentrate on how it works. I hope my explanation will be more approachable than academic papers.

STM in a Nutshell

Let’s analyze a very simple example, a singleton pattern in C++ (I chose C++ because it exposes the use of pointers):

Foo * getFoo() {
    static Foo * pFoo = 0;
    if (pFoo == 0) // <- read
        pFoo = new Foo(); // <- write
    return pFoo;
}

As you might know, this code doesn’t work correctly in a multithreaded environment. Let’s have a look at why, but from a slightly unusual angle. The statement:

if (pFoo == 0)

executes a memory read of the variable pFoo. The value that’s been read is usually stored in a register–in other words, is cached–at least until it’s compared with zero. Suppose that we read zero (a null pointer). We then proceed with the allocation and construction of the Foo object. We update the variable pFoo (a memory write) and return it to the caller.

The important thing is that the write to pFoo is only correct if the condition pFoo == 0 is maintained up to the moment when the new value is stored (we don’t even think of such subtleties in sequential programming).

But that condition may be broken by another thread storing its own creation in pFoo. It’s a classic data race, and its consequences may vary from a simple memory leak to a deadly memory fault (that might happen on some processors if the object Foo manages its own internal pointers). In general, it’s the gaps between logically related reads and writes that leave room for other threads to break our assumptions. If we could only concentrate those reads and writes into one atomic action, we’d be safe.

Let me propose a crazy solution: Before performing the write to pFoo, why don’t we re-check, or validate, our original read (assume that it’s still in the register). If the value in memory is still equal to the value in the register, we proceed with the write; otherwise we back off and try again.

Wait a moment! Aren’t we just postponing the problem? What about another thread messing with pFoo right between the read validation and the write? True, that is a problem, and we have to figure out a way to make the sequence read-validation/final-write atomic. But the beauty of this crazy approach is that it can be made universal. The code that does generic read verification may be compiled into the language runtime. It may be implemented once and for all, thoroughly tested, and used whenever concurrent access is needed.

All the programmer has to do is to inform the compiler that a logically related group of reads and writes must be done atomically; for instance, by enclosing the relevant code in the atomic block (not actual C++!):

Foo * getFoo() {
   static Foo * pFoo = 0;
   atomic {
      if (pFoo != 0) // read
           pFoo = new Foo(); // write(s), possibly reads
   }
   return pFoo;
}

Here’s what happens inside the atomic block (there are many variations of the algorithm but the idea is the same).

  • First, a transaction is started–the runtime allocates a data structure containing two logs.
  • Every memory read inside the atomic block is logged in a log called the read set.
  • Every memory write, instead of going directly to memory, is written into the write set (there are also versions of STM that let the writes go through, to be undone if the transaction aborts).
  • At the end of the atomic block the system attempts to commit the transaction. It does that by first verifying the read log and then performing the writes. (There is a very fine-grained system of locking in play, which I’ll describe shortly.)
  • If the read verification discovers a mismatch, the transaction is aborted and repeated from scratch, until it succeeds.

Some Details

There is one particular implementation of STM that’s been most widely accepted: Transactional Locking II (TL2), which I will summarize here.

Memory Layout

Imagine that for every single word of main memory you also have a separate data structure that stores the version number and a write lock for that word. How this can be done efficiently is an important implementation detail that I will explain later. For each memory word, its version number is initially set to zero and its lock is open.

Transaction

Now let’s start a transaction. We allocate a thread-local transaction object that has a read set and a write set, initially empty. We assign a version number to the transaction by (atomically) sampling the global “version clock” (it’s called a clock because it can only move forward). This will be our transaction’s read version number.

The compiler has conveniently instrumented all reads and writes inside the scope of the transaction (that is, in the atomic block). Each read adds an entry to our read set, storing the address of the word being read. Each write adds an entry to our write set, storing the address and the value to be written. The value in main memory is not changed at this point–the write is virtual.

Actually, before every read, we have to access our write set in case we have already written to that location. Moreover, we need to check that the location is not locked by another committing transaction, and that its version is less or equal to our transaction’s read version. This is why we need all those per-location locks and version numbers. If the location is locked or if its version number is larger than our current version, we abort the transaction and repeat the whole process again.

These checks are costly, and they are optimized to just a few instructions in the common case. So, for instance, the write-set check is implemented using a Bloom filter. The lock and version number are sampled in one load, the lock bit is checked, the address stored in the read set, and then the version number and lock re-checked once more. On some processors, read barriers are required when accessing the locks.

In Fig 1 you see the transaction corresponding to the singleton example right before committing. We have read the value of pFoo and logged the read (the address of pFoo) in the read set. We have checked that the version of pFoo, 6, was less than the version of our transaction, 8. We have also allocated a new Foo in memory. During the construction of Foo two fields were (virtually) written and they have been saved in the write set together with the new value of pFoo (the pointer to the new Foo).

STM before commit

Fig 1. An STM transaction before commit. Every location in main memory has a corresponding version/lock entry. Both the read set and the write set have an entry for pFoo. The write set also stores the new value of pFoo, which is a pointer to the newly allocated Foo.


At commit time, we lock all our write locations (there are three of those) using the per-word locks. Since we might be racing with other transactions, we should be careful not to cause deadlocks. The simplest precaution is to use bounded spinlocks. If we fail to acquire any lock within a set timeout, we abort the transaction.

Committing the Transaction

Sequencing and Validation

During read set validation, we check the lock and the version number of each location that’s been read by our transaction. The problem is that this operation is not atomic (we don’t lock read locations). While we are validating, other transactions might modify the already validated locations. I’ll show you that it really doesn’t matter in the big scheme of things.

In the big scheme of things, there is a sequence of transactions, each making an atomic update. So our program’s memory switches from one consistent state to another. When, precisely, this jump occurs is called the “sequence point.” As I mentioned, the sequence point for each transaction is reached after the write set is locked but before the read set is validated.

Now imagine that we have two read locations, r1 and r2. We have validated r1 and, while we’re validating r2, a stray transaction commits a write to r1. What can be said about the sequence points of the two transactions? Our sequence point was set before we stated read validation. Since we have successfully validated r1, it hadn’t been locked by the stray transaction. It means that the other transaction’s sequence point must have been set after ours; and we don’t care about “future” transactions. All we are concerned about at this point is whether an earlier transaction, with an earlier sequence point (which is, however, greater than our read version), hasn’t overwritten those locations.

At this point we atomically increment and fetch the global version clock–this will be our transaction’s write version, not to be confused with the earlier read version. We haven’t committed the transaction yet but, if we commit it, this will be its official commit point, a.k.a. sequence point.

Now we can validate the read set at our leisure. We make sure that the read locations are not locked by another thread, and that their version numbers are still less or equal to our transaction’s read version. This is a slightly tricky point–see the Sidebar.

In our example, the locks for pFoo and for the two fields inside the newly allocated Foo will be taken, the clock incremented to 9, and the read set validated.

The read set is small–just the location of the variable pFoo. It is not locked by another thread (although it’s locked by us, because it’s also present in our write set), and the version number is still less than 8.

The final step is to commit the write set. Each location is updated and then unlocked. We also set the version number for each modified location to the write version of our transaction (the value of the version clock at our sequence point). In our case, we change the version number of pFoo from 6 to 9.

You might be wondering why we verify each read at least twice–once while executing the instrumented client code, and once during validation. After all, if we read the wrong version of the variable, the transaction will fail anyway during verification. The problem is that, after reading a bunch of inconsistent values, the client code might get into all kinds of trouble, including infinite loops or access violations. This used to be a major concern before TL2.

Reasons to Abort

Let’s consider what can go wrong in our example that would force us to abort (and retry).

  1. During the first read of pFoo:
    • We find it locked. That could happen if another transaction is just committing its write to pFoo.
    • It’s version number is greater than our transaction’s read version number. That could happen if another transaction had just committed its own write.
  2. During pre-commit locking, we find pFoo locked by another transaction. (We let that transaction proceed.)
  3. During read validation of pFoo:
    • The location is locked by another transaction. (This can’t happen in our case because we have locked it ourselves.)
    • Its version number is greater than our read version number. This could happen if another transaction committed its write to pFoo after we have read it last.

You can easily convince yourself that in each case it really makes sense to abort the transaction. Moreover, when we retry the transaction, pFoo will have already been initialized by another thread. In that case the transaction will just breeze through, almost as fast as the lock-based implementation.

Optimizations

Note that during our transaction other non-conflicting transactions could have committed successfully. Each such transaction increments the version lock. This is why our write version could be arbitrarily larger than our read version. On the other hand, if the write version is just one unit larger that the read version, we know that we were alone, and we can skip the read validation altogether. In fact this is what would have happened in our example where the read version was 8 and the write version was 9.

Another great optimization is read-only transactions. Transactions that don’t perform any writes don’t have to log the reads. It’s enough that they perform the lock/version check at the point of each read. If all reads are successful, the transaction is done. It has seen a consistent state of memory.

It’s worth noting that atomic operations in TL2 are non-blocking. They are usually implemented using the CAS instruction (Compare and Swap). The locking of individual memory words is done using bounded spinlocks.

Conserving Memory

Let’s see how we can implement all the locks and version numbers without doubling or tripling our computer’s memory. To begin with, we can combine the lock with the version number in one word–a versioned lock. We only need one bit for a spinlock and we can stick it at the bottom of the version number, as long as we only use even version numbers.

The second trick is to use a reasonably sized table of versioned locks and hash all memory locations into that table. Since there will be many locations that map into the same lock, spurious aborts may happen. Notice however that spurious aborts have no impact on the semantics of the program. Aborted transactions will be retried until they commit. In practice such spurious conflicts happen reasonably rarely.

Performance

With the usual caveat that there are lies, damned lies, and benchmarks; the performance of STM clocks at about half the performance of hand-crafted locks (this is more or less the consensus number–see the interview with Peyton-Jones and Harris for confirmation). Which is not bad, considering all the instrumentation that goes into it. But if you have a program that spends 90% of its time concurrently inserting and removing items from a red-black tree, you might consider hiring a team of highly experienced programmers and testers to create a faster, lock-based, or even lock-free, solution. But take into account an important fact: STM is very fine grained. For instance, when you’re inserting an item into a tree, the STM transaction will only lock the nodes that you are actually modifying. STM will easily beat a solution that uses one global lock per whole tree. Per-node manual locking is hard to implement correctly because of the risk of deadlocks.

By its optimistic nature, STM performs best under low contention scenarios where the vast majority of transactions commit on the first try.

What’s Next?

Will performance problems interfere with a wider acceptance of STM? There is an interesting analogy between garbage collection (GC) and STM put forward by Dan Grossman. Historically, performance concerns were a major obstacle in the acceptance of GC until a general-purpose language, Java, adopted it as the memory management tool. If the analogy holds, the same will happen to STM and shared-memory concurrency.

STM has a permanent spot, and probably the best implementation, in Haskell (doesn’t it remind you of the relationship between Lisp and GC?). There’s been some STM activity in C++, Java, and other languages (see Bibliography) but, without language support, STM doesn’t offer enough bang for the buck in terms of safety and ease of use. That’s why I’m really excited about the use of STM in a batch of new high-performance languages, in particular in Chapel. Will Chapel become to STM what Java became to GC? Only time will tell.

I’ve been following the progress of the Chapel’s implementation of STM and talked to some of the developers. In distributed environments, STM can perform outstandingly. That’s because the overhead of STM read/write logging (which is done locally) is dwarfed by the usual distributed communication costs. The trick is to piggyback STM communications on top of remote reads and writes, which have to happen anyway. Chapel also has a chance to implement a type system that makes STM safer to use. I hope to discuss those options in a future post.

Bibliography

  1. Herlihy and Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures
  2. Dice, Shalev, Shavit, Transactional Locking II
  3. Interview with Simon Peyton-Jones and Tim Harris about STM
  4. Harris, Fraser, Language Support for Lightweight Transactions–STM in Java
  5. TinySTM, a public domain implementation of STM for C and C++
  6. Deuce, a Java implementation of STM
  7. Bronson, Chafi, Olukotun, CCSTM: A Library-Based STM for Scala
  8. Harris, Marlow, Peyton Jones, Herlihy, Composable Memory Transactions–STM in Haskell.
  9. Sridharan, Vetter, Kogge, Scalable Software Transactional Memory for Global Address Space Architectures–STM in Chapel