After my post on thin locks, several people asked me why not use a futex (fast userspace mutex) instead. They referred to a Linux implementation described in “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux” by Franke and Russell from IBM. The questions prompted me to do some research, and here’s what I’ve found.

Futex is not mutex

Futex doesn’t have Lock/Unlock operations. Instead it’s a primitive that may be used to implement a fast userspace mutex. The two relevant operations are Wait and Wake, and both are kernel calls. Wait takes two arguments: the address of a user-defined integer variable and the expected value of that variable. If the two don’t match at the time of the call, it returns immediately; otherwise the thread is put to sleep. This atomic check-then-block prevents missed wakeups–a common problem with condition variables.

Building a futex mutex

Building a mutex from those primitives is a nontrivial task. It requires writing some user code to test and manipulate a user-level integral variable representing the lock. A typical implementation (see “Futexes are Tricky” by Ulrich Drepper–and also the code at the end of this post) assigns three states to a futex variable:

  • zero means that the lock is not taken,
  • one that it’s taken but no other thread is blocked on it,
  • two means that the lock is taken and in all likelihood there are threads waiting for it.

To lock the mutex, user code tries to CAS the futex variable from zero to one. If successful, the call returns immediately (we have taken the lock).

Unlock tests if the futex variable is equal to one (we are the owner, and nobody is waiting). If true, it sets it to zero. This is the fast-track common-case execution that doesn’t make any futex calls whatsoever.

All other cases end up making futex kernel calls.

Let me summarize this

The fast path is implemented by the client. Only when the client detects contention (the CAS failed), does she escalate to the kernel call. However, before she can safely escalate, she has to wait for the lock to be released, otherwise the thread that holds the lock might do the wrong thing when unlocking. It’s up to the client to make the wait busy (spinlock) or to use futex to conditionally sleep on the futex variable.

Compare this with the thin lock implementation. Here too, the fast path is implemented by the client. Only when the client detects contention, does he inflate the lock (and escalate to the kernel call). But before the lock can be inflated, he has to wait for the lock to be released. In the current implementation this wait is busy.

What’s the main difference between the two?

The thin lock implementation inflates the lock only once–the first time it detects contention. It then switches implementation to use OS locking directly. The futex implementation escalates to the kernel call every time there is contention.

The two algorithms are optimized for different scenarios. Thin lock assumes that once contention happens, it will happen over and over again. This assumption has some empirical support, as the authors of the thin lock paper assert. Conversely, under heavy contention, the futex implementation forces every thread to escalate, which involves (potentially multiple) conditional waits.

Obviously there are many (often hidden) factors that might influence performance under heavy contention, so I wouldn’t automatically assume that the futex is slower. Only actual measurements could provide solid arguments.

Ultimately, what weighs the heaviest against the use of futexes in monitors is how much that would special-case the Linux implementation. Granted, Windows has its own “slim” locks, but they are different enough from futexes to preclude significant refactoring. Plus they are only available on Vista (Google for Slim Reader/Writer (SRW) Locks to find out more).

Futex code

This is my version of the code described in Drepper’s article. CAS stands for atomic compare-and-store. Here, CAS returns true on success–it has seen the expected value and stored the new value in its place. atomic_decrement decrements the variable and returns its previous value.

class FastMutex
{
public:
    FastMutex() : _word(0) {}
    void lock()
    {
        // try to atomically swap 0 -> 1
        if (CAS(&_word, 0, 1))
            return; // success
        // wasn't zero -- somebody held the lock
        do
        {
            // assume lock is still taken, try to make it 2 and wait
            if (_word == 2 || CAS(&_word, 1, 2))
            {
                // let's wait, but only if the value is still 2
                futex_wait(&_word, 2);
            }
            // try (again) assuming the lock is free
        } while (!CAS(&_word, 0, 2));
        // we are here only if transition 0 -> 2 succeeded
    }
    void unlock()
    {
        // we own the lock, so it's either 1 or 2;
        // atomic_decrement returns the previous value
        if (atomic_decrement(&_word) != 1)
        {
            // it was 2
            // important: we are still holding the lock!
            _word = 0; // not any more
            // wake up one thread (no fairness assumed)
            futex_wake(&_word, 1);
        }
    }
private:
    int _word;
};

Don’t design these mutexes at home!

The trickiest part of this algorithm is that a thread that doesn’t own the lock may modify it. Look at the CAS from 1 to 2. This change may become visible to the lock owner in the middle of unlock. Suppose that the lock value was 2 to begin with. The unlocker decrements it to 1. At this point another thread may CAS it back to 2 and call futex_wait.

  • If the wait happens before the unlocker resets _word = 0, the wakeup call is already bound to happen.
  • If, on the other hand, the unlocker manages to reset _word and call futex_wake, we are still fine. The wakeup call won’t be lost, because the other thread will enter futex_wait expecting 2 and finding 0 (or 1, if yet another thread grabbed the lock in the meantime).

And this is just one of the many possible interleavings! A full analysis would require a separate post.
