After my post on thin locks, several people asked me why not use a futex (fast userspace mutex) instead. They referred to a Linux implementation described in “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux” by Franke and Russell from IBM. The questions prompted me to do some research, and here’s what I’ve found.
Futex is not mutex
Futex doesn’t have Lock/Unlock operations. Instead, it’s a primitive that may be used to implement a fast userspace mutex. The two relevant operations are Wait and Wake. Both result in kernel calls, except in one case: Wait takes two arguments, the address of a user-level integer variable and the expected value of that variable. If the values don’t match, the call returns immediately without sleeping. This prevents missed wakeups, a common problem with condition variables.
Building a futex mutex
Building a mutex from these primitives is a nontrivial task. It requires some user code to test and manipulate a user-level integral variable representing the lock. A typical implementation (see “Futexes are Tricky” by Ulrich Drepper, and also the code at the end of this post) assigns three states to the futex variable:
- zero means that the lock is not taken,
- one that it’s taken but no other thread is blocked on it,
- two means that the lock is taken and in all likelihood there are threads waiting for it.
To lock the mutex, user code tries to CAS the futex variable from zero to one. If successful, the call returns immediately (we have taken the lock).
Unlock tests if the futex variable is equal to one (we are the owner, and nobody is waiting). If true, it sets it to zero. This is the fast-track common-case execution that doesn’t make any futex calls whatsoever.
All other cases end up making futex kernel calls.
Let me summarize this
The fast path is implemented by the client. Only when the client detects contention (the CAS failed) does she escalate to the kernel call. However, before she can safely escalate, she has to wait for the lock to be released; otherwise the thread that holds the lock might do the wrong thing when unlocking. It’s up to the client whether to busy-wait (spin) or to use futex to sleep conditionally on the futex variable.
Compare this with the thin lock implementation. Here too, the fast path is implemented by the client. Only when the client detects contention does he inflate the lock (and escalate to the kernel call). But before the lock can be inflated, he has to wait for the lock to be released. In the current implementation this wait is busy.
What’s the main difference between the two?
The thin lock implementation inflates the lock only once–the first time it detects contention. It then switches implementation to use OS locking directly. The futex implementation escalates to the kernel call every time there is contention.
The two algorithms are optimized for different scenarios. The thin lock assumes that once contention happens, it will happen over and over again. This assumption has some empirical support, as the authors of the thin lock paper assert. Conversely, under heavy contention, the futex implementation forces every thread to escalate, which involves (potentially multiple) conditional waits.
Obviously there are many (often hidden) factors that might influence performance under heavy contention, so I wouldn’t automatically assume that the futex is slower. Only actual measurements could provide solid arguments.
Ultimately, what weighs most heavily against the use of futexes in monitors is how much that would special-case the Linux implementation. Granted, Windows has its own “slim” locks, but they are different enough from futexes to preclude significant refactoring. Plus, they are only available on Vista (Google for Slim Reader/Writer (SRW) Locks to find out more).
Futex code
This is my version of the code described in Drepper’s article. CAS stands for atomic compare-and-swap. Here, CAS returns true on success: it has seen the expected value and stored the new value in its place.
class FastMutex {
public:
    FastMutex() : _word(0) {}
    void lock() {
        // try to atomically swap 0 -> 1
        if (CAS(&_word, 0, 1))
            return; // success
        // wasn't zero -- somebody held the lock
        do {
            // assume lock is still taken, try to make it 2 and wait
            if (_word == 2 || CAS(&_word, 1, 2)) {
                // let's wait, but only if the value is still 2
                futex_wait(&_word, 2);
            }
            // try (again) assuming the lock is free
        } while (!CAS(&_word, 0, 2));
        // we are here only if transition 0 -> 2 succeeded
    }
    void unlock() {
        // we own the lock, so it's either 1 or 2
        if (atomic_decrement(&_word) != 1) {
            // it was 2
            // important: we are still holding the lock!
            _word = 0; // not any more
            // wake up one thread (no fairness assumed)
            futex_wake(&_word, 1);
        }
    }
private:
    int _word;
};
Don’t design these mutexes at home!
The trickiest part of this algorithm is that a thread that doesn’t own the lock may modify it. Look at the CAS from 1 to 2. This change may become visible to the lock owner in the middle of unlock. Suppose that the lock value was 2 to begin with. The unlocker decrements it to 1. At this point another thread may CAS it back to 2 and call futex_wait.
- If the wait happens before the unlocker resets _word = 0, the wakeup call is already bound to happen.
- If, on the other hand, the unlocker manages to reset _word and call futex_wake, we are still fine. The wakeup call won’t be lost, because the other thread will enter futex_wait expecting 2 and finding 0 (or 1, if yet another thread grabbed the lock in the meanwhile).
And this is just one of the many possible interleavings! A full analysis would require a separate post.
September 2, 2008 at 10:29 am
You said on a previous comment on your thin locks post that “For the
inflated Fat Lock on Linux we use a pthread lock, which in turn uses a
futex, so we do take advantage of its performance”.
Now, your futex summary is correct: non-contended lock operations are
always resolved purely in userspace, in the locker; contended lock
operations take a syscall.
And, when your thin lock “inflates the lock”, you end up calling a
pthread_mutex_lock() (according to
http://www.dsource.org/projects/phobos/browser/trunk/phobos/internal/monitor.d#L328),
which is implemented using a futex and has those properties.
pthread_mutex_lock() *already is* a “thin lock” because it is
implemented using a futex, and has the _same_ advantage of your thin
locks (very fast for the non-contended case).
Your thin locks, on Linux, are useless because on the non-contended case
they’re exactly like pthread_mutex_lock() (resolved purely in the locker
without any waits), and whenever there is contention you end up calling
pthread_mutex_lock().
Using your thin locks (again, on Linux) provides zero advantage over using
pthread_mutex_lock() directly.
Thanks,
Alberto (albertito@blitiri.com.ar)
PS: having to login to comment is a bummer :S
September 2, 2008 at 11:43 am
What I described is the most efficient use of futex I could find in the literature. I’m sure pthread_mutex_lock is also implemented efficiently, but it can never beat the hand-coded one, if only for the reason I gave in my previous post–locality of reference.
The best a pthread lock could do is to allocate a word in user space and essentially do the same as I described. But unlike the object header word, which is what I’m using, this word is unlikely to be in the processor cache, and will require a separate trip to main memory–hundreds of cycles.
As for having to log in to comment, I’ll try to change this option and see how bad the spam situation is.
September 8, 2008 at 9:42 am
> The best a pthread lock could do is to allocate a word in user space and essentially do the same as I described. But unlike the object header word, which is what I’m using, this word is unlikely to be in the processor cache, and will require a separate trip to main memory–hundreds of cycles.
I see, but what’s the difference with having the pthread_mutex_t directly inside the lock object (instead of a pointer to it)?
That way you get the locality of reference not only for uncontended cases (in which the futex will resolve as fast as your “thin lock”), but also for the contended ones (and actually every case _after_ the first contention, because you will end up calling pthreads).
Thanks,
Alberto
PS: thanks for the change in the comments!
September 9, 2008 at 4:46 pm
You can look up the declaration of pthread_mutex_t in phobos/std/c/linux.d. It’s a pretty large struct. You wouldn’t want to embed it in the header of every Object.
August 31, 2012 at 8:52 am
I have created a C++ class for an application, predominantly coded using the Qt framework, as I need to have mutexes that support priority inheritance (using the pthread_mutexattr_setprotocol() API). Under the covers, the Qt QMutex class uses a futex to implement the mutex under Linux. Is there any API to set the futex to support priority inheritance, or is the only solution to use the pthread_mutex?
August 31, 2012 at 9:18 am
On another topic, but one that has driven the discussion on mutexes and priority inversion: in my application, which consists of various threads of differing priorities, I need to implement a trace logger, which essentially means each thread will have the capability of writing 14 bytes to a preallocated heap buffer whose size is typically 100 MB. The writing is therefore of the multiple-producers nature, but with no concurrent readers. The data can be generated at a rate of up to once per millisecond, and I want this logging to be as lightweight as possible. What I need is essentially an atomic write of 14 bytes coupled with a bump of the write index. Under Qt, there are two classes, QAtomicInt and QAtomicPointer, that contain the typical atomic constructs used in such algorithms (CAS, equivalent to TestAndSet).
I am hoping to find a lock free implementation to avoid any potential for priority inversion, deadlocks, livelocks, convoying, etc.
This is shown as the following pseudocode.
class Buffer {
Data data[BUFFER_SIZE];
QAtomicInt index;
public:
void appendData(const Data &d) {
uint idx = index.fetchAndAddRelaxed(1);
data[idx] = d;
}
};
Are you aware of any algo (using CAS statement or equivalent) that would allow for a lock free implementation of what is essentially an array write of 14 bytes to a pre-allocated buffer with an emphasis on speed?