What is there to reference counting that is not obvious? In any language that supports deterministic destruction and the overloading of the copy constructor and the assignment operator it should be trivial. Or so I thought, until I decided to implement a simple ref-counted thread handle in D. Two problems popped up:
- How does reference counting interact with garbage collection?
- How to avoid data races in a multithreaded environment?
In purely garbage-collected languages, like Java, you don’t implement reference counting, period. Which is pretty bad, if you ask me. GC is great at managing memory, but not so good at managing other resources. When a program runs out of memory, it forces a collection and reclaims unused memory. When it runs out of, say, system thread handles, it doesn’t reclaim unused handles–it just dies. You can’t use GC to manage system handles. So, as far as system resources go, Java forces the programmer to use the moral equivalent of C’s malloc and free. The programmer must free the resources explicitly.
In C++ you have std::shared_ptr for all your reference-counting needs, but you don’t have garbage collection for memory–at least not yet. (There is also Microsoft’s C++/CLI, which mixes the two systems.)
D offers the best of both worlds: GC and deterministic destruction. So let’s use GC to manage memory and reference counting (or other policies, like uniqueness) to manage other limited resources.
First attempt
The key to reference counting is to have a “value” type, like the shared_ptr in C++, that can be cloned and passed between functions at will. Internally this value type must have access to a shared chunk of memory that contains the reference count. In shared_ptr this chunk is a separately allocated integral type–the counter. The important thing is that all clones of the same resource share the same counter. The counter’s value reflects how many clones there are. Copy constructors and assignment operators take care of keeping the count exact. When the count goes to zero–that is, when the last copy of the resource goes out of scope–the resource is automatically freed (for instance, by calling CloseHandle).
In my first attempt, I decided that the memory allocated for the counter should be garbage-collected. After all, the important thing is to release the handle–the memory will take care of itself.
struct RcHandle {
    shared Counter _counter; // GC'd shared Counter object
    HANDLE _h;

    ~this() { // destructor
        if (_counter.dec() == 0) // access the counter
            CloseHandle(_h);
    }
    // methods that keep the counter up to date
}
RcHandle is a struct, which is a value type in D. Counter is a class, which is a reference type; so _counter really hides a pointer to shared memory.
A few tests later I got a nasty surprise. My program faulted while trying to access an already deallocated counter. How did that happen? How could the garbage collector deallocate my counter if I still had a reference to it?
Here’s what I did in my test: I embedded the ref-counted handle inside another garbage-collected object:
class Embedder { // GC'd class object
    RcHandle _rc;
}
When the time came to collect that object (which happened after the program ran to completion), its finalizer was called. Whenever an object contains fields that have non-trivial destructors, the compiler generates a finalizer that calls the destructors of those embedded objects–_rc in this case. The destructor of the ref-counted handle checks the reference count stored in the counter. Unfortunately the counter didn’t exist anymore. Hours of debugging later I had the answer.
What happened is that the garbage collector had two objects on its list: the embedder and the counter. It just so happened that the collector decided to collect those two objects in reverse order: the counter first, then the embedding object. So, by the time it got to the finalizer of the embedding object, the counter was gone!
What I discovered (with the help of other members of the D team who were involved in the discussion) was that there are some limitations on mixing garbage collection with deterministic destruction. There is a general rule:
An object’s destructor must not access any garbage-collected objects embedded in it.
Since the destructor of the ref-counted handle must have access to the counter, the counter must not be garbage-collectible. That means only one thing: it has to be allocated using malloc and explicitly deallocated using free. Which brings us to the second problem.
Concurrency
What can be simpler than an atomic reference count? On most processors you can atomically increment and decrement a memory location. You can even decrement and test the value in one uninterruptible operation. Problem solved! Or is it?
There is one tiny complication–the location that you are atomically modifying might disappear. I know, this is totally counter-intuitive. After all, the management of the counter follows the simple rule: the last to leave the room turns off the light. If the destructor of RcHandle sees the reference count going from one to zero, it knows that no other RcHandle has access to it, and it can safely free the counter. Who can argue with cold logic?
Here’s the troubling scenario: RcHandle is embedded in an object that is visible from two threads:
class Embedder {
    RcHandle _rcH;
}
shared Embedder emb;
Thread 1 tries to overwrite the handle:
RcHandle myHandle1;
emb._rcH = myHandle1;
while Thread 2 tries to copy the same handle to a local variable:
RcHandle myHandle2 = emb._rcH;
Consider the following interleaving:
1. T2: Load the address of the _counter embedded in _rcH.
2. T1: Swap emb._rcH with myHandle1.
3. T1: Decrement the counter that was embedded in _rcH. If it’s zero (and it is, in this example), free the counter.
4. T2: Increment the _counter. Oops! This memory location has just been freed.
The snag is that there is a window between T2 reading the pointer in (1), and incrementing the location it’s pointing to in (4). Within that window, the reference count does not match the actual number of clients having access to the pointer. If T1 happens to do its ugly deed of freeing the counter within that window, the race may turn out deadly. (This problem has been known for some time and there were various proposals to fix it, for instance using DCAS, as in this paper on Lock-Free Reference Counting.)
Should we worry? After all, the C++ shared_ptr exposes the same race and nobody is crying havoc. It turns out that it all boils down to the responsibilities of the shared object (and I’m grateful to Andrei for pointing it out).
A shared object should not willy-nilly expose its implementation to clients
If the clients of Embedder want access to the handle, they should call a synchronized method. Here’s the correct, race-free implementation of the Embedder in the scenario I just described.
class Embedder {
private:
    RcHandle _rcH;
public:
    synchronized RcHandle GetHandle() const {
        return _rcH;
    }
    synchronized void SetHandle(RcHandle h) {
        _rcH = h;
    }
    ...
}
The method GetHandle copies _rcH and increments its count under the Embedder‘s lock. Another thread calling SetHandle has no chance of interleaving with that action, because it is forced to use the same lock. D2 actually enforces this kind of protection for shared objects, so my original example wouldn’t even compile.
You might be thinking right now that all this is baloney because I’m trying to fit a square peg into a round hole. I’m imposing value semantics on a non-atomic object, and you cannot atomically overwrite a non-atomic object. By this logic, though, you could convince yourself that a double cannot have value semantics (on most common architectures doubles are too large to be atomic). And yet you can safely pass doubles between threads and assign them to each other (which is the same as overwriting). It’s only when you embed a double inside a shared object that you have to protect it from concurrent access. And it’s not the double protecting itself–it’s the shared embedder that is responsible for synchronization. It’s exactly the same with RcHandle.
Conclusions
Just when you thought you knew everything about reference counting you discover something you hadn’t thought about. The take-home message is that mixing garbage collection with deterministic destruction is not a trivial matter, and that a race-free reference-counted object is vulnerable to races when embedded in another shared object. Something to keep in mind when programming in D or in C++.
August 19, 2009 at 1:47 pm
Very interesting topic, even for a C++ guy like me. Nonetheless, I think I should read more about D; it seems like an alternative that may surpass C++ sometime in the future.
BTW, small typo: shared_ptr is still within the boost namespace, not std (where it definitely belongs).
August 19, 2009 at 2:31 pm
I’m already living in the land of 0x, where shared_ptr is in the standard library 😉
August 19, 2009 at 5:02 pm
The issue with D looks pretty much to be lack of proper topologically sorted finalization.
Garbage collectors that want to support usable finalization must do it in topological ordering, otherwise things do go wrong. This is what both the JVM and the CLR do.
August 19, 2009 at 10:25 pm
I’m quite far from GC environments, but is it a problem to launch GC whenever system handle allocation fails and try again after GC completes? Assuming there is a finalizer for system handle wrapping object that releases the handle to the system, of course.
August 19, 2009 at 11:22 pm
std::shared_ptr directly supports strongly thread-safe reference counting. One thread may execute atomic_load(sh_ptr), while another – atomic_exchange(sh_ptr)/atomic_compare_exchange(sh_ptr).
See for details:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2632.html
August 19, 2009 at 11:35 pm
@Arseny asked “..is it a problem to launch GC whenever system handle allocation fails and try again after GC completes?”
If you add this rule explicitly for system handles it will work for system handles but not for DB connections, network sockets or any other number of unmanaged resources. Keep in mind that the list of unmanaged resources is unbounded since any programmer can add a new type of resource which is unknown to the language.
August 20, 2009 at 1:25 am
I think I’m just reiterating what Rodrigo said, but it seems almost nonsensical to me that _counter would go away before _rc is finalized. How is D determining when to deallocate things?
August 20, 2009 at 8:43 am
@Motti:
Well, uh, you’ll need two language features:
a. ability to define your own finalizers
b. ability to launch gc from the language
both should not be a problem; afterwards you do something like
function thread.new()
    local tid = w32.CreateThread(args);
    if tid == w32.InvalidHandle then
        gc.collect();
        tid = w32.CreateThread(args);
        if tid == w32.InvalidHandle then
            return nil;
        end
    end
    local result = {__id = tid, etc.};
    setmetatable(result, {__gc =
        function (self)
            w32.CloseHandle(self.__id);
        end});
    return result;
end
This is supposed to be pseudo-Lua, I hope it makes sense. It’s so simple that I’m wondering if I miss something here…
August 20, 2009 at 11:20 am
I asked Walter why he didn’t implement topological sorting for finalizers–it was performance reasons. He is willing to revisit it though in face of the problems I described.
August 20, 2009 at 6:23 pm
Java and C# _don’t_ finalize objects in topological order. CPython does.
August 20, 2009 at 10:56 pm
Finalizing objects in topological order is impossible due to cycles.
Also, CPython uses a reference counting GC, and does perform non-topological finalization when cycles are detected (AFAIK).
August 21, 2009 at 4:56 am
I am a little confused here. You separate reference counting and GC while traditionally reference counting is a way of doing GC. One of the earliest ways in fact.
August 21, 2009 at 5:13 am
CPython breaks cycles automatically, discovering more objects to finalize. The only unsolvable case is finalizable objects in a cycle.
August 21, 2009 at 10:44 am
Robert, sure, reference counting could be used to do GC. The resource that is managed by GC is memory, and it’s managed in a non-deterministic way–a collection cycle happens at unpredictable times.
I am talking about managing other resources, like system handles, in a deterministic way, i.e., as soon as the ref-count goes to zero, the handle must be closed. You don’t want to wait for the next GC cycle to free your handles.
August 23, 2009 at 1:33 am
> The programmer must free the resources explicitly.
Your blog article example shows the difficulty of good resource management.
In fact, I have tried to raise this as a feature request to Walter, without much success. See : http://d.puremagic.com/issues/show_bug.cgi?id=2757
These other resources currently need manual management: OpenGL objects (textures/shader programs/display lists), SDL surfaces, hardware sound buffers, mutex locks, file handles, any object with a non-trivial destructor, any object that contains or manages one of the above.
I would be interested in your thoughts as to whether D should support automatic resource management for all resources. Perhaps in D3?
August 23, 2009 at 9:40 am
Re: Nick B
I know it’s not a general solution, but I’ve been managing GPU resources quite successfully using D’s GC. I’ve achieved this by adding a GC.collect + retry clause to my error code handler. (Though my application doesn’t have a hard real time constraint)
August 23, 2009 at 4:53 pm
AFAIK CPython just collects the memory when it finds a cycle, but doesn’t call any finalizers. I wonder how Java handles cycles when doing finalization (*if* it supports finalization; I know the first versions didn’t).
August 23, 2009 at 7:09 pm
CPython doesn’t call finalizers on objects that participate on a cycle, but it does on objects that just hang to another that is in the cycle.
Java and .Net call finalizers in unspecified order, even in parallel. However, even if already finalized themselves, objects reachable from finalizable objects can be manipulated normally. That is, the memory the object is not reclaimed when the object is finalized.
August 28, 2009 at 5:25 am
I came across this discussion of reference counting/garbage collection issues from the Microsoft CLR team’s perspective a while ago, and thought it might be of interest. They looked at the tensions between GC and reference counting in the same language quite a bit, because VB was traditionally reference counted, but they wanted cycle collection, and so there was a fair bit of work done to try integrate both mechanisms well. However, it caused all sorts of problems like those you’ve found, such as reference counted types not working well as members of garbage collected types or vice versa.
A copy of the discussion is here:
http://blogs.msdn.com/brada/articles/371015.aspx
August 28, 2009 at 11:46 am
Doug, thank you for an excellent link. It’s a very comprehensive discussion. Brad’s post seems to confirm that composition of GC objects with RC objects is one of the major problems. It would have to be solved if RC became part of the language. However, a library solution, as in D, might simply punt on it. The programmer should be aware that if he or she embeds a refcounted object in a GC object, the GC object won’t miraculously acquire deterministic finalization. A refcounted array (template) could be defined in the library.
It helps that D distinguishes between classes, which all derive from Object, and structs, which don’t. Objects have reference semantics and are GC’d, structs have value semantics and are scoped, which lends them to RAII-type management. That bypasses a lot of problems mentioned in Brad’s discussion.