The design and implementation of a better ThreadLocal<T>
I talked about finding a major issue with ThreadLocal and the impact that it had on long lived and large scale production environments. I’m not sure why ThreadLocal<T> is implemented the way it does, but it seems to me that it was never meant to be used with tens of thousands of instances and thousands of threads. Even then, it seems like the GC pauses issue is something that you wouldn’t expect to see by just reading the code. So we had to do better, and this gives me a chance to do something relatively rare. To talk about a complete feature implementation in detail. I don’t usually get to do this, features are usually far too big for me to talk about in real detail.
I’m also interested in feedback on this post. I usually break them into multiple posts in a series, but I wanted to try putting it all in one location. The downside is that it may be too long / detailed for someone to read in one seating. Please let me know your thinking in the matter, it would be very helpful.
Before we get started, let’s talk about the operating environment and what we are trying to achieve:
- Running on .NET core.
- Need to support tens of thousands of instances (I don’t like it, but fixing that issue is going to take a lot longer).
- No shared state between instances.
- Cost of the ThreadLocal is related to the number of thread values it has, nothing else.
- Should automatically clean up after itself when a thread is closed.
- Should automatically clean up after itself when a ThreadLocal instance is disposed.
- Can access all the values across all threads.
- Play nicely with the GC.
That is quite a list, I have to admit. There are a lot of separate concerns that we have to take into account, but the implementation turned out to be relatively small. First, let’s show the code, and then we can discuss how it answer the requirements.
This shows the LightThreadLocal<T> class, but it is missing the CurrentThreadState, which we’ll discuss in a bit. In terms of the data model, we have a concurrent dictionary, which is indexed by a CurrentThreadState instance which is held in a thread static variable. The code also allows you to define a generator and will create a default value on first access to the thread.
The first design decision is the key for the dictionary, I thought about using Thread.CurrentThread and the thread id.Using the thread id as the key is dangerous, because thread ids may be reused. And that is a case of a nasty^nasty bug. Yes, that is a nasty bug raised to the power of nasty. I can just imagine trying to debug something like that, it would be a nightmare. As for using Thread.CurrentThread, we’ll not have reused instances, so that is fine, but we do need to keep track of additional information for our purposes, so we can’t just reuse the thread instance. Therefor, we created our own class to keep track of the state.
All instances of a LightThreadLocal are going to share the same thread static value. However, that value is going to be kept as small as possible, it’s only purpose is to allow us to index into the shared dictionary. This means that except for the shared thread static state, we have no interaction between different instances of the LightThreadLocal. That means that if we have a lot of such instances, we use a lot less space and won’t degrade performance over time.
I also implemented an explicit disposal of the values if needed, as well as a finalizer. There is some song and dance around the disposal to make sure it plays nicely with concurrent disposal from a thread (see later), but that is pretty much it.
There really isn’t much to do here, right? Except that the real magic happens in the CurrentThreadState.
Not that much magic, huh?
We keep a list of the LightThreadLocal instance that has registered a value for this thread. And we have a finalizer that will be called once the thread is killed. That will go to all the LightThreadLocal instances that used this thread and remove the values registered for this thread. Note that this may run concurrently with the LightThreadLocal.Dispose, so we have to be a bit careful (the careful bit happens in the LightThreadLocal.Dispose).
There is one thing here that deserve attention, though. The WeakReferenceToLightThreadLocal class, here it is with all its glory:
This is basically wrapper to WeakReference that allow us to get a stable hash value even if the reference has been collected. The reason we use that is that we need to reference the LightThreadLocal from the CurrentThreadState. And if we hold a strong reference, that would prevent the LightThreadLocal instance from being collected. It also means that in terms of the complexity of the object graph, we have only forward references with no cycles, cross references, etc. That should be a fairly simple object graph for the GC to walk through, which is the whole point of what I’m trying to do here.
Oh, we also need to support accessing all the values, but that is so trivial I don’t think I need to talk about it. Each LightThreadLocal has its own concurrent dictionary, and we can just access that Values property and we get the right result.
This is currently going through our testing infrastructure, but I thought it is an interesting feature. Small enough to talk about, but complex enough that there are multiple competing requirements that you have to consider and non trivial aspects to work with.
As usual, your feedback is welcome.