This tale starts with a customer opening a support ticket: their application was struggling to keep up with the load during peak hours. Those sorts of tickets are taken very seriously, so we sat down to understand exactly what was going on. The scenario was simple: under load, the customer started getting errors similar to this:

Raven.Client.Exceptions.RavenException: System.TimeoutException: The operation has timed out.
   at Raven.Server.Rachis.RachisConsensus.ThrowTimeoutException()
   at Raven.Server.Rachis.RachisConsensus.WaitForCommitChangeOrThrowTimeoutException(Task timeoutTask, Task task)

That isn’t a normal timeout, which usually indicates slow I/O or network problems. This particular timeout came from the cluster-wide transaction code. The scenario was now clearer: under heavy concurrent load, RavenDB would start timing out cluster-wide transactions.

A cluster-wide transaction is more expensive than a single-node transaction, but the numbers we were seeing in this scenario were a lot lower than what we would normally expect. We dug deeper, trying to figure out what was actually going on. A number of documents were involved in those transactions, and when we looked into them, we figured out at least part of the problem.

Those were pretty big documents. A large number of those cluster-wide transactions involved documents whose size exceeded the MB range. In general, RavenDB works with such documents with no issue, but there are costs associated with large documents, especially when you mix in cluster-wide transactions. Regardless, the cluster should still remain stable, and we didn’t think the throughput the customer was seeing represented the best RavenDB could deliver.

We reproduced the client’s scenario in our local setup and simulated peak usage (admittedly, we went a bit high with the load factor here). The good news is that we were able to reproduce the same bad behavior. The first thing we noticed was those timeout exceptions: at some point, the throughput of cluster-wide transactions dropped to zero and we got nothing but exceptions.

What happened? We saturated the disk and CPU and created a convoy of requests; we dispatched more requests than we could process, and at some point all pending requests had waited for more than 15 seconds and ended up timing out. This is a fairly classic issue, I have to say. We exacerbated it with the automatic retries, which added even more load to the system.
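The convoy dynamic is easy to see with a toy queue model (a sketch for illustration, not RavenDB code): once the arrival rate exceeds the service rate, the backlog grows without bound, and retries only raise the effective arrival rate further.

```python
# Toy queue model (not RavenDB code): arrivals vs. service capacity.
def queue_depth_over_time(arrival_rate, service_rate, seconds):
    """Track queue depth per second for a server that can process
    `service_rate` requests/second while `arrival_rate` arrive."""
    depth = 0
    history = []
    for _ in range(seconds):
        depth += arrival_rate              # new requests join the queue
        depth -= min(depth, service_rate)  # server drains what it can
        history.append(depth)
    return history

# Capacity 1,000 req/s under a 1,200 req/s load: the backlog grows
# by 200 requests every second, so wait times climb until every
# request in the queue is past the 15-second timeout.
print(queue_depth_over_time(1200, 1000, 5))  # [200, 400, 600, 800, 1000]
```

With auto-retries in the mix, every timed-out request re-enters the queue, pushing the effective arrival rate even higher and deepening the convoy.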

The problem in our scenario is that this represents a significant change in behavior from single-node transactions. For a single-node transaction, we’ll only time out if the caller indicated that they have a time limit (by passing a CancellationToken). For cluster-wide transactions, the code was using a default timeout mechanism. That was configurable, but we don’t want users to have to call the database admin to change a value just because their code sends a big operation to the database.
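The behavioral difference boils down to who owns the deadline. A minimal sketch of the caller-controlled pattern (hypothetical helper name, not the actual RavenDB code) looks like this: wait indefinitely unless the caller explicitly supplies a time limit.

```python
import concurrent.futures

def wait_for_commit(future, timeout=None):
    """Wait for a commit to be applied. With no caller-supplied
    timeout (the single-node behavior), block until completion;
    only an explicit deadline from the caller can raise a timeout."""
    # timeout=None makes concurrent.futures wait indefinitely
    return future.result(timeout=timeout)

with concurrent.futures.ThreadPoolExecutor() as pool:
    f = pool.submit(lambda: "committed")
    print(wait_for_commit(f))             # no deadline: waits as long as needed
    print(wait_for_commit(f, timeout=5))  # caller opted into a 5-second limit
```

The old cluster-wide code effectively hard-wired a server-side default for that `timeout` parameter, which is why loaded clusters started throwing instead of queuing.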

So as a first step, we unified the timeout behavior of single-node and cluster-wide transactions. There is still a convoy, but since we no longer time out, the system balances the rate of incoming requests against its resource capacity. As a side effect, this prevents an avalanche scenario, where timeouts start, clients retry, and the retries make the problem worse.

With the fix in place, we were able to run a full benchmark for this scenario, hammering the system with a lot of concurrent cluster-wide transactions while using very large documents (MB range and higher). The idea was to see where we could stress the system the most. Here are the results we got out of this benchmark run:

Current rate is 1149.978768483978 req/s
Current rate is 1454.185633878873 req/s
Current rate is 1446.6754911504288 req/s
Current rate is 1470.9936847111262 req/s
Current rate is 1483.042163675997 req/s
Current rate is 1439.042229676796 req/s
Executed total of 84,271 commands in 60,052 milliseconds (1403.22 Req/s)

So, this is our baseline, and it is not a bad one. Being able to hit over 1,000 transactions / second while each transaction is multiple MB in size is not a small thing. But we need to figure out if we can do better, so let’s run it through the profiler and see exactly where we are spending all of this time.

What we see here is that the Leader of the cluster spends around two-thirds of its time waiting to acquire the write lock in order to persist data into storage. In other words, most of the time we aren’t even doing anything, just twiddling our thumbs and waiting in line. RavenDB’s storage engine has a single writer (and the semantics to ensure that this usually won’t be a problem). Spending that much time waiting for the lock is odd. Let’s see if we can figure out who is causing us to wait.
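A quick way to spot this kind of contention in your own code (a generic sketch, not RavenDB’s profiler output) is to time lock acquisition separately from the work done under the lock:

```python
import threading
import time

lock = threading.Lock()
wait_time = 0.0   # total time threads spent queued for the lock
work_time = 0.0   # total time spent doing work under the lock

def writer(payload_size):
    global wait_time, work_time
    start = time.perf_counter()
    with lock:
        acquired = time.perf_counter()
        wait_time += acquired - start    # how long we stood in line
        _ = bytes(payload_size)          # stand-in for the actual write
        work_time += time.perf_counter() - acquired

threads = [threading.Thread(target=writer, args=(1024,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"waiting: {wait_time:.6f}s, working: {work_time:.6f}s")
```

If the waiting total dwarfs the working total, as it did for the Leader here, the critical section is doing too much or too many threads are funneling through it.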

The profiler quickly found the culprit: PutAsync, which is called to add a new transaction to be processed (so that the Leader can then send it to the whole cluster).

It turns out that we were tripping over our own feet all along. Looking deeper into the profiler output, we can see some interesting details. We are spending a lot of time in ToJson and ReadObjectInternal. What are those? Those two methods do no more than convert a managed object that represents a cluster operation into the binary representation that we persist to disk. That work doesn’t have to happen under the write lock; we can absolutely move it outside the lock boundaries.
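The fix is a classic lock-scope reduction. A minimal sketch of the before/after shape (toy names, not the actual PutAsync code) looks like this: do the expensive serialization first, and only append the resulting bytes while holding the single-writer lock.

```python
import json
import threading

write_lock = threading.Lock()   # stand-in for the single-writer lock
log = []                        # stand-in for the persisted transaction log

def put_slow(command):
    # Before: serialization (the ToJson / ReadObjectInternal work)
    # happens while holding the write lock, so every other writer
    # queues up behind it.
    with write_lock:
        payload = json.dumps(command).encode("utf-8")
        log.append(payload)

def put_fast(command):
    # After: serialize outside the lock; the critical section shrinks
    # to just the append, so writers spend far less time waiting.
    payload = json.dumps(command).encode("utf-8")
    with write_lock:
        log.append(payload)

put_fast({"op": "put", "key": "users/1"})
put_slow({"op": "put", "key": "users/2"})
print(len(log))  # 2
```

Both versions persist the same bytes; the only difference is how long the lock is held, which is exactly what the profiler was measuring.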

After adjusting the code, let’s run it through the profiler again:

We can now see that the EmptyQueue method accounts for only 20% of the time; the rest of the work happens outside this method, without holding the write lock. Below is the Leader thread, where we can see that we are no longer waiting to acquire the write lock.

This is all nice, but let’s check the performance…

Current rate is 3645.6657070117467 req/s
Current rate is 3616.4546787122263 req/s
Current rate is 3726.887461465556 req/s
Current rate is 3718.4741694683144 req/s
Current rate is 3768.066180070161 req/s
Executed total of 222,521 commands in 60,001 milliseconds (3708.6 Req/s)

We got over a 2.6x performance boost for this scenario (1,403 req/s to 3,708 req/s). And while this is a somewhat pathological scenario, the change is going to have an impact across the entire set of RavenDB use cases. This optimization was done as part of a broader improvement to cluster-wide transactions.

If you are interested in the exact changes, you can find them in PR #18133. It is important to mention that this behavior was found in version 5.4 of RavenDB. In RavenDB 6.0, we redesigned how cluster-wide transactions work to eliminate the lock entirely. Beyond just fixing the problem, in 6.0 it simply cannot occur.