Optimizing a three-way merge
Deep inside of the Corax indexing engine inside of RavenDB there is the notion of a posting list. A posting list is just an ordered set of entry ids that contains a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:
For fun, go and ask ChatGPT to write you the code for this task.
You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I’m trying to do):
If you look at this code in terms of performance, you’ll realize that this is quite expensive. In terms of complexity, this is actually pretty good, we iterate over the arrays just once, and the number of comparisons is also bounded to the lengths of the list.
However, there is a big issue here, the number of branches that you have to deal with. Basically, every if and every for loop is going to add a tiny bit of latency to the system. This is because these are unpredictable branches, which are pretty nasty to deal with.
It turns out that the values that we put in the posting list are actually always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it. Here is the new logic:
This code was written with an eye to being able to explain the algorithm, mind, not performance.
The idea goes like this. We flag the removals with a bit, then concatenate all the arrays together, sort them, and then do a single scan over the whole thing, removing duplicates and removals.
In the real code, we are using raw pointers, not a List, so there are no access checks, etc.
From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (O(NlogN) operation) then scan it again?!
How can that be faster than a single scan across all three arrays? The answer is simple, we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explain how that is achieved.
Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are no branches at all there. The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly.
This approach also has another really important aspect. Take a look at the actual code that we have in production. This is from about an hour worth of profiling a busy indexing session:
And the more common code path:
In both of them, you’ll notice something really important. There isn’t a call to sorting at all in here. In fact, when I search for the relevant function, I find:
That is 25 ms out of over an hour.
How can this be? As efficient as the sorting can be, we are supposed to be calling it a lot.
Well, consider one scenario, what happens if:
There are no removals
All additions happen after the last existing item in the list
In this case, I don’t need to do anything beyond concatenate the lists. I can skip the entire process entirely, just copy the existing and additions to the output and call it a day.
Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn’t super friendly for humans to read, but for the CPU, this is chump change.