Following my posts about search, I wanted to narrow my focus a bit and look into the details of implementing a persistent data structure inside Voron.
Voron is RavenDB’s storage engine and forms the lowest layers of RavenDB. It is responsible for speed, safety, transactions and much more. It is also a very low level piece of code, which has a lot of impact on the design and implementation.
Some of the things that we worry about when worrying Voron code are:
- Performance – reduce computation / allocations (ideally to zero) for writes.
- Zero copies – no cost for reads.
- Safety – concurrent transactions can operate without interfering with one another.
- Applicability – we tend to implement low level features that enable us to do a lot more on the higher tiers of the code.
- Scale – handling data that may be very large, millions and billions of results.
In this case, I want to look into what it would take to implement a persistent set. If I was working in memory, I would be using Set<Int64>, but when using a persistent data structure, things are more interesting. The set we use will simply record Int64 values. This is important for a bunch of reasons.
First, Int64 is big, such values are used as file pointers, artificial ids, etc. Even though it seems limiting, we can get a lot more functionality than expected.
Second, if we are using a set of Int64, we can implement that using a bitmap. A set value indicate that the value is in the set, which allows us to do set union, intersection and exclusion cheaply. The only problem here is that a bitmap with Int64 values is… a problem. Imagine that I have the following code:
We would need to use 76GB(!) of memory to hold a bitmap for this set. That is obviously not going to be a workable solution for us. Luckily, there are other alternatives. Roaring Bitmaps are efficient in both time and space, so that is great. We just need to have an implementation that can work with a persistent model.
In order to understand how I’m going to go about implementing this feature, you need to understand how Voron is built. Voron is composed of several layers, the paging layer, which managed transactions and ACID and the data structure layer, which managed B+Trees, tables, etc.
In this case, we are implementing something at the data structure layer. And the first hurdle to jump through is decide how the data should look like. On the fact of it, this is a fairly simple decision, most of the decisions has already been made and outline in the previous post. We are going to have a sorted array of segment metadata, which will host individual segments with the set bits. This works if we have a single set, but in our case, we expect lots.
If we are going to use this for storing the posting lists, we have to deal with the following scenarios (talking about the specific documents matching the terms in the index):
- Many such lists that have a single item (unique id, date, etc)
- Lots of lists that have just a few values (Customer’s field in an order, for example)
- Few lists that have many values ( OrderCompleted: true, for example, can be safely expected to be about 99% of the total results)
- Many lists that have moderate amount of values (Each of the Tags options , for example)
That means that we have to think very carefully about each scenario. The third and forth options are relatively similar and can probably be best served by the roaring bitmap that we discussed. But what about the first two?
To answer that, we need to compute the metadata required to maintain the roaring set. At a minimum, we are going to have one SegmentMetadata involved, but we’ll also need an offset for that segment’s data, so that means that the minimum size involved has got to be 16 bytes (SegmentMetadata is 8 bytes, and a file offset is the same). There is also some overhead to store these values, which is 4 bytes each. So to store a single value using roaring set we’ll need:
- 16 bytes for the segment metadata and actual segment’s offset
- 4 bytes storage metadata for the previous line’s data
- 2 bytes (single short value) to mark the single flipped bit
- 4 bytes storage metadata for the segment itself
In short, we are getting to 26 bytes overhead if we just stored everything as a roaring set. Instead of doing that, we are going to try to do better and optimize as much as possible the first two options (unique id and very few matches). We’ll set a limit of 28 bytes (which, together with the 4 bytes storage metadata will round up to nice 32 bytes). Up to that limit, we’ll simple store the document ids we have as delta encoded varint.
Let’s say that we need to store the following document id lists:
[229, 190, 19, 144, 169, 1, 167, 14]
You can see that the first list, which is 8 bytes in size, we encoded using merely 2 bytes. The second list, composed of three 8 bytes values (24 bytes) was encoded to merely 8 bytes. Without delta encoding, that value would be decoded to: [229, 190, 19, 245, 231, 20, 156, 246, 20], an additional byte. This is because we substract from each number the previous one, hopefully allowing to pack the value in a much more compact manner.
With a size limit of 28 bytes, we can pack quite a few ids in the list. In my experiments, I could pack up to 20 document ids (so 160 bytes, without encoding) into that space with realistic scenario. Of course, we may get a bad pattern, but that would simply mean that we have to build the roaring set itself.
I’m going to go ahead and do just that, and then write a post about the interesting tidbits of the code that I’ll encounter along the way.