Building extendible hash leaf page–Part II
by Oren Eini, CEO RavenDB

I wrote before about an idea for designing the leaf page of an extendible hash page. The idea was to arrange things on a cache line size, which would reduce how much we’ll need to search inside the page. On the other hand, it means that we might need to split whenever a particular 64 bits range is full.

I went ahead an implemented that, just to see how it would work in practice. You can read the code here. Please note, this is testing code, so the point is to see how things work, not so much to do things properly. One really useful piece you’ll find there is that I’m outputting the structure of the hash to a graph. Being able to visually inspect the state of the system at various points has been invaluable to me.

I have tested the implementation with two types on inputs. First, we have sequential keys and values. That is what I expect will be the most common use case that I have for this. It worked, but storing one million items results in just over 8 MB of memory being used. Given that my key & value are 8 bytes each, that isn’t too bad. In fact, that is pretty great. If we were storing this as an array, we would need about 15.2MB, so there are spacing savings there as well.

Looking at the internal structure of the hash table, we see that all 64 bytes pieces hold some values (all between 42 – 48 bytes). That is pretty great all around, open the champagne.

I decided to see what kind of behavior we’ll see when we have a random distribution. The situation was… worse. In fact, storing one million random keys and values takes 147MB. Yes, that is for when you can store the whole data in just 15.2 MB in its raw form.

To be fair, one of the advantages that we have for sequential values is that we use varints to store the values, so they compress well. With random values, we don’t really benefit from that. I tested this again with different ranges of values. With random values up to 2^32, we get a hash table that over 44MB in size and when we limit the range to ten millions, we use “merely” 27MB.

The issue is fairly simple, the moment we have a 64 bytes range full, we are forced to do a page split. That lead us to high fragmentation, which is not desirable at all. With random distribution, it is likely that we’ll hit the 64 bytes capacity easily on each page, leading to very high fragmentation. Amusingly enough, the sequential mode will not have this, because this will spread the values neatly along the different pieces in a page.

The answer to that is to move to an open addressing mode inside the page. In other words, when we have a 64 bytes piece that is full, we’ll mark it as having overflowed and put the value on the subsequent piece. That would allow us to spill over the values across the entire page much more efficiently. It does complicates deletions, but we’ll manage.

Implementing that feature took a little bit, but here is the result. For the case of random keys and values, we go to a total memory usage of 32 MB, a very significant improvement. There has been no change for the sequential mode, which was already optimal for our scenario.

When values are in the 2^32 range, we use 16 MB and for ten million range, we get 8MB. These numbers are pretty awesome, even if I say so myself.

For comparison, I wrote this C# program, which merely store 1 million 16 bytes key pairs. That is basically what I’m doing in my code, and it was interesting to compare the memory consumption between the two. The C# version took 37.2 MB, compared to the bad result I had of 32MB in my code.

Looking at the actual overflow count, on the random data case, it look like we have the following stats:

  • Max overflow chain (overflow pieces that are next to one another): 29
  • Total number of chains: 107,830
  • Average chain length: 2.35

In other words, we have a 10% chance in this case of hitting an overflow chain (meaning that we need to check more than 64 bytes of data to see if the key exists). When this happens, we’ll usually scan 128 – 192 bytes, but may scan up to 1.8 KB of sequential memory. There are a total of 1,388 cases where we have more than 10 consecutive overflow pieces (1% chance of hitting that) and only 64 cases where the chain is greater than 20.

Overall, these numbers looks good to me.

As usual, comments are welcome. I would appreciate any feedback you may have.

LIVE
DEMO
Request Live Demo

LIVE DEMO

Request a customized
presentation of RavenDB

Request Live Demo

Request a customized presentation of RavenDB