Implementing a file pager in Zig: Managing chunk metadata December 28, 2021 Author: Oren Eini, CEO RavenDB
Read the orginal blog post

Implementing a file pager in Zig: Managing chunk metadata

by Oren Eini, CEO RavenDB December 28, 2021

The file pager needs to know what values it has in memory and what it needs from the disk. Instead of tracking values on a per page level, we are going to do that on a chunk basis, where each chunk in 2MB (256 pages). A single file is going to be limited to 8 GB in size, so we have a maximum of 4,096 chunks in a file. We can allocate a simple array of metadata for the entire file in a single shot. That means that we don’t have to do reallocation when we grow the size of the file (up to the 8GB maximum). Let’s consider what metadata we need to know about the chunks we have:

  • What is the status of the chunk (in memory, on the disk, being loaded or errored).
  • How many outstanding references we have for a chunk?
  • Where do we find the actual chunk data in memory, when it is loaded?

The whole thing is made complex because we have to consider concurrency. Multiple threads may try to load a chunk at the same time, we may need to release the memory of a chunk to make room for loading another, etc. We also need to consider issues such as I/O failures, optimizing I/O patterns, etc. For now, I/O will be handled by another post. I want to focus just on how we will deal with the metadata.

A major PITA with concurrency is how to handle reference tracking. If a thread is reading from a chunk, we cannot release it. That leads us to reference counting, but that is tough to do atomically. You have to deal with the ABA problem, to start with. For that reason, we want to limit chunk metadata to 8 bytes in total. This will allow us to use atomic instructions to modify the metadata safely.

Using just 8 bytes is a very low amount. We know that the chunks we’ll use are 2MB in size. We can assume that we’ll also align them on 2MB boundary. That means that the lower 20 bits are unused, we can repurpose them. On x64 and ARM64, the top 16 bits are also unused (not always true, since from 2019 we have IceLake that has PML5, which uses 57 bits, but very likely to be the case). In most systems, the 47th bit will be used for kernel vs. user memory, so that will be cleared as well. That means that we actually only need 64 – 17 – 20 = 27 bits to store the pointer value. We can repurpose the other 37 bits.

There are actually several ways in which we can do this. The compressed pointer method is just one of them. I decided to not go that route. Instead, we are going to have the following structure:

This is a packed bit field struct which can fit into a 64 bits value. Note that we have fields for the type of the value, the version (for ABA) and the number of references. In addition to that, we also have the actual value, which is specified in offsetInPages. Let’s talk about sizes here.

  • The tag field has four options, as you can see.
  • The version field is 16 bits, which means that it can have 65,536 possible values. It will be incremented on every change to the value and used to avoid false successes when updating the value concurrently.
  • The references field is 20 bits in size, giving us 1 million values here. That is the number of concurrent references that it can support. That looks like big enough value that we shouldn’t care about it.
  • The offsetInPages field is 26 bits in size. Assuming 4 KB pages, we can reference up to 256 GB of memory. We’ll want to support machines with higher memory than that, which is why we’ll also add the concept of base. For a single file, all the allocations must come in the same 256 GB range. I don’t expect that to be a big problem, and different files can have different bases.

The fact that all of that fits in 64 bits means that we can use simple Compare & Swap atomic operations and avoid the need for 128 bits atomic instructions. To be fair, cmpxchg16b has been around forever. I believe that you can do that on ARM as well, but I’m not sure how.

At any rate, let’s look at the ChunkMetadata struct in all its glory, then we’ll discuss what is going on:

The ChunkMetadata can be in one of four states:

  • Empty – there is no value
  • Error – we tried to load the chunk, but failed for some reason. In that case, the actual error code is stored in offsetInPages.
  • Loading – we are currently loading the chunk, and callers can decide to wait for this or try again later.
  • Value – there is a value in the chunk and it is available immediately.

When we get() a value we check what the current state of the metadata is and in all but the Value case we’ll return immediately. If there is a value, we can’t just return it to the caller. We need to increment the reference count. That is most of the code in the get() method. We increment the references, do a wrapping increment for the version (so each change will be unique) and then use an atomic operation to update the value. The idea is that two concurrent threads getting the value at the same time will always increment or decrement the references properly. That will be quite important later on.

After you are done with the chunk, you can release() it, which will decrement the reference count. Note that reference count of 0 is wrong, we aren’t handling actual releasing of values yet. That will come in another post.

The trySet() function is responsible for the other side, it will set the value or the error, taking care of the concurrency aspects of the ChunkMetadata. Of particular interest here, however, is the Futex.wake() call. That deserves some detail.

Consider the sequence of events for accessing a chunk. We may have two threads that try to get a chunk, but they find that it is not resident in memory. It needs to be loaded, but we don’t want both threads to do so at once. Therefore, the threads will compete on moving the chunk from the Empty state to the Loading state. After which, the thread that won the race will need to schedule the actual I/O. What does the other thread do in the meantime? It needs to wait until the I/O is completed. This is done using the waitForValue() method, where we interpret the first half of the chunk metadata (the one holding the version field) as a Futex.wait  value. In other words, the thread will sleep until the trySet() call will wake it.

That is enough talking about the ChunkMetadata, I think. We went over that in detail, for my next post, I want to talk about how we deal with what is likely to be the most interesting bit of the file pager, managing the actual chunks.

Woah, already finished? 🤯

If you found the article interesting, don’t miss a chance to try our database solution – totally for free!

Try now

Share your thoughts or ask us a question

Categories: