Implementing a file pager in Zig: Reading from the disk

by Oren Eini

We have finally gotten to the point where we can ask the pager for a page of data (reminder, a page in this case is 8KB of data) and get it back. Here is what this looks like:

There are a few things to note here. There is a new concept here, the transaction. We don’t want to have that concept at the pager level, so we just pass the relevant state to the pager to deal with. Basically, any transaction we’ll have will need to have a bit of state for the pager to deal with it. The state we need at this point is minimal, just the list of chunks that are referenced by the pager. You can also see that we provide a timeout for the getPage() call. What is that about?

This leads to the second concern we need to consider. How are we expected to run this? If we call getPage() on a page (actually, a chunk containing this page) that isn’t resident in memory, we’ll need to go to the disk to read it. That can take a while, sometimes a long while. At a glance, that is one of those things that async/await was meant for. Since Zig supports async functions, that is certainly something that we can do, but it is something that I want to be cautious about. Having explicit blocking is far easier to understand and debug, at least for now. This is especially true if we’ll want to consume the pager API from a non Zig target.

That leads to an interesting issue, however. If a call to getPage() can block, how can we avoid blocking the thread. In most cases, we would like to avoid blocking, after all. It would be simple to have tryGetPage() method, which will not block (but schedule the load of the page from disk), and then maybe register for a notification for that. If that sounds like async to you, that is because it is. The problem with this sort of approach is that you need to suspend execution somewhere in the middle of a transaction operation and continue when the data is loaded. Without async/await, you can’t really do that. Well, I mean, you could try, but we have a lot of experience with trying to manage state via callbacks, that isn’t really going to work for anything beyond the simplest systems (see: node.js without async/await).

There is one thing that we can do that would be both simple and effective, however: we can error if the page isn’t in memory. That sounds like a pretty bad idea, no? How would that help us?

Well, the concept is simple. If a transaction attempts to access a page that isn’t resident in memory, we’ll do the following operations:

  • Schedule the chunk the page resides on to load into memory.
  • Return an error from the pager
  • Rollback the transaction
  • Keep the loadedChunks for that transaction active and wait for the chunk to be loaded
  • Re-run the transaction again, now the chunk is in memory and we can proceed further

Each time that we re-run the transaction, we make sure that the chunks it needs are in memory, eventually ensuring that all the required chunks are resident and we don’t need to block.

At the same time, the code to work with the transactions is not going to care about blocking, etc. We need to do the usual error handling, but that is required anyway. There is a single location where we need to deal with callbacks from the pager, so there is a limited blast radius of complexity. For write transactions, for example, this is a very reasonable strategy. We assume that there is only a single thread writing at a given time. A transaction being blocked because it needs to read a page from the disk can stall other pending transactions. By having it abort and retry later, we can keep the line moving. For read operations, on the other hand, that is likely not something that you want to do. If I’m already streaming results to the caller, I can’t just repeat the transaction.

I’m not making any decisions at this point, just considering the various options and implications that we have to deal with at this early level.

Now, let’s look at how the getPage() is actually implemented, shall we?

There is a lot that is going on here, I know. We start by defining a set (a hash map using a 64 bits unsigned integer to a zero-sized value). The way this works with Zig is quite elegant, since we pay no memory cost for the values here.

The majority of the work is done in the loadChunksToTransaction() function, which we’ll examine shortly, but you can see some interesting details already in getPage(). We assume that we have a page loaded, and any range of pages that we ask is always within a single page.

The call to load the chunks actually puts them in the loadedChuncks argument. We verify that we loaded the chunk properly and then we create a slice to return for the caller. Note that we may request more than a single page and it is valid to ask for a range that contains multiple chunks. We validate that the range we return is within the memory range for the current file, we ensured that the chunks for a specific file are consecutive in memory, so we can safely return this pointer across multiple chunks without needing to think about it.

There is another aspect of loadedChunks that we need to discuss. A transaction may use multiple pages from the same chunk, but we only need to load the chunk once. At the same time, we can avoid adding a reference to the chunk multiple times for each loaded page. When we close the transaction, we need to release the reference for these chunks, so we need to keep track of those. With that in mind, let’s see how we actually load the chunks to memory.

As a reminder, we have two actors working together here. The FileChunks is used to store the chunks in memory and the PagerRing is used for parallel I/O.

That is a lot of code to throw at you, I know, let’s dissect it in detail. In this method, we are working on chunks, not pages, and we assume that we may have multiple chunks, that is why we have the while loops. We start by checking if the chunk is already loaded in the transactions’ loadedChunks. If it isn’t, we compute the position of the chunk in the file (the chunk number we get from the caller is the global one, after all) and try to get it from the FileChunks. This is where things get interesting.  When we call tryGet() for the current chunk, we may get an error because of two possible scenarios:

  1. The value is currently being loaded from the disk (some other transaction asked for it, probably). We don’t need to do anything further other than wait for it to show up.
  2. Another transaction tried to load it, but got an error. At this point we just return the error. We don’t try to do anything special here. In general, there may be a lot of errors to consider here. We may have temporary I/O issue, or run out of memory or something that is transient. Or we may have an actual problem at hand (bad sector on disk, corrupted data, etc). Regardless of what we are doing, we aren’t going to try to do any error handling here. We’ll just record the error and any future attempt to access that chunk will also error. The proper way to recover at this point is to restart the pager. This is assuming we have the other components of a database at play here. So we’ll re-run the journal files, apply recovery, etc. In short, any I/O issues like that are critical errors and require a restart of the system to come back to a known state.

If the tryGet() method returned without an error, there are still two options to consider. The call may have returned a value (so we called addRef() on the chunk internally), we can simply add that to the chunks we own and move on. If there isn’t a value in memory, things start to get interesting. At this point we call markLoading(). We are basically racing to be the owners for loading this chunk. If we are successful in this race, we’ll get the buffer back from the FileChunks and can schedule reading the relevant chunk from the disk. You’ll note that we are setting the callback to completeLoad, we’ll look into that shortly. If we aren’t successful (we didn’t get a buffer back), then some other thread was able to get the buffer and will schedule the read for us, so we are done.

After we either ensured that all the chunks are loaded or scheduled them to be loaded, we use getBlocking() to wait for all the relevant chunks to be available. Once that is done, we can safely return and getPage() will complete the process, as we saw earlier.

The only thing that we have to look at is the completeLoad function, which is about as basic as you can get:

Most of the function is about error handling. We register either the fact that we got an error reading from the disk or that we completed the load process and maybe log something. In general, there isn’t really much that we need to do here. The act of calling markLoaded() will release any threads waiting on getBlocking(), after all. So the whole thing comes together quite nicely.

With this done, we are mostly done on the reading side of the pager and this post as well. In my next post, I want to discuss how we should handle eviction of data. So far, we are just reading into memory, never releasing. We need to take care of that as well, of course. Once that is done, we can move to the wonderful topic of handling writes and durability…

Woah, already finished? 🤯

If you found the article interesting, don’t miss a chance to try our database solution – totally for free!

Try now try now arrow icon