At long last, we are now at the point where we can write data back to the disk. Before we can do that, however, we need to figure out what sort of writes we want to allow. The idea that I have in mind for the Pager is to follow the same path as Voron does. In other words, for writes, we can make the following assumptions:
- There is only a single thread doing writes to the pager.
- There are no readers for the writes until the write is completed.
- There is a distinction between writing the data to the pager and writing the data to the disk.
Let’s break those assumptions apart and see what they bring to the table. The fact that we can assume only a single writer thread at any given point is pretty important. It means that the level of complexity that we have to face is greatly reduced. In the same sense, the fact that we don’t need to deal with concurrent readers or any consistency boundary for the data while it is being written will greatly simplify things for us. Finally, we make a distinction between writing to the pager and writing to the disk. Writing to the disk is _slow_, so we want to avoid doing that at any critical areas and push that to the background.
Finally, there is another aspect to consider. Internally, the Pager works with 2MB chunks, but to the outside world, it is using 8KB pages. When we write, we always write at the 8KB pages, not chunks. How would that work for the Pager?
The Pager itself is concurrent, but we only allow a single writer at a time, we can achieve this by centralizing all the write activities in the Writer struct, like so:
For now, I’m going to ignore the fields in the Writer struct, we’ll touch on them in detail later. In order to use the writer, you need to acquire it, write as many pages as you need, then release it. Here is a usage example:
The basic idea is fairly simple. With the writer, we operate at the page boundary to write as many pages as we need, once we are done, the call to flushWrites() persists the data to disk and then we can release the writer. Let’s dig a bit deeper and see how that works, shall we?
The write() call is about as basic as you can get. We use the getPage() function to get the right page, memcpy the data and that is about it, right? There are only two other things here that are important:
- We record which chunk (the 2MB chunk of memory, out of which we carve the 8KB pages) at the writer’s level, is using the loadedChunksForWrites value.
- We remember which pages we wrote to using the writtenPages hash table.
This is intentionally bare bones, because that is actually sufficient for our needs. The fact that we remember which chunks we loaded (and keep a reference to them) will prevent us from reclaiming them, so even though we just wrote to memory, another thread can get the data and start using it without waiting for the disk. Of course, we still need to hit the disk eventually, that is what flushWrites() is about.
There is a lot that is going on here, let’s break it up. We start by allocating a temporary array and copying the keys from the writtenPages hash table to it. We then sort the array. This is done so we’ll be able to process the writes in a sequential manner, which is likely to be faster, even with async I/O. We then scan the list of pages in order, trying to merge writes together. The idea is to issue the minimum number of write calls. Finally, we’ll wait for all the writes to complete. Okay, maybe it isn’t that complex. There is a bunch of code here, but it is mostly straightforward. Note that we also prepare the writeResults list to accept the results of the write to the disk.
As for writing to the disk, this is done using the PagerRing we previously looked at:
To write a buffer to the disk, we simply get the buffer from the Pager (reusing all the work we did in getPage()), increment the number of outstanding writes and then submit the work for the ring for processing. We setup the completeFlush as the callback function on completion. The PagerRing will call us when it is done writing to the disk. If there is an error, we’ll record it and reduce the number of outstanding writes. If there are no more outstanding writes, we’ll wake any waiters. That part is handled in the waitForAllDiskWritesToComplete().
We start by waiting for the outstanding writes to complete, waiting if needed. Then we can reset the state of the Writer. We start by resetting the written pages and then iterate over all the loaded chunks and release them. After the call, the Pager may decide to remove them from memory. This is fine, since they were already written to disk.
Except… if there was an error. You might have noticed that we are gathering the errors on each individual write operation we send, but we are actually only looking at the first one. For that matter, we clear the state of the Writer regardless if there were errors or not.
In general, an I/O error from the disk is not something that is recoverable. What you can do at this stage is to raise the error higher and run whatever recovery you have on startup.
In the next post, I’m going to be talking about durability and the overall expected performance of the system under this sort of model.