This is my 7th post in this series, and we are now starting to get into the really interesting bits. So far we worked with the individual components, each of them doing just one thing, which makes it hard to see how they are all put together. In this post, we are going to start plugging things together.
As a reminder, the whole point of this series of posts is to explore what we need to do in order to ask for a page (8KB, in our case) from the data file and work with it in memory. We have most of everything ready, let’s put them back together. The first thing that we need to do is actually tie a file to the FileChunks structure that we previously created. This is as simple as this structure:
We are simply initializing the FileChunks and opening the file, nothing else. Remember that a single pager is going to be responsible for multiple files. Furthermore, all the data structures that we are dealing with now are meant for concurrent use. That means that we should be prepared for this type of code:
When we use a managed language, that is actually fairly simple to work with. In an unmanaged language, those two lines of code are tough. Why is that? Let’s look at the raw data members for the Pager structure, shall we?
Of particular interest to us is the files member. This is the list of files that are being managed by the pager. Each one of them has a maximum size of 8GB in size. The problem is that we may have one thread accessing the list at the same time that another thread wants to increase its size. How would that work?
The simplest option is that we’ll reallocate the array, but that will move it. What would the first thread be doing in that scenario? The good thing from our point of view is that we don’t need to worry about concurrent modifications. There is only a single thread that is allowed to modify the Pager’s state at any given point in time.
Trying to find solutions for this problem leads into a rabid rabbit’s hole. You go into hazard pointers, epoch GC and other fun stuff. Also, the call to getPage() is one of the most important ones in a database, anything that we can do to reduce its cost will be done. As such, we can’t typically use any form of locking. A reader/writer lock can be a killer. Here is a good example for how that can happen.
I thought about how to resolve this, and I decided against a generic solution, instead, let’s look at the actual behavior that we need. The files array is going to be accessed a lot, it has an entry per 8GB of disk space that we take. That means that it isn’t going to be experiencing any rapid growth. It is also only going to grow. We also need to worry only when we grow the physical backing store for this array, if we overprovision and use less than we need, that is perfectly fine. Each element in the array is 8 bytes in size, so if we allocate a single memory page (4KB) we can store 512 file references in it. That represents 4 TB(!) of data, so we can probably just accept the additional cost and allocate it and not think about it.
Databases with > 4TB of disk size do exist, and we don’t want to have this artificial limit on us, do we? Instead, we can use another approach. We’ll start by allocating the array with a minimum size of 8 elements (sufficient until you get to 64GB). But what happens when we reach that size?
What we do here is cheat. We don’t need to free the memory immediately, when we reach the limit of the size of the array, we’ll double its size, copy the data that we have and register it to be freed when we close the Pager. At that point, the caller already needs to ensure that there are no other users of the pager.
Because we copy the values from the old array, but keep it around, old readers may use the old or new arrays, but we don’t actually care. The memory remains valid and accessible. In terms of wasted space, if our database went from a single file to being 128 TB in one run, we’ll have an array with 16,384 elements (whose size is 128KB). Along the way, we had to double the size of the array a dozen times and we “waste” 128KB of unused buffers. This seems like a pretty reasonable cost to significantly reduce the level of complexity of concurrent access. Using this method, we can avoid any sort of synchronization on the read side. That is certainly a plus.
Here are the init() and deinit() calls for the Pager, to complete the picture:
As you can see, we allocate a PagerRing for the pager, which will deal with the actual I/O. The actual disposal of the files array is managed using the pendingFree list. That is a small cost to pay, to reduce the cost of adding a new file. In the deinit() routine, note that there is a distinction between releasing the files themselves (where we close the FilesChunk, release the relevant memory, close the file handle, etc) and releasing the arrays that hold the files themselves.
I’m quite pleased with how this turned out, zero cost for reads (important) and negligible memory cost for most scenarios).
In my next post, I’ll get started with actually reading the data from disk and putting that in the pager. So far, that is a 7 post series, and we haven’t completed the first part. That simple scenario is surprisingly tricky.