After implementing the memory management in the previous post, I set out to handle the actual I/O primitives that we need. As a reminder, we are separating the concerns here. We managed memory and reference counting in the previous post and now I want to focus on how we can read and write from the disk in as efficient a manner as possible. Before we get to the guts of the code, I want to explain a bit about the environment that I have in mind. Most of the time, the pager is able to provide the requested page from memory directly. If it can’t do that, it needs to consider the fact that there may be multiple threads that are trying to load that page. At the same time, while we are loading the page, we want to be free to do other things as well.
I decided to implement the I/O routine using async I/O. Here is the rough sketch of the API I have in mind:
The idea is simple, we use a struct to which we can submit work in an asynchronous manner. At some later point in time, the work will complete and our read or write will be done. At that point we’ll be able to invoke the provided callback for the user. The code above is about as simple as you can manage, it spawns a dedicated thread to manage the I/O and then just issues those operations directly. To save myself some headache, I’m using an eventfd as a synchronization mechanism, I don’t strictly need this, but it will be useful down the road.
In terms of the API, I can now write the following code:
The basic idea is that the BackgroundRing struct doesn’t manage file descriptors or buffers. It is a strict way to manage I/O. The API is pretty piss poor as well, in terms of usability. No one will ever want to write generic I/O routines using this method, but we aren’t trying to do generic I/O, we are trying to ensure usability in a very constrained manner, inside the pager.
About the only nice thing that we do in this implementation is handle partial reads and writes. If we were asked to read more than what we got, we’ll repeat the operation until we get to the end of the file or succeed.
In terms of implementation, as well, this is a really bad idea. We look like we are doing async I/O, but we are actually just passing it all to a background thread that will do the work off a queue. That means that it will be very hard to make full use of the hardware capabilities. But I think that you can guess from the name that I’m not going to leave things like that. I’m going to use the new io_uring API in Linux to handle most of those concerns. That idea is that we’ll allocate a command buffer in the kernel and allow the kernel to handle asynchronous execution of the I/O operations. We still retrain the same rough structure, in that we are going to have a dedicated background thread to manage the commands, however. Amusing enough, the io_uring API is meant to be used from a single thread, since otherwise you’ll need to orchestrate writes to the ring buffer from multiple providers, which is much harder than a single consumer, single producer scenario.
The use of io_uring is also why we are using the eventfd model. We are registering that file descriptor in the io_uring so it will let us know when event completes. This also does double duty as the method that we can use to wake the background thread when we have more work for it to do. The most major change is inside the background worker, of course. Here is how this looks like:
We create the ring in the init function (see full code listing below) and in the background thread we are simply waiting for an event using the eventfd. When a caller submits some work, we’ll register that on the io_uring and wait for it to complete (also using the eventfd). You can see that I’m handling some basic states (partial reads, full queues, etc). The code itself ends up being pretty small. Once we are done with the operation, we let the user know about the completion.
There are a few things here that are interesting to note. We are actually allowing interleaving of operations, so we may have many outstanding operations at any given point. We aren’t trying to guarantee any kind of ordering between the operations, nor are we providing anything but the most bare bones interface for the caller. Even so, there is quite a bit of power in the manner in which we are working here. We need to complete a few more components first, and then we can bring it all together…
Here is the full listing of the PagerRing code. In the next post, I want to focus on the actual overall pager, when we are managing multiple files and how we work with all of them together. In particular, we want to understand how we manage the memory budget across all of them.