In the last blog post I presented the manner in which the Pager can write data to disk. Here is a reminder:
We acquire the writer (and a lock on it), call write on the pages we want to write on (passing the buffer to write on them), and finalize the process by calling flushWrites() to actually write the data to disk. As a reminder, we assume that the caller of the Pager is responsible for coordination. While we are writing to a specific page, it is the responsibility of the caller to ensure that there are no reads to that page.
The API above is intentionally simplistic , it doesn’t give us a lot of knobs to play with. But that is sufficient to do some fairly sophisticated things. One of the interesting observations is that we split the process of updating the data file into discrete steps. There is the part in which we are updating the in memory data, which allows other threads to immediately observe it (since they’ll read the new details from the Pager’s cache). Separately, there is the portion in which we write to the disk. The reason that I built the API in this manner is that it provides me with the flexibility to make decisions.
Here are some of the things that I can do with the current structure:
- I can decide not to write the data to the disk. If the amount of modified pages is small (very common if I’m continuously modifying the same set of pages) I can skip the I/O costs entirely and do everything in memory.
- Flushing the data to disk can be done in an asynchronous manner. In fact, it is already done in an asynchronous manner, but we are waiting for it to complete. That isn’t actually required.
The way the Pager works, we deposit the writes in the pager, and at some future point the Pager will persist them to disk. The durability aspect of a database is not reliant on the Pager, it is a property of the Write Ahead Log, usually.
If I wanted to implement a more sophisticated approach for writing to the disk, I could implement a least recently used cache for the written pages. When the number of pages in memory exceeds a certain size, we’ll start writing the oldest to disk. That keeps the most used pages in memory and avoids needless I/O. At certain points, we can ask the Pager to flush everything to the disk, this gives us a checkpoint, where we can safely trim the Write Ahead Log. A good place to do that is whenever we reach the file size limit of the log and need to create a new one.
So far, by the way, you’ll notice that I’m not actually talking about durability, just writing to the disk. The durability aspect is coming from something we did long ago, but didn’t really pay attention to. Let’s look at how we are opening files, shall we:
Take a look at the flags that we pass to the open() command, we are asking the OS to use direct I/O (bypassing the buffer pool, since we’ll use our own) as well as using DSYNC write mode. The two together means that the write will skip any buffering / caching along the way and hit the disk in a durable manner. The fact that we are using async I/O means that we need to ensure that the buffers we write are not modified while we are saving them. As we currently have the API, there is a strong boundary for consistency. We acquire the writer, write whatever pages we need and flush immediately.
A more complex system would be needed to manage higher performance levels. The issue is that in order to do that, we have to give up a level of control. Instead of knowing exactly where something will happen, we can have a more sophisticated approach, but we’ll need to be aware that we don’t really know at which point the data will be persisted.
At this point, however, there is a good reason to ask, do we even need to write durably? If we are limiting the consistency of the data to specific times requested by the caller (such as when we replace the Write Ahead Log), we can just call fsync() at the appropriate times, no? That would allow us to use buffered writes from most I/O.
I don’t think that this would be a good idea. Remember that we are using multiple files. If we’ll use buffered I/O and fsync(), we’ll need to issue multiple fsync() calls, which can be quite expensive. It also means higher memory usage on the system because of the file system cache, for data we determine is no longer in memory. It is simpler to use direct I/O for the whole thing, after all.
In the next post, I’m going to show how to implement a more sophisticated write-behind algorithm and discuss some of the implications of such a design.