High performance .NET: Building a Redis Clone– the wrong optimization path
Now that I’m done with the low hanging fruits, I decided to shift the Redis implementation to use System.IO.Pipelines. That is a high performance I/O API that is meant specifically for servers that need to eke out all the performance out of the system.
The API is a bit different, but it follows a very logical pattern and makes a lot of sense. Here is the main loop of handling commands from a client:
The idea is that we get a buffer from the network, we read everything (including pipelined commands) and then we flush to the client. The more interesting things happen when we start processing the actual commands, because now we aren’t utilizing StreamReader but PipeReader. So we are working at the level of bytes, not strings.
Here is what this (roughly) looks like, I’m not showing the whole thing because I want to focus on the issue that I ran into:
The code is reading from the buffer, parsing the Redis format and then executing the commands. It supports multiple commands in the same buffer (pipelining) and it has absolutely atrocious performance.
Yes, the super speedy API that is significantly harder to get right (compared to the ease of working with strings) is far slower. And by far slower I mean the following, on my development machine:
- The previous version clocks at around 126,017.72 operations per second.
- This version clocks at less than 100 operations per second.
Yes, you read that right, less than one hundred operations per second compared to over hundred thousands for the unoptimized version.
That was… surprising, as you can imagine.
I actually wrote the implementation twice, using different approaches, trying to figure out what I was doing wrong. Surely, it can’t be that bad.
I took a look at the profiler output, to try to figure out what is going on:
It says, quite clearly, that the implementation is super bad, no? Except, that this is what you are supposed to be using. So what is going on?
The underlying problem is actually fairly simple and relates to how the Pipelines API achieves its performance. Instead of doing small calls, you are expected to get a buffer and process that. Once you are done processing the buffer you can indicate what amount of data you consumed, and then you can issue another call.
However, there is a difference between consumed data and examined data. Consider the following data:
*3 $3 SET $15 memtier-2818567 $256 xxxxxxxxxx ... xxxxxx *2 $3 GET $15 memtier-7689405 *2 $3 GET $15
What you can see here is a pipelined command, with 335 bytes in the buffer. We’ll process all of those commands in a single hit, except… look at the highlighted portion. What do we have there?
We have a partial command. In other words, we are expected to execute a GET with a key size of 15 bytes, but we only have the first 4 bytes here. That is actually expected and fine. We consumed all the bytes until the highlighted portion (thus letting the PipeReader know that we are done with them). The problem is that when we issue a call now, we’ll get the highlighted portion (which we didn’t consume), but we aren’t ready to process that. Data is missing. We indicate that to the PipeReader using the examined portion. So the PipeReader knows that it needs to read more from the network.
However… my code has a subtle bug. It will report that it examined the yellow highlight, not the green one. In other words, we tell the PipeReader that we consumed some portion of the buffer, and examined some more, but there are still bytes on the buffer that are neither consumed nor examined. That means that when we issue the read call, expecting to get data from the network, we’ll actually get the same buffer again, to do the exact same processing.
Eventually, we’ll have more data in the buffer from the other side, so the correctness of the solution isn’t impacted. But it will kill your performance.
The fix is really simple, we need to tell the PipeReader that we examined the entire buffer, so it will not do a busy wait and wait for more data from the network. Here is the bug fix:
With that change in place, we can hit 187,104.21 operations per second! That is 50% better, which is awesome. I haven’t profiled things yet properly, because I also want to address another issue, how are we going to deal with the data from the network. More on that in my next post.