One of the things that I find myself paying a lot of attention to is the error handling portion of writing software. This is one of the cases where I’m sounding puffy even to my own ears, but from over two decades of experience, I can tell you that getting error handling right is one of the most important things that you can do for your systems. I spend a lot of time on getting errors right. That doesn’t just mean error handling, but error reporting and giving enough context that the other side can figure out what we need to do.
In a secured protocol, that is a bit harder, because we need to safeguard ourselves from eavesdroppers, but I spent significant amounts of time thinking on how to do this properly. Here are the ground rules I set out for myself:
- The most common scenario is client failing to connect to the server.
- We need to properly report underlying issues (such as TCP errors) while also exposing any protocol level issues.
- There is an error during the handshake and errors during processing of application messages. Both scenarios should be handled.
We already saw in the previous post that there is the concept of the data messages and alert messages (of which there can only be one). Let’s look how that works for the handshake scenario. I’m focusing on the server side here, because I’m assuming that this one is more likely to be opaque. A client side issue can be much more easily troubleshooted. And the issue isn’t error handling inside the code, it is distributed error handling. In other words, if the server has an issue, how it reports to the client?
The other side, where the client wants to report an issue to the server, is of no interest to us. From our perspective, a client can cut off at any point (TCP connection broke, etc), so there is no meaning to trying to do that gracefully or give more data to the server. What would the server do with that?
Here is the server portion of establishing a secured connection:
I’m using Zig to write this code and you can see any potential error in the process marked with a try keyword. Looking at the code, everything up to line 24 (the completeAuth() call) is mechanically sending and receiving data. Any error up to that point is something that is likely network related (so the connection is broken). You can see that the protocol call challenge() can fail as does the call to generateKey() – in both cases, there isn’t much that I can do about it. If the generateKey() call fails, there is no shared secret (for that matter, it doesn’t look like that can fail, but we’ll ignore that). As for the challenge() call, the only way that can fail is if the server has failed to encrypt its challenge properly. That is not something that the client can do much about. And anyway, there isn’t a failing codepath there either.
In other words, aside from network issues, which will break the connection (meaning we cannot send the error to the client anyway), we have to wait until we process the challenge from the client to have our first viable failure. In the code above, I”m just calling try, which means that we’ll fail the connection attempt, close the socket and basically just hang up on the client. That isn’t nice to do at all. Here is what I replaced line 24 with:
What is going on here is that by the time that I got the challenge response from the client, I have enough information to send derive the shared key. I can use that to send an alert to the other side, letting them know what the failure was. A client will complete the challenge, and if there is a handshake failure, we proceed to fail gracefully with meaning error.
But there is another point to this protocol, an alert message doesn’t have to show up only in the hand shake part. Consider a long running response that run into an error. Here is how you’ll usually handle that in TCP / HTTP scenarios, assume that we are streaming data to the client and suddenly run into an issue:
How do you send an error midstream? Well, you don’t. If you are lucky, you’ll have the error output and have some way to get the full message and manually inspect it. That is a distressingly common issue, by the way, and a huge problem for proper error reporting with long running responses.
With the alert model, we have effectively multiple channels in the same TCP stream that we can utilize to send a clear and independent error for the client. Much nicer overall, even if I say so myself.
And it just occurred to me that this mimics quite nicely the same approach that Zig itself uses for error handling .