In the previous post, I talked a lot about the manner in which both client and server will authenticate one another safely and securely. The reason for all the problem is that we want to ensure that we are talking to the entity we believe we do, protect ourselves from man in the middle, etc. The entire purpose of the handshake exchange is to establish that the person on the other side is the right one and not a malicious actor (like the coffee shop router or the corporate firewall). Once we establish who is on the other side, the rest is pretty easy. Each side of the connection generated a key pair specifically for this connection. They then managed to send each other both the other side’s public key as well as prove that they own another key pair (trust in which was established separately, in an offline manner).
In other words, on each side, we have:
- My key pair (public, secret)
- Other side public key
With those, we can use key exchange to derive a shared secret key. The gist of this is that we know that this statement holds:
op(client_secret, server_public) == op(server_secret, client_public)
The details on the actual op() aren’t important for understanding, but I’m using sodium, so this is scalar multiplication over curve 25519. If this tells you anything, great. Otherwise, you can trust that people who do understand the math says that this is safe to do. Diffie-Hellman is the search term to use to understand how this works.
Now that we have a shared secret key, we can start sending data to one another, right? It would appear that the answer to that is… no. Or at least, not yet. The communication channel that we build here is based is built on top of TCP, providing two way communication for client and server. The TCP uses the stream abstraction to send data over the wire. That does not work with modern cryptographic algorithm.
How can that be? After is literally this thing called stream cipher, after all. If you cannot use a stream cipher for stream, what is it for?
A stream cipher is a basic building block for modern cryptography. However, it also has a serious problem. It doesn’t protect you from modification of the ciphertext. In other words, you will “successfully” decrypt the value and use it, even though it was modified. Here is a scary scenario of how you can abuse that badly.
Because of such issues, all modern cryptographic algorithms uses Authenticated Encryption. In other words, to successfully complete their operation, they require that the cipher stream will match a cryptographic key. In other words, conceptually, the first thing that a modern cipher will do on decryption is something like:
That isn’t quite how this looks like, but it is close enough to understand what is going on. If you want to look at how a real implementation does it, you can look here. The python code is nicer, but this is basically the same concept.
So, why does that matter for us? How does this relate to having to dealing with streams?
Consider the following scenario, in this model, in order to successfully decrypt anything, we first need to validate the MAC (message authentication code) for the encrypted value. But in order to do that, we have to have the whole value, not just part of that. In other words, we cannot use a real stream, instead, we need to send the data in chunks. The TLS protocol has the same issue, that is handled via the notion of records, with a maximum size of about 16KB. So a TLS stream is actually composed of records that are processed independently from one another. That also means that before you get to the TLS a buffered stream is a must, otherwise we’ll send just a few plain text bytes for a lot of cryptographic envelope. In other words, if you call tls.Write(buffer[0..4]), if you don’t have buffering, this will send a packet with a cryptographic envelope that is much bigger than the actual plain text value that you sent.
Looking at the TLS record layer, I think that I’ll adopt many of the same behaviors. Let’s consider a record:
So each record is composed of an envelope, that simply contains the length, then we have the cipher text itself. I’m intending to use the libsodium’s encrypted stream, because it lets me handle things like re-keying on the fly transparently, etc. We read the record from the network, decrypt and then need to decide what to do.
If this is an alert, we raise it to the user (this is critical for good error reporting). Note that in this way I can send (an encrypted) stream to the other side to give a good error for the caller. For data, we just pass it to the caller. Note that there is one very interesting aspect here. We have two Len fields. This is because we allow padding, that can help avoid attacks such as BREACH and mitigate traffic analysis. We ensure that the padding is always set to zero, similar in reason to the TLS model, to avoid mistakes and to force implementation correctness.
I think that this is enough theory for now. In my next post, I want to get to actually implementing this.
As usual, I would love to hear your feedback and comments.