Production Postmortem: The Spawn of Denial of Service
A customer contacted us to complain about a highly unstable cluster in their production system. The metrics didn’t support the situation, however. There was no excess load on the cluster in terms of CPU and memory, but there were a lot of network issues. The cluster got to the point where it would just flat-out be unable to connect from one node to another.
It was obviously some sort of a network issue, but our ping and network tests worked just fine. Something else was going on. Somehow, the server would get to a point where it would be simply inaccessible for a period of time, then accessible, then not, etc. What was weird was that the usual metrics didn’t give us anything. The logs were fine, as were memory and CPU. The network was stable throughout.
If the first level of metrics isn’t telling a story, we need to dig deeper. So we did, and we found something really interesting. Here is the total number of TCP connections on the server over time.
So there are a lot of connections on the system, which is choking it? But the CPU is fine, so what is going on? Are we being attacked? We looked at the connections, but they all came from authorized machines, and the firewall was locked down tight.
If you look closely at the graph, you can see that it hits 32K connections at its peak. That is a really interesting number, because 32K is also the number of ephemeral port range values for Linux. In other words, we basically hit the OS limit for how many connections could be sustained between a client and a server.
The question is what could be generating all of those connections? Remember, they are coming from a trusted source and are valid operations. Indeed, digging deeper we could see that there are a lot of connections in the TIME_WAIT state.
We asked to look at the client code to figure out what was going on. Here is what we found:
There is… not much here, as you can see. And certainly nothing that should cause us to generate a stupendous amount of connections to the server. In fact, this is a very short process. It is going to run, read a single line from the input, write a document to RavenDB, and then exit.
To understand what is actually going on, we need to zoom out and understand the system at a higher level. Let’s assume that the script above is called using the following manner:
What will happen now? All of this code is pretty innocent, I’m sure you can tell. But together, we are going to get the following interesting behavior:
For each line in the input, we’ll invoke the script, which will spawn a separate process to connect to RavenDB, write a single document to the server, and exit. Immediately afterward, we’ll have another such process, etc.
Each of those processes is going to have a separate connection, identified by a quartet of (src ip, src port, dst ip, dst port). And there are only so many such ports available on the OS. Once you close a connection, it is moved to a TIME_WAIT mode, and any packets that arrive for the specified connection quartet are going to be assumed to be from the old connection and drop. Generate enough new connections fast enough, and you literally lock yourself out of the network.
The solution to this problem is to avoid using a separate process for each interaction. Aside from alleviating the connection issue (which also requires non trivial cost on the server) it allows RavenDB to far better optimize network and traffic patterns.