Complex decisions and random chance
An important aspect of RavenDB is that It Just Works. In other words, we aim to make RavenDB a really boring database, one that requires very little of your attention. As it turns out, making that happen can get really complex. The topic for today is role assignment in a distributed cluster after recovering from a failure.
Let’s consider the case of a cluster hosting 10 databases, spread among three nodes (A, B and C). The databases are replicated on all three nodes, with A being the primary for 4 of them, and B and C being the primary for 3 each.
And the following sequence of events:
- Node A goes down.
- The RavenDB cluster will react to that by re-shuffling responsibilities in the cluster for the databases where A was the primary node of contact.
- Clients will observe no change in behavior and continue to operate normally. All the databases are now served by nodes B and C.
- Node B goes down.
- At this point, there is no longer a cluster (a cluster requires a majority of the nodes to be up, in this case, two). Node C and the remaining clients follow the disaster plan that was created (and distributed to the clients) ahead of time by the cluster.
- All the clients now talk to node C for all databases.
So far, this is business as usual and how RavenDB works to ensure that even in the presence of failure, we can continue working normally. Node C is the only remaining node in the cluster, and it shoulders the load admirably.
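The majority rule from the sequence above can be sketched in a few lines. This is only an illustration of the arithmetic, not RavenDB code; the function name `has_quorum` is made up for this example.

```python
def has_quorum(nodes_up: int, cluster_size: int) -> bool:
    """Return True while a strict majority of the cluster's nodes is up."""
    return nodes_up >= cluster_size // 2 + 1


# The three-node cluster from the example: A, B and C.
for nodes_up in (3, 2, 1):
    state = "cluster is operational" if has_quorum(nodes_up, 3) else "fall back to the disaster plan"
    print(f"{nodes_up} node(s) up: {state}")
```

With three nodes, losing A still leaves a majority of two, while losing B as well leaves node C on its own, which is the point where the pre-distributed disaster plan takes over.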
At this point, the admin takes pity on our poor node C and restarts nodes A and B. The following operations then take place:
- The cluster can communicate between the nodes and start monitoring the system.
- Nodes A and B are updated from node C with all the changes that they missed.
- The cluster will update the topology of the databases to reflect that nodes A and B are now up.
What we need to do now is to re-assign the primary role for each database across the cluster. And that turns out to be a surprisingly tricky thing to do.
We sat down and wrote a list of the potential items we need to consider in this case:
- The number of databases already assigned to the cluster node.
- The number of requests / sec that the database gets vs. the already assigned databases on the node.
- The working set that the database requires to operate vs. the existing working set on the node.
- The number of IOPS that are required vs…
There is also the need to take into account the hardware resources for each node. They are usually identical, but they don’t have to be.
As we started scoping the amount of work that we would need to make this work properly, we were looking at a month or so. In addition to actually making the proper decision, we first need to gather the relevant information (and do so across multiple operating systems, containers, bare metal, etc.). In short, that is a lot of work to be done.
We decided to take a couple of steps back, before we had to figure out where we’ll put this huge task in our queue. What was the actual scenario?
If there is a single database on the cluster (which isn’t uncommon), then this feature doesn’t matter. A single node will always be the primary anyway. This means that this distribution of primary responsibility in the cluster is not actually a single step operation; it only makes sense when you apply it to multiple databases. And that, as it turns out, gives us a lot more freedom.
We were able to implement the feature in the same day, and the solution was quite simple. Whenever the cluster detected that a database’s node has recovered, it would run the following command:
primary = topology[rand(topology.Length)];
This is the only thing that we need in order to get a fair distribution of the primary role across the nodes.
The way it works is simple. As each database becomes up to date, the cluster will pick a primary for it at random. Because we have multiple such cases, sheer chance is going to ensure that they will go to different nodes (and thus rebalance the load across the cluster).
It is simple, it works quite nicely and it is a lot less scary than having to implement a smart system. And in our testing, it worked like magic.
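To get a feel for why sheer chance is enough, here is a small simulation of the same idea. It is a sketch under assumptions, not the actual RavenDB implementation: the database names, the `assign_primaries` and `chance_all_on_one_node` helpers, and the seed are all invented for the example, and every database is assumed to have the same three-node topology.

```python
import random
from collections import Counter
from fractions import Fraction

NODES = ["A", "B", "C"]  # the three nodes from the example


def assign_primaries(databases, topology, rng):
    """Pick a primary for each database uniformly at random from its topology."""
    return {db: topology[rng.randrange(len(topology))] for db in databases}


def chance_all_on_one_node(db_count, node_count):
    """Exact probability that every database ends up with the same primary."""
    return node_count * Fraction(1, node_count) ** db_count


databases = [f"db{i}" for i in range(1, 11)]  # hypothetical names for the 10 databases
rng = random.Random(2024)  # arbitrary seed, used only to make the run repeatable
primaries = assign_primaries(databases, NODES, rng)

print(Counter(primaries.values()))      # how many primaries each node received
print(chance_all_on_one_node(10, 3))    # 1/19683
print(chance_all_on_one_node(1, 3))     # 1
```

A random pick does not guarantee a perfectly even split on any given run, only an even split on average. What it does make very unlikely is the bad case: with 10 databases and 3 nodes, the chance that a single node ends up as primary for everything is 1 in 19,683. With a single database that chance is 1, which is the same observation as before: the feature only matters once there are multiple databases.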