Applying RavenDB Cloud wisdom to the Kubernetes Operator

by Szymon Kulec

Writing a Kubernetes Operator is no easy feat. After all, the ultimate goal is a fully automated process that manages and deals with all the complexities of the very thing you want to deploy. The Operator Framework certainly helps a lot: it provides tools, best practices, and a capability model describing what an operator should provide. Still, the heavy lifting falls on the implementer of the operator, and there are a lot of things to address.

Keeping the lights on

There are many aspects to discuss when it comes to implementing a K8S Operator: stateful vs. stateless, ingress, IP management. You name it, K8S demands it! One particularly interesting challenge, in a world where software is released so frequently, is keeping the deployed service always on. After all, you use Kubernetes not only to manage scale but also to keep services always green.

Ensuring that a scaled-up service stays green is much easier for stateless systems. Something purely computational can be deployed with ease. Stateful workloads in a stateless world are hard to manage, especially since some of them form clusters of their own, running protocols like Paxos or Raft. This makes things much more complicated.

Cluster of N

RavenDB by design is a clustered system. To make sure that the cluster perceives itself as a cohesive whole, it utilizes a strong consistency protocol. In the world of distributed systems, maintaining agreement across multiple independent nodes is paramount, and RavenDB achieves this through a consensus protocol called Raft.

The core idea behind such protocols is that every node connected to the cluster must agree on the state of the data and the overall cluster topology. This consensus protocol covers the C and P of CAP: Consistency and Partition tolerance. Availability is left out, as you can only pick two.

Does that mean RavenDB has no availability? No, RavenDB is always available. It will always accept your writes, serve your queries, and stay operational in general. The consensus protocol is required only for cluster-wide operations, which are not that frequent.

Now, a cluster of N nodes needs a majority (more than half of the nodes) to proceed with consensus; put the other way around, it can tolerate losing at most (N-1)/2 nodes, rounding down. This is why RavenDB always runs an odd number of nodes: adding an even node gives you no extra fault tolerance. It also means that as long as you take down no more than that minority while upgrading, even cluster-wide operations remain possible. So there is an upper limit on the number of nodes we can upgrade at the same time. But should we push that hard?
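To make the arithmetic concrete, here is a tiny illustration in Go (purely a sketch, not operator code; the function name is made up). For 3 nodes the quorum is 2 and you can lose 1; for 5 nodes the quorum is 3 and you can lose 2.

```go
package quorum

// QuorumSize returns how many nodes of an n-node Raft cluster must agree
// (the majority), and how many node failures the cluster can tolerate.
func QuorumSize(n int) (majority, tolerated int) {
	majority = n/2 + 1      // e.g. 2 of 3, 3 of 5
	tolerated = n - majority // e.g. 1 of 3, 2 of 5; equals (n-1)/2
	return
}
```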

Wisdom from the cloud we own

One thing we haven’t mentioned yet is that the Kubernetes Operator is not the first time RavenDB has had to understand an environment far from on-premises land. We already have an environment that predates the Kubernetes Operator work and that our customers use a lot: RavenDB Cloud.

RavenDB Cloud provides you with clusters on the public cloud infrastructure of your choice. You point at Azure, AWS, or GCP, select a few options, and delegate the cluster management, including upgrades. This effectively means no operational overhead for you and a bit more for us 🙂 It also means that the cluster upgrade scenario has already been solved by our cloudy engineers; otherwise we wouldn’t be able to run the cloud effectively. So how do we roll out upgrades?

Deployment at the gates

Upgrading a clustered database system is a deceptively complex task. While it might appear as simple as changing an image tag, doing so casually can rapidly lead to undesirable outcomes. A single RavenDB node restart, by itself, is not the primary risk; the danger lies in the pre-existing state of the cluster at the moment of the restart. What if one node is already exhibiting unhealthy behavior, cluster communication is degraded, or a specific database has an unnoticed topology issue?

This was the first thing that a simple RavenDB Helm chart wasn’t capable of handling, increasing our appetite for a more intelligent solution than just fire-and-forget bundles of YAML files.

To address these issues, the Operator performs upgrades serially, one node at a time. It moves on to the next node only once the previous one has been properly upgraded. Upgrades are protected by strict safety gates. What are they, and how are they applied?

The operator enforces this safety via a family of crucial checks performed before and after each node transition:

  1. Node Liveness Check: A quick liveness probe to confirm the target node’s basic health.
  2. Cluster Connectivity Check: Verification that all essential cluster links remain healthy.
  3. Database Groups Availability Check (Excluding Target Node): For every replicated database, the operator confirms that the cluster can continue to serve the database without relying on the node currently being targeted for upgrade. This ensures data access and consistency are maintained throughout the process.
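To make the flow concrete, here is a rough sketch in Go of what such a guarded, serial rollout might look like. The `ClusterClient` interface and all of its method names are hypothetical stand-ins, not the operator’s actual API:

```go
package upgrade

import (
	"context"
	"fmt"
)

// ClusterClient is a hypothetical abstraction over the RavenDB cluster API;
// the real operator's types and method names differ.
type ClusterClient interface {
	CheckNodeAlive(ctx context.Context, node string) error
	CheckClusterConnectivity(ctx context.Context) error
	CheckDatabaseGroupsAvailableWithout(ctx context.Context, node string) error
	UpgradeNode(ctx context.Context, node string) error
	WaitUntilHealthy(ctx context.Context, node string) error
}

// RollOutUpgrade upgrades nodes one at a time, verifying the safety gates
// before touching each node and confirming its health afterwards.
func RollOutUpgrade(ctx context.Context, nodes []string, c ClusterClient) error {
	for _, node := range nodes {
		// Gate 1: node liveness.
		if err := c.CheckNodeAlive(ctx, node); err != nil {
			return fmt.Errorf("liveness gate failed for %s: %w", node, err)
		}
		// Gate 2: cluster connectivity.
		if err := c.CheckClusterConnectivity(ctx); err != nil {
			return fmt.Errorf("connectivity gate failed: %w", err)
		}
		// Gate 3: database groups stay available without the target node.
		if err := c.CheckDatabaseGroupsAvailableWithout(ctx, node); err != nil {
			return fmt.Errorf("database availability gate failed for %s: %w", node, err)
		}

		// Replace the node with the new image, then wait until it rejoins
		// the cluster and reports healthy before moving on.
		if err := c.UpgradeNode(ctx, node); err != nil {
			return fmt.Errorf("upgrading %s: %w", node, err)
		}
		if err := c.WaitUntilHealthy(ctx, node); err != nil {
			return fmt.Errorf("waiting for %s after upgrade: %w", node, err)
		}
	}
	return nil
}
```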

This robust, guarded approach ensures that the “wisdom from the cloud we own” regarding zero-downtime rolling upgrades is baked directly into the Kubernetes Operator. It also means that if you use our operator, upgrading RavenDB is as simple as updating the image tag in the spec and applying the manifest. That’s it. You’ve just upgraded your RavenDB using the RavenDB Kubernetes Operator. Enjoy!
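For illustration, such a manifest change could look roughly like the following; the `apiVersion`, `kind`, and field names here are placeholders, so check the operator’s CRD documentation for the real schema:

```yaml
# Illustrative only - the actual CRD schema of the RavenDB operator may differ.
apiVersion: ravendb.example/v1        # placeholder group/version
kind: RavenDBCluster                  # placeholder kind
metadata:
  name: my-ravendb
spec:
  nodeCount: 3
  image: ravendb/ravendb:6.0-latest   # example tag; bumping it triggers the rolling upgrade
```

Applying the updated manifest with `kubectl apply` hands the rest of the rollout over to the safety gates described above.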

Summary

Are we done here? Not yet. There are a few more things needed to move our operator to the next levels of the Operator capability scale, such as automatic storage size management. Still, the expertise gained from managing our Cloud, as well as general engineering experience, has set us on the right track. And you as well, whether you use our RavenDB Cloud or prefer to Kubernetes it away!

If you run a K8S cluster and haven’t tried us yet, give it a shot with https://github.com/ravendb/ravendb-operator. If you prefer being more cloudy, https://ravendb.net/cloud it is. In either case, join our Discord!

