When a cloud platform offers 99.99% uptime, it means that if there are 2 million database instances running in a single cloud data center, on average 200 of them will be down at any given time.
Russian Roulette may not be one of the features your cloud platform offers, but it's there and you have to manage it.
On the moon, failure was not an option, but in the cloud it's guaranteed.
Today, RavenDB CEO Oren Eini will talk about the most common failures applications encounter on the cloud and how to handle them... or how to rest easy knowing that your RavenDB Cloud managed database is handling them for you.
We can never assume that a failure is something that we can avoid or prevent or in some manner escape. Failure is something that we are just going to have to deal with.
What happens when you get the seven nines uptime guarantee? That's 99.99999%, or a 1 in 10 million chance of failure. Sounds good, right?
If you are on a cloud platform with 1,000 machines, that adds up to about 52 minutes of machine downtime per year across the fleet. What if you have 10,000 machines? You can suffer more than 8 hours of downtime per year. Now, if all 10,000 machines are doing the same thing, that's fine. You have fault tolerance. But what if each of them is doing something unique? Someone will be stuck, and if other machines rely on that machine or person to do their job, the downtime will spread throughout your system.
Even in the best-case scenario there is failure and you need to manage it.
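The arithmetic behind these figures is easy to check. The sketch below is illustrative only: the function names are invented for this example, and it assumes a 365-day year and that an N-nines guarantee means each machine is unavailable for a fraction 10^-N of the time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def unavailability(nines: int) -> float:
    """Fraction of time a single machine is expected to be down."""
    return 10.0 ** -nines


def fleet_downtime_minutes(nines: int, machines: int) -> float:
    """Expected downtime per year, summed across every machine in the fleet."""
    return unavailability(nines) * MINUTES_PER_YEAR * machines


def expected_down_now(nines: int, instances: int) -> float:
    """Expected number of instances that are down at any given moment."""
    return unavailability(nines) * instances


if __name__ == "__main__":
    print(f"{fleet_downtime_minutes(7, 1_000):.1f} minutes/year")      # about 52.6
    print(f"{fleet_downtime_minutes(7, 10_000) / 60:.1f} hours/year")  # about 8.8
    print(f"{expected_down_now(4, 2_000_000):.0f} instances down")     # about 200
```

The per-machine number stays tiny (roughly three seconds a year at seven nines), but the fleet-wide total grows linearly with the number of machines, which is the point of the argument above.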
What is Failure on the Cloud?
There are two specific types of failure we will focus on. The worst type of error is one where something is slow. What makes it worse is when it is slow only some of the time, say 1 in 100 requests, or even 1 in 1,000, and you need to figure out what is going on. In my experience, these are the nastiest bugs to resolve.
RavenDB's attitude is that failure is expected, handled, and mitigated. When we rebuilt RavenDB for version 4.0, we focused development on one question: where does it hurt?
In real life, we constantly expect failure. We don't try to prevent it, but we do try to work with it. When we get in a car, we expect to arrive at our destination safely, but we still put on seat belts.
We have a whole array of seat belts and airbags deployed throughout your system so that when failures do occur, you can keep going with minimal to zero damage. We are one of the few databases that build features into the database itself to handle failure and keep you running at all times.
What Can Fail?
The simple answer is: Everything can fail.
What are the common causes of most issues? The network. I/O. Cluster liveness.
Whenever we do something that is internal to RavenDB, I trust it. Whenever I have to leave RavenDB, like talking to the network, or even talking to the disk - I don't trust it.
The simplest form of failure is a network outage, where I cannot connect to something. You may sit there as the system tries again and again to reconnect. We handle this by reading the topology of the database and asking: which nodes does this database reside on?
RavenDB uses a multi-master system where you can perform both reads and writes on each node in your cluster. If you cannot connect to one node, RavenDB will look for other nodes that are working and connect to those.
On the server side, RavenDB will change the topology the moment one of the nodes fails and notify all the clients of the new topology. This works because a new node will likely replace the failed node, and the database will be replicated to it. This ensures that all your clients know exactly where they can turn for their data at all times.
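A minimal sketch of the client-side half of this idea follows. It is not RavenDB's client code: `execute_with_failover`, `AllNodesFailed`, `fake_request`, and the example URLs are hypothetical names chosen for illustration, and a production client would likely also refresh the topology and apply timeouts.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllNodesFailed(Exception):
    """Raised when no node in the known topology could serve the request."""


def execute_with_failover(nodes: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each node in topology order and return the first successful result."""
    errors: List[Tuple[str, Exception]] = []
    for node in nodes:
        try:
            return request(node)
        except ConnectionError as exc:
            # This node is unreachable; remember why and move on to the next one.
            errors.append((node, exc))
    raise AllNodesFailed(errors)


def fake_request(node_url: str) -> str:
    """Stand-in for a real network call; node A is pretending to be down."""
    if node_url == "https://a.example":
        raise ConnectionError("node A unreachable")
    return f"served by {node_url}"


if __name__ == "__main__":
    topology = ["https://a.example", "https://b.example", "https://c.example"]
    print(execute_with_failover(topology, fake_request))
    # served by https://b.example
```

Because every node accepts both reads and writes, the caller does not need to care which node answered, only that one did.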
If you have a lot of data to transfer, you can churn through your burstable performance within a minute. The user is left scratching their head: how did something that was working fine suddenly freeze up?
RavenDB handles this by giving you full metrics on the data you are moving, the memory that data consumes, and how long the transfer takes, so you know how to budget your credits. RavenDB gives you the internal metrics the cloud is skimpy with, so you know what your options are: provision a better machine, rework some of your queries, use less or more bandwidth, or change the network cable.
RavenDB is the only database to give you such a comprehensive and detailed breakdown of the internals of your data to diagnose issues like these and resolve them quickly.
We Don't Trust the Network
If you are running through a hotel network, someone is listening in on the line. If you are using a corporate network, you are never using it alone. To protect you, all your data is encrypted over the wire. We spent quite a lot of time ensuring that when you get errors relating to security, you get good errors. You get errors that say, "You tried to access this database, but you didn't use a client certificate."
Errors that tell you your system is secure.
When Your I/O is Slow or Corrupted
There are certain hard drives that cannot confirm that what was persisted to disk is what is actually on the disk. Many times, a database will tell you that it has persisted an operation to disk while it is still in the process of writing it. If the power goes out, you think you have the information while it is actually not there.
RavenDB will actively validate that data stored to the disk is on the disk and in the form that it was written. We are among the few databases that do this.
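The text does not describe how RavenDB performs this validation. One common general technique is to store a checksum next to the data, force the write out to the device, and verify the checksum whenever the data is read back. The sketch below shows only that generic technique; the function names, the `CorruptRecord` exception, and the one-record-per-file layout are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


class CorruptRecord(Exception):
    """Raised when the bytes read back do not match the stored checksum."""


def write_record(path: str, payload: bytes) -> None:
    """Write checksum + payload, then ask the OS to flush it to the device."""
    digest = hashlib.sha256(payload).digest()
    with open(path, "wb") as f:
        f.write(digest + payload)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the disk


def read_record(path: str) -> bytes:
    """Read the record back and verify it is what was written."""
    with open(path, "rb") as f:
        blob = f.read()
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise CorruptRecord(f"checksum mismatch in {path}")
    return payload


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "record.bin")
        write_record(path, b"order #42 shipped")
        print(read_record(path))  # b'order #42 shipped'

        # Simulate corruption by flipping one bit of the last byte on disk.
        with open(path, "r+b") as f:
            f.seek(-1, os.SEEK_END)
            last = f.read(1)
            f.seek(-1, os.SEEK_END)
            f.write(bytes([last[0] ^ 0x01]))
        try:
            read_record(path)
        except CorruptRecord as exc:
            print("detected:", exc)
```

The flush request alone still depends on the drive reporting honestly, which is why the check on read is the part that actually catches bad data.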
Memory is a finite resource. We have users running RavenDB in containers with a very limited amount of memory. We have users running us on tiny VMs, provisioning us on small cloud instances, and so on.
RavenDB was written using .NET Core, a managed runtime that uses garbage collection. One of the problems with garbage collection is something called GC pauses. When the GC needs to collect garbage, it may need to stop the world so it has the chance to collect all of the generated garbage.
That can be quite expensive. We have set RavenDB to retain virtual memory, so even if we are not using some memory right now, we keep it so we can use it soon. The idea here is that we don't want a ping-pong of allocating memory from the operating system, releasing it back to the operating system, allocating it again, and so on.
We started moving data directly into native memory, where RavenDB is the one responsible for data management. RavenDB is the one doing the allocations and tracking, and optimizations along those lines. This allows us to tailor the memory access and the memory allocation strategy toward what we want to do.
More importantly, it also means that we don't have to push so much memory into the GC. Most of the memory that RavenDB is using is native memory which means that the GC has a lot less work to do. That means that we have far fewer GC pauses and they don't tend to be something you have to pay attention to.
There are servers that have terabytes of memory, but RavenDB tends to run on machines that range from 512 MB to 256 GB. That means we really have to take care of efficiency. Now, efficiency goes both ways. If I am running on a 512 MB machine, I really care about all of the memory that I am using, and I want to use as little as possible because I don't have a lot of space to work with.
On the other hand, if I am running on a system with 256 GB of memory, I absolutely want to use as much of that as possible. Otherwise I have a user who just purchased a lot of memory and is using none of it.
RavenDB memory usage is divided into 3 categories. There is native memory, where we do most of the work. There is managed memory, which is used for a lot of the ongoing work; in a typical environment, the biggest consumer of managed memory in RavenDB is indexing. And there is memory-mapped data, which accounts for most of the data in RavenDB: data that sits on disk and has been mapped into memory so we can work with it directly.
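To make the memory-mapped category concrete, here is a small sketch using Python's standard `mmap` module. It illustrates only the general operating-system mechanism and says nothing about how RavenDB lays out its files; `read_through_mapping` and the file name are hypothetical.

```python
import mmap
import os
import tempfile


def read_through_mapping(path: str, offset: int, length: int) -> bytes:
    """Map a file into memory and read a slice of it without calling read()."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            f.write(b"hello, mapped world")
        print(read_through_mapping(path, 7, 6))  # b'mapped'
```

The mapped bytes live in the operating system's page cache rather than on the managed heap, so the operating system decides which pages stay resident.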
Notice that the topics we go over are all components of your computer. This is in line with our thinking that everything outside RavenDB is liable to fail. We cover all the bases, so that for as many types of failure as possible, RavenDB has a solution in place.
In a cluster, I need to ensure that I am able to communicate with other members in a timely fashion and do all the back-end operations. We created a set of tiers for this: what is most important, what is less important, and what can be done at some point.
We use thread priorities to tell the operating system how we want that to happen.
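The three tiers can be illustrated with a small sketch. Note the substitution: Python's standard library has no portable way to set operating-system thread priorities, so this example uses a priority queue instead, and it shows only the ordering idea, not RavenDB's mechanism. The `Tier` and `TieredQueue` names and the task labels are hypothetical.

```python
import heapq
import itertools
from enum import IntEnum
from typing import Callable, List, Tuple


class Tier(IntEnum):
    MOST_IMPORTANT = 0  # e.g. staying in touch with other cluster members
    LESS_IMPORTANT = 1
    WHENEVER = 2        # work that can be done at some point


class TieredQueue:
    """Runs queued tasks in tier order; first-in, first-out within a tier."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Callable[[], str]]] = []
        self._sequence = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, tier: Tier, task: Callable[[], str]) -> None:
        heapq.heappush(self._heap, (int(tier), next(self._sequence), task))

    def run_all(self) -> List[str]:
        results = []
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            results.append(task())
        return results


if __name__ == "__main__":
    queue = TieredQueue()
    queue.submit(Tier.WHENEVER, lambda: "compact old data")
    queue.submit(Tier.LESS_IMPORTANT, lambda: "update index")
    queue.submit(Tier.MOST_IMPORTANT, lambda: "answer cluster heartbeat")
    print(queue.run_all())
    # ['answer cluster heartbeat', 'update index', 'compact old data']
```

With real thread priorities the operating system scheduler does this ordering preemptively, whereas a queue like this one only reorders work that has not started yet.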