Hot failures and high availability June 14, 2021 Author: Oren Eini, CEO RavenDB
Read the original blog post

Last week, Amazon had an outage in its Frankfurt region. Here is what they had to say about it:

We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.

That is of particular interest to us, because we have clients running RavenDB Cloud clusters in that region. Here are some of the alerts that we got when the incident happened:

[Image: the alerts received during the incident]

This is marked as a Disaster level event, because we lost all connectivity with the node and none of the redundant watchdogs were able to bring it back up.

Our operations team looked at the issue, figured out that this is an AWS outage that impacted us and then dropped the matter.

Wait a minute, dropped the matter?! What kind of a reaction is that from an operations team?

The right reaction. There wasn’t anything that we could have done, since the problem was out of our hands.

What does that mean for our customers? Well, they didn’t notice that anything happened. RavenDB was explicitly designed to survive just this sort of incident.

On the cloud, we run each cluster with three nodes, each in a separate availability zone. A single node going down is a non-event: the rest of the cluster will just make a note of that, and clients will transparently fail over to the other nodes.
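To make the failover idea concrete, here is a minimal sketch in Python. It is not RavenDB's client code or its API. The node names, the `NodeDown` exception and the `fake_send` transport are all hypothetical stand-ins, and the only assumption is that a client knows the cluster's nodes and can try the next one when a request fails.

```python
from typing import Callable, Sequence


class NodeDown(Exception):
    """Raised by the (hypothetical) transport when a node cannot be reached."""


def request_with_failover(nodes: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each node in order and return the first successful response."""
    errors = []
    for node in nodes:
        try:
            return send(node)
        except NodeDown as exc:
            # Note the failure and move on to the next node in the list.
            errors.append(f"{node}: {exc}")
    raise RuntimeError("all nodes unreachable: " + "; ".join(errors))


# Simulated three-node cluster, one node per availability zone (made-up names).
NODES = ["node-a.az1", "node-b.az2", "node-c.az3"]
DOWN = {"node-a.az1"}  # pretend the az1 node was lost to the outage


def fake_send(node: str) -> str:
    """Stand-in transport: fails for nodes marked as down, succeeds otherwise."""
    if node in DOWN:
        raise NodeDown("connection refused")
    return f"ok from {node}"


print(request_with_failover(NODES, fake_send))  # prints "ok from node-b.az2"
```

The caller never sees the failed attempt against the first node. It only gets an error if every node in the list is unreachable, which is why losing a single availability zone goes unnoticed by applications.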

This behavior is the basis for a lot of operations inside of RavenDB and RavenDB Cloud. For example, we routinely put ourselves in this position whenever we do a maintenance run or whenever a user wants to scale their systems up or down.

When the AWS outage ended, our internal systems brought the nodes back online and they were integrated back into their clusters automatically. All in all, that was pretty much a non-event for everyone, except for the fact that we suddenly got flooded with “the sky is falling” messages.