Postmortem: Partial RavenDB Cloud outage
On Friday, Sep 24, 2021, our monitoring tool started emit alerts. Our customers’ nodes were losing connectivity to one another. Not all of them, but a significant percentage. The issue impacted nodes in all availability zones, in all regions and across cloud providers. We are running isolated clusters, and aside from a total breakdown of the Internet, there shouldn’t be a scenario where this happens. Looking into the details, it soon became clear that the issue wasn’t with the infrastructure or the network. Actual connectivity between the nodes was fine, they were able to reach one another, but they weren’t trusting one another. Following a checklist, we verified that this isn’t an issue with a specific release or version, the issue was affecting (some) nodes across all our supported versions, including many that has been running in production for years without issue.
TLDR; If you are using RavenDB and you run into SSL / TLS connection issues such as:System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain
You can resolve the issue by running the following command on your instances:rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfx rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx
RavenDB uses X509 certificates for authentication. One of the primary reasons that it is doing so is that reduce the dependency on external actions (like LDAP, for example). We were able to quickly figure out what the root cause was. Clients were not trusting the servers. Usually, for authentication, it is the other way around, but in this case, somehow, the client connection would detect that the remote server is using an invalid certificate.
We are using Let’s Encrypt certificate, like the vast majority of the web, and a quick look on our own showed us nothing. Our browsers certainly thought that the certificate was valid, and locally everything worked. On the production systems (and that impacted a few dozen clusters), on the other hand, they were flat out rejected. That was not supposed to happen. The certificate were valid, and we would get a different error if they were expired. One of the good things about using X509 certificates is that it is easy to debug independently. Here is an example of what we saw:
ayende@oren-pc:~$ curl https://a.free.ayende.ravendb.cloud curl: (60) SSL certificate problem: unable to get local issuer certificate More details here: https://curl.haxx.se/docs/sslcerts.html curl failed to verify the legitimacy of the server and therefore could not establish a secure connection to it. To learn more about this situation and how to fix it, please visit the web page mentioned above.
Digging a bit deeper, we can use OpenSSL to see the real details:
ayende@oren-pc:~$ openssl s_client -connect a.free.ayende.ravendb.cloud:443 CONNECTED(00000003) depth=1 C = US, O = Let's Encrypt, CN = R3 verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 CN = *.free.ayende.ravendb.cloud verify return:1 --- Certificate chain 0 s:CN = *.free.ayende.ravendb.cloud i:C = US, O = Let's Encrypt, CN = R3 1 s:C = US, O = Let's Encrypt, CN = R3 i:O = Digital Signature Trust Co., CN = DST Root CA X3 ---
So the issue is that is is now no longer trusting Let’s Encrypt? What the hell is going on. The start of the answer is here, the root certificate for Let’s Encrypt will expire at the end of the month. That was known well in advance, and there is a new root certificate in place that should be in pretty much all devices that you use (except very old Androids, which aren’t relevant here).
The problem is that for some reason, which if I understand correctly has to do with old versions of OpenSSL, the ca-certificates package, which is a common way to maintain a list of trusted root certificates has removed the soon to be expiring (but not yet actually expired) from the list of trusted root certificates. That was pushed (as a security update), to many instances. It is at this point that problem started:
- Established connections are fine and worked just great, since they already authenticated and were working without issue.
- New connections, on the other hand, would fail, because the trusted root was now untrusted.
As you can imagine, this caused somewhat of an uproar for our operations team.
The killer thing for us, I have to say, is that we tested this scenario before hand. There has been plenty of time to get ready, and our ops team have made sure that we provisioned the certificates with the new root. This should have been a non event, given that we prepared specifically for that.
However… what happened was that on some portion of our machines, the certificate that was used was using the old chain, the one that is no longer valid. At this point we were looking at having to redeploy certificates to hundreds of clusters and we split the team to work on things in parallel. While one part of the team prepped the emergency patch to all instances, the other half focused on how to properly generate the right certificate.
That is a bit complex, we intentionally do not keep the private keys of the clusters on our management plane, they are located solely on customer instances, so we first had to fetch them from the store, generate a new Let’s Encrypt certificate and then distribute it back to the right location. Tricky bit of work, but we are used to working at this scale.
Except… that didn’t work. To be rather more exact, the certificates that we got from Let’s Encrypt were always using the old root certificate, never the new one. There is a way to get the specific chain from Let’s Encrypt, but our client for interacting with Let’s Encrypt did not support it. It shouldn’t matter, however, all recent (6 months or so, at least) certificates from Let’s Encrypt are signed using the new chain, so how can this be?
Here is how you’ll typically see a discussion of cross-signed certificates:
Basically, you have a single certificate that is signed using both chains. Except, that isn’t how it works. A certificate can be signed by a single issuer. The key here is that the issuer is not a certificate. The issuer is the public key + name of the certificate that provided the signature. The key here is that you can get the same public key + name to be signed by multiple parties. Which means, of course, that when you serve the certificate to a client, you typically need to tell it what certificate signed this one. I’ll have a more detailed post on that topic, but for now, assume that when the server reply to a client, it sends either:
- Our Certificate + Issuer (green)
- Our Certificate + Issuer (red)
Note that the client that receive the certificate is free to absolutely ignore any other certificate that the server it sending, but most of the time, it will use that to follow the chain of certificates and find a trusted root.
In our case, we could see that we were serving the wrong certificate chain. We were serving the (valid) server certificate, and an intermediate that pointed to a removed root. That meant that the client would follow the chain we provided and then would find a missing root, leading to a trust violation and an aborted connection.
That, in a word, sucks. Because we couldn’t figure out what was going on. Given the scope of the problem and the number of impacted customers, we needed to figure out a way to resolve that as soon as possible.
Going back to the Let’s Encrypt certificates that we were generating, what we actually get back from Let’s Encrypt is actually a certificate that contains:
- Server Certificate (a.free.ayende.ravendb.cloud, for example)
- Intermediate certificate R3 signed by DST Root CA X3 – that one is the problematic one
- Intermediate certificate R3 signed by ISRG Root X1 – that is the one we want
We verified that not only are we getting the right certificate, we have been getting the right certificate for quite some time now, including on all the impacted servers.
A note on the state of the system at that frame. We have quite a few customers running on the cloud, and the problem only impacted a small percentage at the time. Given features such as connection pooling, actually creating new connection tends to be rare. But the longer we let it go, the more chance is that a connection would break and we can’t renew those.
What is more, that impacts both server to server communication in the cluster and clients connecting to the server. That means that this also happened to customers on their application (although it was pretty rare), the major impact that we could see is that the cluster connections broke, which meant that we effectively had a network split, even though the network itself was fine.
RavenDB is actually designed to operate in such an environment, and the nodes carried on their task. Reads and writes to the database worked as usual, but cluster wide tasks (such as deploying indexes) would fail.
So the sky wasn’t falling yet, but there were some dark clouds in the horizon. At this point, we have managed to figure out why the certificate isn’t trusted anymore, because of the automatic update for the ca-certificates package. The test environment showed that reinstalling the removed certificate would restore proper behavior. After verifying that this indeed fixes the issue, we rolled out across our fleet.
Problem solved? Not so fast. We plugged the hole in the ship, but in a but a few days, we are going to be looking into the exact same issue, because the certificate will actually expire. We had to figure out what happened and why RavenDB wasn’t serving the right certificate chain in this context.
Cue putting of spelunking equipment and digging into pretty much every code base in site. We scoured OpenSSL, .NET framework and dug through documentation about the certificate infrastructure in Windows. We couldn’t figure out what was going on. This should work. In fact, all our tests showed us that it is working, except it didn’t.
At this point, we believe that we figured it out, and this is pretty insane reason. Four years ago, we run into a serious problem inside of RavenDB when running in a secured environment. If the certificate chain wasn’t available on the host, it would download that so it could serve the appropriate chain when doing TLS handshake. To avoid this issue, RavenDB will build the certificate chain once on startup, and then store that in the CA X509Store for the RavenDB user. That allowed us to remove a few remote calls on connection startup and made us more resilient when we are facing network issues.
The actual problem we were trying to solve is that occasionally, RavenDB would fail when (a completely random, as far as we are concerned) server (which is serving the missing intermediate certificate) is missing. Registering the certificate ourselves in the RavenDB’s user store was a way to avoid this issue entirely. It also falls in line with the It Just Works model that we have. The other alternative would be to ask users to setup the full certificate chains locally (for each platform and environment). We rather handle that ourselves and not have to have another item in the checklist.
That said, this created an interesting issue. On Linux machines, there are now multiple certificate stores that are being used. There is the .NET X509 certificate store, which is stored at ~/.dotnet/corefx/cryptography/x509stores/ca/ and there is the system store, which is stored in /etc/ssl/certs (usually).
The ca-certificates update, however, caused an interesting problem. The relevant certificate was removed from the system store, but it was still on the .NET store. If it would have expired, as we expected it to, the situation would be just fine. The stores would agree and serve the new chain without issue.
However, given that it didn’t expire yet, we had a discrepancy. .NET will obviously favor its X509 store if it can find the certificate there, so it kept serving the old chain. The solution to that it to manually clear the old certificate from the cache as well, you can do that by issuing the following commands:
If you are running on windows, you can run the following commands to achieve the same goal:
.PsExec.exe -u "nt authorityLocal Service" powershell.exe
Get-ChildItem cert:/CurrentUser/CA/48504E974C0DAC5B5CD476C8202274B24C8C7172 | Remove-Item
That would remove the relevant chain from the store and make RavenDB use the right chain. There is no need to restart the service, it should automatically fix itself.
For the future, we are going to take a long look into our policies in this matter. Given the advent of Docker & Cloud services, we feel that the actual management of the entire environment should be left to the environment, not to the RavenDB process itself.