Production postmortem: Do you trust this server?

A customer called us with a problem. They set up a production cluster successfully, they could manually verify that everything is working, except that it would fail when they try to connect to it via the client API.

The error in question looked something like this:

CertificateNameMismatchException: You are trying to contact host rvn-db-72 but the hostname must match one of the CN or SAN properties of the server certificate: CN=rvn-db-72, OU=UAT, OU=Computers, OU=Operations, OU=Jam, DC=example, DC=com, DNS Name=rvn-db-72.jam.example.com

That is… a really strange error. Because they were accessing the server using: rvn-db-72.jam.example.com, and that was the configured certificate for it. But for some reason the RavenDB client was trying to connect directly to rvn-db-72. It was able to connect to it, but failed on the hostname validation because the certificates didn’t match.

Initially, we suspected that there is some sort of a MITM or some network appliance that got in the way, but we finally figured out that we had the following sequence of events, shown in the image below. The RavenDB client was properly configured, but when it asked the server where the database is, the server would give the wrong URL, leading to this error.

This deserves some explanation. When we initialize the RavenDB client, one of the first things that the client does is query the cluster for the URLs where it can find the database it needs to work with. This is because the distribution of databases in a cluster doesn’t have to match the nodes in the cluster.

Consider this setup:

In this case, we have three nodes in the cluster, but the “Orders DB” is located only on two of them. If we query the rvn-db-72 database for the topology of “Orders DB”, we’ll get nodes rvn-db-73 and rvn-db-74. Here is what this will look like:

Now that we understand what is going on, what is the root cause of the problem?

A misconfigured server, basically. The PublicServerUrl for the server in question was left as the hostname, instead of the full domain name.

This configuration meant that the server would give the wrong URL to the client, which would then fail.

This is something that only the client API is doing, so the Studio behaved just fine, which made it harder to figure out what exactly is going on there. The actual fix is trivial, naturally, but figuring it out took too long. We’ll be adding an alert to detect and resolve misconfigurations like that in the future.

Production postmortem: Do you trust this server?

Woah, already finished? 🤯

Related Articles

Using RavenDB Queue ETL as Event Outbox

Begin analysis with OLAP ETL

RavenDB Ansible Collection: New Features (12/25)