Production postmortem: The encrypted database that was too big to replicate
A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.
Typical reasons for replication to stall usually involve connectivity issues, but in this case, we could see that there was no such sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t be the case, however. If we have nothing to send, there shouldn’t be a batch sent over the wire. That was the only hint of something wrong.
We also looked into what information RavenDB could tell us about the system, and noticed that we have a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot. That doesn’t really relate so much to replication, however. It would surely slow it down, but it should work.
Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, that was… strange. Cross referencing the log statement with the source code revealed that RavenDB decided that it is sending too much in the batch and aborted it. But… it isn’t sending anything in the batch.
What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted in both disk and memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold that in locked memory (so it wouldn’t be paged to disk). As a result of that, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side.
There is a small caveat here, each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.
By default, we’ll stop a replication batch in an encrypted database after we locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check to close the replication batch very early in the process.
The sequence of operations was then:
- Start a replication batch
- Load the first document to send
- Realize that we locked too much memory and close the batch
- Send a zero length batch
Rinse and repeat, since we can’t make any forward progress.
The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.
We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.