On fixing a bug (and all its siblings) with a forward looking view

We ran into a strange situation deep in the guts of RavenDB. A cluster command (the backbone of how RavenDB coordinates actions in a distributed cluster) failed because of an allocation failure. That is something we are ready for, since RavenDB is a robust system that handles such memory allocation failures. The problem was that this was a persistent allocation failure. Looking at the actual error explained what was going on. We allocate memory in units that are powers of two, and we had an allocation request that would overflow a 32-bit integer.
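To make the failure mode concrete, here is a minimal sketch of power-of-two rounding with an explicit 32-bit check. It is not RavenDB's allocator: the function names are made up for illustration, and a signed 32-bit size is an assumption. Python integers never overflow, so the sketch raises an error where a fixed-width integer would wrap around.

```python
# Assumption: sizes are tracked in a signed 32-bit integer.
INT32_MAX = 2**31 - 1


def next_power_of_two(n: int) -> int:
    """Round n up to the nearest power of two (n must be positive)."""
    if n <= 0:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def allocation_size(requested: int) -> int:
    """Return the power-of-two allocation size, or raise if it cannot fit in 32 bits."""
    size = next_power_of_two(requested)
    if size > INT32_MAX:
        raise OverflowError(
            f"request of {requested:,} bytes rounds up to {size:,} bytes, "
            f"which exceeds the 32-bit limit of {INT32_MAX:,}"
        )
    return size
```

Note that the rounding is what pushes the request over the edge: anything above 1GB rounds up to 2GB, which is already one byte more than a signed 32-bit integer can hold. Retrying cannot help, which is why the failure was persistent.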

Let me reiterate that: we had a single cluster command that would need more memory than can fit in a 32-bit integer. The size of a cluster command isn’t strictly limited, but a 1MB cluster command is huge as far as we are concerned. Seeing something that exceeds the GB mark was horrifying. The actual issue was somewhere completely different: there was a bug that caused quadratic growth in the size of a database record. This post isn’t about that problem; it is about the fix.

We believe in defense in depth for such issues. So aside from fixing the actual cause of this problem, the question was how we could prevent similar issues in the future. We decided to place a reasonable size limit on cluster commands, and we chose 128MB as the limit (this is far higher than any expected value, mind). We chose that value because it is big enough to be outside anyone’s actual usage, yet small enough that we can increase it if we need to. That means it needs to be a configuration value, so the user can modify it in place if needed. The idea is that we’ll stop the generation of a command of this size before it hits the actual cluster and poisons it.
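As a rough sketch of what a configurable limit with a 128MB default could look like, consider the following. The setting name and both functions are hypothetical and only illustrate the idea; they are not RavenDB's actual configuration API.

```python
DEFAULT_MAX_COMMAND_SIZE = 128 * 1024 * 1024  # 128MB default, as described above

# Hypothetical setting name, used here only for illustration.
MAX_SIZE_SETTING = "max_cluster_command_size_mb"


def resolve_max_command_size(settings: dict) -> int:
    """Return the limit in bytes: the configured value if present, else the default."""
    configured_mb = settings.get(MAX_SIZE_SETTING)
    if configured_mb is None:
        return DEFAULT_MAX_COMMAND_SIZE
    return int(configured_mb) * 1024 * 1024


def is_within_limit(command_size: int, settings: dict) -> bool:
    """Check a command's size before it is handed to the cluster."""
    return command_size <= resolve_max_command_size(settings)
```

With no override, a 200MB command is rejected before it is ever sent. Setting the hypothetical value to 256 lets the same command through without a code change, which is the point of making the limit a configuration value.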

Which brings me to this piece of code, which was the reason for this blog post:

This is where we actually throw the error if we find a command that is too big (the check is done by the caller, not important here).

Looking at the code, it does what is needed, but it is missing a couple of really important features:

We mention the size of the command, but not the actual size limit.
We don’t mention that this isn’t a hard coded limit.

The fix here would be to include both those details in the message. The idea is that the user will not only be informed about what the problem is, but also be made aware of how they can fix it themselves. No need to contact support (and if support is called, we can tell right away what is going on).
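An error message along those lines might be built as in the sketch below. The exception type, function name, and setting name are all hypothetical; the point is only that the message carries the actual size, the limit, and the fact that the limit can be changed.

```python
class CommandTooLargeError(Exception):
    """Hypothetical exception type for an oversized cluster command."""


def build_too_large_message(command_name: str, size: int, limit: int, setting: str) -> str:
    """Describe the problem and how the user can fix it themselves."""
    return (
        f"The command '{command_name}' is {size:,} bytes, which exceeds the "
        f"maximum allowed size of {limit:,} bytes. This limit is configurable: "
        f"increase the '{setting}' setting if a command this large is expected."
    )


def throw_command_too_large(command_name: str, size: int, limit: int, setting: str) -> None:
    """Raise the error; the size check itself is assumed to be done by the caller."""
    raise CommandTooLargeError(build_too_large_message(command_name, size, limit, setting))
```

Anyone reading that message, whether the user or a support engineer, can see how far over the limit the command was and which knob to turn.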

This idea, the notion that we should be quite explicit about not only what the problem is but also how to fix it, is very important to the overall design of RavenDB. It allows us to produce software that is self-supporting. Instead of ErrorCode: 413, you get not only the full details, but also how you can fix it.

Admittedly, I fully expect to never ever hear about this issue again in my lifetime. But in case I’m wrong, we’ll be in a much better position to respond to it.
