Accidental complexity explosion and document modeling concerns
In RavenDB Cloud, we routinely monitor the usage of the RavenDB clusters that our customers run. We noticed something strange in one of them: the system utilization didn’t match the expected load given the number of requests the cluster was handling. We talked to the customer to try to figure out what was going on, and we had the following findings (all details are masked, naturally, but the gist is the same).
- The customer stores millions of documents in RavenDB.
- The document sizes range from 20 KB to 10 MB.
- RavenDB supports documents up to 2 GB in size, but we consider multi MB documents problematic.
Let’s say that the documents in question were BlogPosts, each with a Comments array inside it. We analyzed the document data and found the following details:
- 99.89% of the documents had less than 100 comments in them.
- 0.11% of the documents (a few thousand) had more than 100 comments in them.
- 99.98% of the documents had less than 200 comments in them.
- 0.02% of the documents had more than 200 comments in them!
Let’s look at this 0.02%, shall we?
The biggest document exceeded 10 MB and had over 24,000 comments in it.
So far, that seems like nothing more than a suboptimal modeling decision, one that can lead to some inefficiencies for those 0.02% of cases. Not ideal, but no big deal. However, the customer had also defined the following index:
Take a minute to look at this index. Note the parts that I marked? This index is doing something that looks innocent. It indexes all the comments in the post, but it does so using a fanout: the index will contain as many index entries as the document has comments. We also need to store all of those comments in the index, on top of keeping them in the document itself.
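The customer’s actual index isn’t reproduced here, but the shape of the problem can be sketched in plain Python (all field names are hypothetical): a fanout map function emits one index entry per comment, and stores a copy of the comment in each entry.

```python
# A sketch of what a fanout index produces, with hypothetical field
# names. A real RavenDB index is defined in LINQ or JavaScript; this
# just models the entries the indexing engine has to generate.

def map_fanout(post):
    """Emit one index entry per comment -- the fanout."""
    for comment in post["Comments"]:
        yield {
            "PostId": post["Id"],
            "Author": comment["Author"],
            # Storing the comment in the entry duplicates data
            # that is already held in the document itself.
            "Stored_Comment": comment,
        }

post = {
    "Id": "posts/1",
    "Comments": [
        {"Author": "alice", "Text": "first"},
        {"Author": "bob", "Text": "nice post"},
        {"Author": "alice", "Text": "+1"},
    ],
}

entries = list(map_fanout(post))
print(len(entries))  # one entry per comment: 3
```

With 24,000 comments, that single document turns into 24,000 index entries, each carrying a stored copy of its comment.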
Let’s consider the cost of this index as far as RavenDB is concerned. Here is the cost per indexing run for different document sizes, from 0 to 25 comments.
This looks like a nice linear cost, right? O(N), as expected. But that is only the cost of a single operation. Let’s say that we have a blog post that we add 25 comments to, one at a time. What would be the total amount of work we’d need to do? Here is what that looks like:
Looks familiar? This is O(N^2) in all its glory.
Let’s look at the actual work done for different size documents, shall we?
I had to use log scale here, because the numbers are so high.
The cost of indexing a document with 200 comments is 20,100 operations. For a thousand comments, the cost is 500,500 operations. It’s over 50 million for a document with 10,000 comments.
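These totals follow directly from the arithmetic series: adding comment number k forces the index to reprocess all k comments, so N comments added one at a time cost N(N+1)/2 operations in total. A quick check of the figures above:

```python
def total_indexing_cost(n_comments):
    """Cumulative indexing work when comments arrive one at a time:
    adding the k-th comment re-indexes all k comments, so the total
    is 1 + 2 + ... + n = n * (n + 1) // 2."""
    return n_comments * (n_comments + 1) // 2

for n in (200, 1_000, 10_000):
    print(n, total_indexing_cost(n))
# 200 -> 20,100; 1,000 -> 500,500; 10,000 -> 50,005,000
```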
Given that the popular documents are the ones most likely to change, that means we have quite a lot of load on the system, disproportionate to the number of actual requests, because we have to do so much work during indexing.
So now we know the cause of the higher-than-expected utilization. The question is: what can we do about it? There are two separate issues that we need to deal with here. The first, the actual data modeling, is something that I have talked about before. Instead of putting all the comments in a single document, break them up based on size / date / etc. The book also has some advice on the matter. I consider this to be the less urgent issue.
The second problem is the cost of indexing, which is quite high because of the fanout. Here we need to understand why we have a fanout in the first place. The user may want to be able to run a query like so:
This will give us the comments of a particular user, projected directly from the data stored in the index. We can change the way we structure the index and still project the relevant results directly. Let’s see the code:
As you can see, instead of having a fanout, we have a single index entry per document. We still need to do more work for larger documents, but we have reduced it significantly. More importantly, we don’t need to store the data in the index. Instead, at query time, we use the where clause to find the documents that match, then project just the comments that we want back to the user.
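A minimal sketch of the non-fanout shape, again in plain Python with hypothetical names: the map emits a single entry per document, indexing only the comment authors for the where clause; the comments themselves stay in the document and are filtered in the query-time projection.

```python
# Hypothetical model of the reworked index: one entry per document,
# no stored comment data.

def map_single_entry(post):
    """One index entry per document: index the authors so queries
    can filter on them, but do not store the comments themselves."""
    yield {
        "PostId": post["Id"],
        "CommentAuthors": [c["Author"] for c in post["Comments"]],
    }

def project_comments(post, author):
    """Query-time projection: load the matching document and return
    only the comments written by the requested author."""
    return [c for c in post["Comments"] if c["Author"] == author]

post = {
    "Id": "posts/1",
    "Comments": [
        {"Author": "alice", "Text": "first"},
        {"Author": "bob", "Text": "nice post"},
        {"Author": "alice", "Text": "+1"},
    ],
}

entries = list(map_single_entry(post))
print(len(entries))                     # a single entry, not one per comment
print(project_comments(post, "alice"))  # just alice's comments
```

Adding a comment now costs one index entry update rather than a rebuild of thousands of entries, which is where the quadratic blowup disappears.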
Simple, effective, and it won’t cause your database’s workload to increase significantly.