Implementing a count(distinct) query in RavenDB

A user called us to ask about how they can manage to move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following one:

This is an interesting issue, when you think about it from the point of view of a database engine. The distinct issue means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB was that we want to make it hard to accidently create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB. We need to have a two stage approach for implementing this feature.

First, we’ll introduce a Map/Reduce index, which will aggregate the data on Employee, Company and City. Along the way, it will run the distinct operation on the City, because it will group by it. That gives us a model in which we get the distinct amount for free, and in a highly efficient manner. Here is the index in question:

The interesting thing about this index is that querying it will not give us the right results. We don’t want to get the details based on Employee, Company and City. We want just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it will look like:

What this does is to aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals. And here are the results:

The end result is that we are able to do most of the work an indexing time, and the query time is left working on already aggregated data. That means that the queries should be much faster and that there is a lot less work for the database to do.

It also isn’t RavenDB’s strong suit. Such queries are typically more inline with OLAP systems, to be honest. If you know what your query patterns looks like, you can use this technique to easily handle such queries, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push that to a reporting system.

RavenDB

RavenDB Cloud

Try

Experience interactive demos and playground server

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Unlock your business potential

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events

What’s New

Roadmap

On-premise Pricing

Cloud Pricing

Support

Proof of Concept Program

Academic Program

Implementing a count(distinct) query in RavenDB

Woah, already finished? 🤯

Related Articles

CollabTalk Podcast | Episode 123 with Oren Eini–Building a business with Open Source foundations

RavenDB’s storage engine: Voron–unlocking the secret

Certificates from the Ground Up

Watch Live Demo

RavenDB

RavenDB Cloud

Try

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events