Implementing a count(distinct) query in RavenDB
A user called us to ask about how they can manage to move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following one:
This is an interesting issue, when you think about it from the point of view of a database engine. The distinct issue means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB was that we want to make it hard to accidently create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB. We need to have a two stage approach for implementing this feature.
First, we’ll introduce a Map/Reduce index, which will aggregate the data on Employee, Company and City. Along the way, it will run the distinct operation on the City, because it will group by it. That gives us a model in which we get the distinct amount for free, and in a highly efficient manner. Here is the index in question:
The interesting thing about this index is that querying it will not give us the right results. We don’t want to get the details based on Employee, Company and City. We want just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it will look like:
What this does is to aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals. And here are the results:
The end result is that we are able to do most of the work an indexing time, and the query time is left working on already aggregated data. That means that the queries should be much faster and that there is a lot less work for the database to do.
It also isn’t RavenDB’s strong suit. Such queries are typically more inline with OLAP systems, to be honest. If you know what your query patterns looks like, you can use this technique to easily handle such queries, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push that to a reporting system.