Inside RavenDB 4.0

Static Indexes and Other Advanced Options

In the previous chapter, we looked at how we can query RavenDB, while letting the query optimizer handle the indexing for us. In the next chapter, we'll talk about MapReduce indexes and in the one after that, we'll dive into the actual details of how indexes are implemented in RavenDB and the wealth of metrics that show how indexes work and behave, which is very useful for performance monitoring and troubleshooting issues. This chapter, however, will primarily focus on how you can define your own indexes, why you would want to do that and what options this provides you.

Indexes in RavenDB are split across multiple axes:

  • Auto (dynamic) indexes vs. static indexes.
  • Map-only indexes vs. MapReduce indexes.
  • Single collection indexes vs. multi-collection indexes.

Indexes offer quite a few features and capabilities. It's a big topic to cover, but it's also one that gives you a tremendous amount of power and flexibility.

What are indexes?

Indexes allow RavenDB to answer questions about your documents without scanning the entire dataset each and every time. An index can be created by the query optimizer or by the user directly. The way an index works is by iterating over the documents and building a map between the terms that are indexed and the actual documents that contain them - a process that is called indexing. After the first indexing run, the index will keep that map current as updates and deletes happen in the database.

Listing 10.1 shows a simple way to construct an index. The code in Listing 10.1 has nothing to do with RavenDB but is provided so we'll have a baseline from which to discuss how indexes work.

Listing 10.1 Creating an index over users' names


Func<string, List<User>> BuildIndexOnUsers(List<User> users)
{
    var index = new Dictionary<string, List<int>>();
    for (var i = 0; i < users.Count; i++)
    {
        if (index.TryGetValue(users[i].Name, out var list) == false)
        {
            list = new List<int>();
            index[users[i].Name] = list;
        }
        list.Add(i);
    }

    return username =>
    {
        var results = new List<User>();
        if (index.TryGetValue(username, out var matches))
        {
            foreach (var match in matches)
                results.Add(users[match]);
        }
        return results;
    };
}

The code in Listing 10.1 is meant to convey a sense of what's going on. We're given a list of users, and we iterate over the list, building a dictionary that would allow fast access to the user by name. This is all an index is, effectively. It trades the cost of building the index for a significant reduction in the cost of the query.

If there's a query you only want to perform once, you're probably better off just scanning through the entire dataset since you'll do that anyway when creating an index. But if you intend to query more than once, an index is a fine investment. Consider the two options shown in Listing 10.2.

Listing 10.2 The right and wrong way to use an index


// the right way: build the index once, then reuse it for every query
Func<string, List<User>> findUserByName = BuildIndexOnUsers(largeNumberOfUsers);
List<User> usersNamedOren = findUserByName("Oren");
List<User> usersNamedArava = findUserByName("Arava");

// the wrong way: every query rebuilds the entire index and uses it only once
List<User> usersNamedOren = BuildIndexOnUsers(largeNumberOfUsers)("Oren");
List<User> usersNamedArava = BuildIndexOnUsers(largeNumberOfUsers)("Arava");

In the first section in Listing 10.2, we generate the index and then use it multiple times. In the second section, we create the index for each query, dramatically increasing the overall cost of making the query.

The code in Listing 10.1 and Listing 10.2 is about as primitive an index as you can imagine. Real world indexes are quite a bit more complex, but there's a surprising number of details that are actually unchanged between the toy index we have here and the real-world indexes used in RavenDB.

Indexes in RavenDB are implemented via the Lucene search library, hosted inside Voron, RavenDB's storage engine. An index in RavenDB can contain multiple fields, and a query can be composed of any number of clauses that operate on these fields. But in the end, we end up with a simple search from the queried term to the list of documents that contain the term in the specified field, just like our dictionary usage in Listing 10.1.

Indexes come in layers — and with an identity crisis

I'll be the first to admit that this can be quite confusing, but RavenDB actually has several different things called "index" internally. At the lowest level, we have Voron indexes, which is how RavenDB organizes the data in persistent storage. As a user, you don't have any control over Voron indexes. A Voron index is used to find a document by ID, for example, or to return a list of documents that belong to a particular collection, without any further filters.

A Voron index is updated by the storage engine directly as part of the transaction and is always kept in sync with the data. Unlike the async nature of higher level indexes in RavenDB, these Voron indexes (sometimes also called storage indexes) guarantee full consistency to the reader. You'll rarely be able to see them in use or affect them in any way, but they're crucial for the well being and performance of RavenDB.

I did mention this is confusing, right? Because while we call these Voron indexes, our regular (exposed to the user) indexes are also stored in Voron, to ensure the index data is transactionally safe and can recover from an error such as an abrupt shutdown.

Even for the user-visible indexes (what you'll generally mean when you're talking about indexes), there are several different levels. At the outer edge of the system, you have the index that was defined for you by the query optimizer or that you manually created. This, however, is not the actual index but rather the index definition. It merely tells RavenDB how you want to index the data.

Then, we have the actual process of indexing the data (which resides in the Index class and its derived classes) and the actual output of the indexing process, which is also called an index.

The good news about this naming mess is that you're rarely going to need to think about any of that. As far as the outside world can tell, RavenDB allows you to define indexes, and that's pretty much it. But if you're reading the code or interested in the implementation details, you need to remember that when we're talking about an index, you might want to verify what index we're talking about.

For this chapter, we'll use the term index definition for the definition of what's going to be indexed and the term index for the actual indexed data generated from the indexing process.

The first index

In the Studio, go to the Indexes section and then to List of Indexes. Click the New index button and give the index the name "MyFirstIndex". This screen allows you to define an index by writing the transformation function from the document format to the actual indexed data.

RavenDB uses C# and Linq to define indexes, and Figure 10.1 shows the simplest possible index: indexing the FirstName and LastName from the Employees collection.

Figure 10.1 A simple index over the Employees collection

An index is just a C# Linq expression on a collection that outputs the values we want to index. The index in Figure 10.1 isn't a really interesting one, and it usually won't be a good candidate for a static index. There's nothing there that can't be done using an auto index the query optimizer will create for us. Nevertheless, let's see how we use such an index.

Listing 10.3 shows how we can query this index. Instead of specifying that we'll use a collection, which will cause the query optimizer to take over and select the best index to use, we're explicitly specifying that we want to use the "MyFirstIndex" index.

Listing 10.3 RQL query using an explicit index


from index 'MyFirstIndex' where FirstName = 'Andrew'

Except for the explicit index, the rest of the query looks familiar. Indeed, everything that we've gone over in the previous chapter still applies. The difference is that, in this case, we have an index that defines the shape of the result in a strict fashion. That doesn't seem like such a good idea until you realize the shape of the index and the shape of the source document don't have to be the same. Consider Listing 10.4, which shows an updated definition for "MyFirstIndex". It indexes a computed value rather than actual values from the document. Also consider Listing 10.5, which shows how to query the new index.

Listing 10.4 Indexing a computation allows us to query over the computed value


from emp in docs.Employees
select new 
{
    Name = emp.FirstName + " " + emp.LastName
}

Listing 10.5 Querying over the computed field


from index 'MyFirstIndex' where Name = "Nancy Davolio"

The result of Listing 10.4 and Listing 10.5 is quite interesting, so it's worth examining further. In Listing 10.4, we define an index whose output is a computed field ("Name"), which in turn is the result of a concatenation of two values from the document. In Listing 10.5, you can see that we're querying over the index and finding a result, even though the source document never contained such a field.

The example here is silly; I'll be the first to admit it. But it nicely shows off an important feature. You can run computation during the indexing process and then query over the result of the said computation. This is quite powerful because it allows you to do some pretty cool things. For example, consider Listing 10.6, which shows an index ("Orders/Totals" in the sample dataset) that does a more interesting computation.

Listing 10.6 Computation during indexing can be arbitrarily complex


from order in docs.Orders
select new { 
    order.Employee,  
    order.Company, 
    Total = order.Lines.Sum(l => 
        (l.Quantity * l.PricePerUnit) * (1 - l.Discount)) 
}

In Listing 10.6, we've computed the total value of an order. The formula we used isn't too complex, but it's also not trivial. What makes it interesting is that this allows us to run a query such as the one in Listing 10.7.

Listing 10.7 Querying over computed field as well as sorting by it


from index 'Orders/Totals'
where Total > 100
order by Total as double desc

The query in Listing 10.7 demonstrates a few key concepts. Once the indexing process is done, the computed field is just that: a field. It means that you can filter using this field as well as sort by it. The computation has already happened at indexing time, and the cost of the query in Listing 10.7 is the cost of a seek to the relevant location in the index, plus the cost of sorting the results according to the indexed value.

Most importantly, this query involves no computation during its execution, only index operations. In contrast to a comparable query in SQL, which would have to sum all of the order lines for each order in the Orders table, we can take an enormous shortcut by running the computation once during indexing and reusing it in our queries. We're going to spend quite some time exploring what kind of fun we can have with this feature.
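To make the cost shift concrete, here is a minimal sketch in plain C#, in the same spirit as Listing 10.1. It has nothing to do with RavenDB's actual implementation. The order data and the names `orders` and `byTotal` are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical orders; each line is (Quantity, PricePerUnit, Discount).
var orders = new Dictionary<string, (int Quantity, decimal PricePerUnit, decimal Discount)[]>
{
    ["orders/1"] = new[] { (2, 10m, 0m), (1, 50m, 0.1m) },  // 20 + 45 = 65
    ["orders/2"] = new[] { (10, 30m, 0.5m) },               // 150
    ["orders/3"] = new[] { (3, 100m, 0m) },                 // 300
};

// Indexing time: the formula from Listing 10.6 runs once per order.
var byTotal = orders
    .Select(o => (Total: o.Value.Sum(l => l.Quantity * l.PricePerUnit * (1 - l.Discount)), Id: o.Key))
    .ToList();

// Query time: "where Total > 100 order by Total desc" only reads the
// precomputed values. No order line is touched here.
var results = byTotal
    .Where(e => e.Total > 100m)
    .OrderByDescending(e => e.Total)
    .Select(e => e.Id)
    .ToList();

Console.WriteLine(string.Join(", ", results));  // orders/3, orders/2
```

The point to notice is that the lambda that sums the order lines appears only in the indexing step. The query step would look exactly the same no matter how expensive that formula was.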

Computation during indexing vs. computation during query

In the previous chapter, we mentioned that RavenDB does not allow you to perform queries that will require computation during the query, such as from Employees where FirstName = LastName. In order to answer such a query, RavenDB will need to check each individual document, which is incredibly expensive.

You can get an answer to the question you actually want to ask, though, which is "Do I have users whose first and last name match?" You do that using a static index, such as

from employee in docs.Employees
select new
{
   FirstAndLastNameMatch = employee.FirstName == employee.LastName    
}

And you can query this index using the following query: from index 'Employees/FirstAndLastNameMatch' where FirstAndLastNameMatch == true

This query can be performed as a simple indexing operation instead of an expensive full scan. This is a common thing to do with RavenDB: shift the cost of the computation to indexing time as much as possible. Queries are far more frequent than updates, so this kind of cost-shifting makes a lot of sense. Of course, even so, if you have several operations you want to do on a specific collection, you're better off having them all on a single index rather than having a separate index for each.

Doing computation during indexing is a neat trick, but how does RavenDB handle the case where a document was updated? That's quite simple, as it turns out. All RavenDB needs to do is simply run the updated document through the indexing function again and index the resulting values.
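As a rough sketch of that idea, we can extend the toy dictionary index from Listing 10.1 so that it survives updates. This is an assumption-laden model, not how RavenDB stores anything: `map`, `termByDoc` and `Put` are hypothetical names, and each document produces exactly one term to keep things short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy model. The indexing function turns a document into a term.
Func<(string First, string Last), string> map =
    d => (d.First + " " + d.Last).ToLowerInvariant();

var index = new Dictionary<string, HashSet<string>>();  // term -> document IDs
var termByDoc = new Dictionary<string, string>();        // document ID -> term it produced

void Put(string id, (string First, string Last) doc)
{
    // Remove whatever the previous version of this document contributed.
    if (termByDoc.TryGetValue(id, out var oldTerm))
    {
        index[oldTerm].Remove(id);
        if (index[oldTerm].Count == 0)
            index.Remove(oldTerm);
    }

    // Run the new version through the same function and index the result.
    var term = map(doc);
    if (index.TryGetValue(term, out var ids) == false)
        index[term] = ids = new HashSet<string>();
    ids.Add(id);
    termByDoc[id] = term;
}

Put("employees/1", ("Nancy", "Davolio"));
Put("employees/1", ("Nancy", "Fuller"));  // the document was updated

Console.WriteLine(index.ContainsKey("nancy davolio"));  // False
Console.WriteLine(index.ContainsKey("nancy fuller"));   // True
```

Because the same function is applied to every version of the document, the index never needs to know what changed. It only needs the new output.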

Index definition must be a pure function

RavenDB places several limitations on the index definition. Primarily, it requires that the index definition be a pure function. This means that for the same input, the index definition will always produce the same output. One of the reasons that RavenDB uses Linq for defining indexes is that it's quite easy to define a pure function using Linq. In fact, you need to go out of your way to get a nondeterministic output from a Linq expression. And the syntax is quite nice, too, of course.

In particular, using date and time functions or random values, as well as trying to access external resources, is not allowed. This lets RavenDB assume that identical inputs will produce identical outputs and is important for reindexing and updates.
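A small illustration of the difference, written as plain C# delegates rather than as real index definitions. The two functions and the sample date are hypothetical. The first depends only on its input, while the second also depends on the clock, so its output can change even though the document did not.

```csharp
using System;

// Pure: depends only on the document, so reindexing yields the same value.
Func<DateTime, int> birthYear = d => d.Year;

// Not pure: the result drifts as time passes, with no change to the document.
Func<DateTime, int> yearsSinceBirth = d => DateTime.UtcNow.Year - d.Year;

var born = new DateTime(1990, 5, 17);
Console.WriteLine(birthYear(born));        // 1990, whenever it runs
Console.WriteLine(yearsSinceBirth(born));  // depends on when it runs
```

With the second function, an index entry written last year would silently disagree with one written today for the very same document.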

How the index actually works

There are a lot of moving parts here, so we need to clearly define what the terms we use mean:

  • Document — the RavenDB JSON document.
  • Index entry — all of the fields and values that have been indexed from a particular document. Frequently, it will be a subset of the fields from the document that is being indexed, but it can be some computed fields as well.
  • Term — the actual indexed value that's stored in the index. This is usually the same as the value of the field being indexed, but it can be different if you're applying full text search.

For example, let's assume we have the JSON document in Listing 10.8 and we query using from Dogs where search(Name, 'Arava').

Listing 10.8 Sample document that is about to be indexed


{
    "Name": "Arava Eini", 
    "Nick": "Dawg", 
    "@metadata": {
        "@id": "dogs/1",
        "@collection": "Dogs"
    } 
}

What will happen is that RavenDB will produce an index entry from this document that will have the structure {"Name": "Arava Eini"} and will mark the Name field as using full text search. This requires additional processing, and the actual terms that will be indexed are shown in Listing 10.9.

Listing 10.9 Index terms after the analyzing for full text search


index = {
    "Name": {
        "arava": ["dogs/1"],
        "eini": ["dogs/1"]
    }
}

The search(Name, 'Arava') will then be translated into what's effectively a search on index.Name['arava'] to get the proper matches.

This isn't how the index works at all. But it's a very good lie because it allows you to reason about what's going on and make intuitive guesses about the behavior of the system without actually having to deal with the full complexity of managing an index. For example, you can see from the data we keep in the index in Listing 10.9 that we aren't storing the full document in the index. Instead, we only store the document ID.

This means that the query pipeline first needs to run the query on the index and get the list of document IDs that are a match for this query and then go to the document storage to load the actual documents from there.
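Sticking with the "very good lie" from Listing 10.9, the two-step lookup can be sketched as follows. This is a simplified model that only handles a single-term query. The `documents` store and the `Search` function are hypothetical stand-ins, not RavenDB APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Document storage: ID -> full document (kept as raw JSON text here).
var documents = new Dictionary<string, string>
{
    ["dogs/1"] = "{ \"Name\": \"Arava Eini\", \"Nick\": \"Dawg\" }",
};

// The simplified index from Listing 10.9: field -> term -> document IDs.
var index = new Dictionary<string, Dictionary<string, List<string>>>
{
    ["Name"] = new Dictionary<string, List<string>>
    {
        ["arava"] = new List<string> { "dogs/1" },
        ["eini"] = new List<string> { "dogs/1" },
    },
};

List<string> Search(string field, string value)
{
    // Step 1: normalize the queried value the same way the terms were.
    var term = value.ToLowerInvariant();

    // Step 2: the index only hands back document IDs.
    if (index[field].TryGetValue(term, out var ids) == false)
        return new List<string>();

    // Step 3: load the actual documents from the document storage.
    return ids.Select(id => documents[id]).ToList();
}

Console.WriteLine(Search("Name", "Arava").Count);       // 1
Console.WriteLine(Search("Name", "Arava Eini").Count);  // 0
```

The second call finds nothing because there is no single term "arava eini" in this index, only the two separate terms.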

Queries, stale indexes and ACID documents, oh my!

In the previous chapter, we talked about async indexes and the possibility that a query will read from an index before it's done indexing new or modified documents. An interesting wrinkle here is that the index doesn't contain the actual document data, so after the index is done giving us the document IDs, we need to go to the document storage and load the documents.

One of the promises that RavenDB provides is that document reading is always ACID and must be consistent (within a single node, at least). This means that even if the index itself hasn't caught up to changes, the data it pulls from the document store will have everything up to date.

Another important aspect of how queries work in RavenDB, which you can see in Listing 10.9, is that the Name field was not indexed as a single term. In other words, if we looked for index.Name['Arava Eini'], we wouldn't find anything in the index. This is because we search for the terms in the index. And during indexing, the terms were created by breaking the name to its constituents' parts and making all letters lowercase. At query time, we can apply the same transformation and be able to find the individual terms.

If we were indexing the name without full text search, we'd index the single term arava eini. The only thing this allows is for us to run a case-insensitive query. Using exact(), of course, will store the term in the index as is and will require a case-sensitive match.

We already saw, in Figure 9.4 in the previous chapter, that you can pull the terms that are actually indexed from RavenDB and inspect them, which can help you understand why your queries return the results they do.

All of this explanation is here to hammer home the fact that at the index level, we aren't querying on your documents' properties. We're querying on the output of the indexing function, and that may bear little resemblance to how your documents look. The Total field in Listing 10.6 serves as a good example for this feature. The documents don't have a Total property, but we can compute it during indexing and query on it.

Security considerations

It's worth noting that a static index is just a C# Linq statement, which means you have a lot of power in your hands. An index can transform a document into an index entry in some pretty interesting ways. Combine this with the fact that the shape of the index entry and the shape of the data can be completely disconnected from one another and it's easy to understand why we'll spend most of this chapter just skimming over all that you can do with indexes.

This power has a downside. Static indexes can do everything, including running arbitrary code. In fact, they are arbitrary code. Defining static indexes is an operation that's limited to database administrators for that reason. Auto indexes defined by the query optimizer do not have this issue, obviously, and will be defined for users of all permission levels.

You can also use the Additional Sources feature in indexes to expose additional classes and methods to the index definition. This goes beyond having a simple Linq expression and allows you to run any code whatsoever on the server. It's mostly meant to allow you to perform complex logic in the indexing and enable advanced scenarios (using custom data types on client and server, with their own logic, like NodaTime). You can read more about the Additional Sources feature in the online documentation.

We can also filter data out during indexing. The indexing function takes a collection of documents and returns a collection of index entries. It can also return no index entries, in which case the document will not be indexed. This can be done using a where in the Linq expression that composes the index definition. Listing 10.10 shows an example of filtering out employees without a manager for the "Employees/ByManager" index.1

Listing 10.10 Filtering documents during indexing


from employee in docs.Employees
where employee.ReportsTo != null
select new
{
    employee.ReportsTo
}

Only employees who have someone to report to will be included in this index. There isn't usually a good reason to filter things during the indexing process. The reduction in the index data size isn't meaningful, and you're usually better off having this in the query itself, where you can make such decisions on a per query basis. We'll see another use case for filtering the raw data in the index in a bit, when we discuss multimap indexes.

Storing data in the index

A query in RavenDB will go to the index to find the results (and the order in which to return them) and then will usually grab the document IDs from the results and load the documents from the document store. Since the typical query will return full documents back, that's usually what you'll want to do.

Sometimes, such as with the Total field in Listing 10.6, you want to compute a value during indexing and use it in your projection. By default, RavenDB will store only enough information in the index to handle the query — not to get data out of the index. So as it stands, we'll need to recompute the Total after the query.

We can ask RavenDB to store the field. Go to Indexes, List of Indexes, and click the "Orders/Totals" index. This will take you to the index edit screen. Click Add Field and then set Total as the field name. Next, set "Store" to Yes. You can now save the index. This setting tells RavenDB that it needs to store the value itself (and not just the parts that it indexed) in such a way that we can later retrieve it.

We can project the Total field from the query, as you can see in Listing 10.11.

Listing 10.11 Projection of a stored computed field


from index 'Orders/Totals'
where Total > 10.0
select Total, Company, OrderedAt

Listing 10.11 also shows that we can project a standard field, Company, without storing it. This works because if the value isn't stored in the index we'll try to get it from the document. Last, we also project the OrderedAt, which follows the same logic. It isn't stored in the index, so it's fetched directly from the document.
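The fallback rule described here can be sketched in a few lines of plain C#. This is only a model of the behavior as described in the text. The `Project` function, the two dictionaries and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical document and the stored part of its index entry.
var document = new Dictionary<string, object>
{
    ["Company"] = "companies/85",
    ["OrderedAt"] = "1996-07-04",
};
var storedInIndex = new Dictionary<string, object>
{
    ["Total"] = 440m,  // computed during indexing and marked with Store = Yes
};

object Project(string field)
{
    // Prefer the value stored in the index, if there is one...
    if (storedInIndex.TryGetValue(field, out var stored))
        return stored;

    // ...otherwise fall back to the document itself.
    return document.TryGetValue(field, out var fromDoc) ? fromDoc : null;
}

Console.WriteLine(Project("Total"));      // 440
Console.WriteLine(Project("Company"));    // companies/85
Console.WriteLine(Project("OrderedAt"));  // 1996-07-04
```

Without the stored value, `Total` would have no source at all in this model, since the document never had such a property.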

Stored fields are used primarily to store the result of such computations. There's a small performance advantage in projecting everything from the index. We don't need to do the document load, and in some very specific circumstances, that might be helpful. But document load is a very cheap process in RavenDB. It happens so frequently that it's been heavily optimized, so there's usually not much point trying to store fields in the index.

Storing data in the index will also increase its size and the time it takes to actually index since it needs to do more work. Unless you actually need to get the projection out, it's usually not worth it.

Querying many sources at once with multimap indexes

Sometimes, just querying a single source isn't enough. In the sample database, we have several types of users. We have Employees, the Contact person on Companies and the Contact person for Suppliers. If an application needed to allow a free-form search over all of these users, how would that work? Would we need to perform three separate queries? We could do that, but it would be pretty complex. Instead, we can define an index that will use more than a single source of data.

Go to Indexes and then to List of Indexes. Click New Index. Then, name the new index "People/Search" and click Add map twice. You can see the content of the map functions in Listing 10.12.

Listing 10.12 These map functions allow us to query over multiple sources with ease


from e in docs.Employees
select new
{
    Name = e.FirstName + " " + e.LastName
}

from c in docs.Companies
select new 
{
    c.Contact.Name
}

from s in docs.Suppliers
select new 
{
    s.Contact.Name
}

The three map functions in Listing 10.12 each point to a different collection. RavenDB will use the information from all three map functions to generate a single index. This means that you can now query on all of these collections as a single unit. There are a few things to notice, though. First, multimap indexes require that all the maps in the index have the same output. Note that we indexed a Name field in all three maps even though the Employees collection has no such field.
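Here is a toy model of the same idea in plain C#, with tuples standing in for documents. The sample data and variable names are hypothetical. What matters is that each map handles a differently shaped source but emits the same output shape, and that all of the entries end up in one lookup structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical documents from three collections, each with its own shape.
var employees = new[] { (Id: "employees/1", FirstName: "Nancy", LastName: "Davolio") };
var companies = new[] { (Id: "companies/1", ContactName: "Maria Anders") };
var suppliers = new[] { (Id: "suppliers/1", ContactName: "Charlotte Cooper") };

// One map per collection; every map emits the same shape: (Name, Id).
var entries = employees.Select(e => (Name: e.FirstName + " " + e.LastName, e.Id))
    .Concat(companies.Select(c => (Name: c.ContactName, c.Id)))
    .Concat(suppliers.Select(s => (Name: s.ContactName, s.Id)));

// All entries land in a single index, keyed by the shared Name field.
var index = entries.ToLookup(e => e.Name.ToLowerInvariant(), e => e.Id);

Console.WriteLine(string.Join(", ", index["nancy davolio"]));     // employees/1
Console.WriteLine(string.Join(", ", index["charlotte cooper"]));  // suppliers/1
```

If one of the maps emitted a differently named field, the `Concat` calls would not even compile, which is a fair analogy for the requirement that all maps share the same output.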

The other important factor is that it's usually awkward to have to deal with a heterogeneous result set. When you're querying, it's nice to know what shape of data to expect. A query on a multimap can return any of the collections that the multimap index covers. Because of that, it's usually best to project the data out into a common shape, as you can see in Listing 10.13.

Listing 10.13 Projecting data from multiple collections into a common shape


from index 'People/Search' as p 
where Name in ('Mary Saveley', 'Nancy Davolio', 'Wendy Mackenzie')
select
{
    Collection: p["@metadata"]["@collection"],
    ContactName: (p.Contact || 
        { Name: p.FirstName + " " + p.LastName }).Name
}

The output of the query in Listing 10.13 can be seen in Figure 10.2. You can also remove the select clause for this query to see how the results will change when you get three different types of documents back from a single query.

Figure 10.2 Common output shape for results from multimap index

In a way, the multimap feature is similar to a union in a relational database. It allows you to query over multiple sources and get the results back from any of them. However, unlike a union, there's no requirement that the results share a single shape, although having a common shape is quite convenient.

Full text indexes

The query in Listing 10.13 is nice, but it's awkward. We wouldn't want to ask our users to specify the full name of the person they're trying to find. We'll typically want to do a smarter search...a full text search, one might even say.

In the previous chapter, we looked at some full text search queries. They looked like search(Name, 'Nancy') and allowed us to efficiently search for results without the expense of scanning all of them. Listing 10.9 is a good example of how RavenDB breaks up the terms during indexing for quick lookups and speedy searches. But so far, we only looked at it with dynamic queries. How do I make use of the full text search capabilities of RavenDB using static indexes?

Full text search in RavenDB is composed of the following elements:

  • The analyzer you've selected.
  • The values and manner in which you're querying.
  • The field or fields you're indexing.

The analyzer for full text search is defined on a field or fields in the index, and it determines how RavenDB will break apart the text into individual terms. We've seen such an example in Listing 10.9, but let's consider the following string: "The white knight is running to the princess's tower to slay the dragon". How would full text search operate on such a string?

Full text search analysis

RavenDB will hand over this string to the analyzer and get a list of terms back. This is an extremely simplistic view of what's actually happening. But it's important to understand it so you'll have the proper mental model of what's actually going on under the hood. The simplest analyzer will do nothing to the text provided to it, and you'll get it back as is. This is what exact() does with dynamic queries — it lets RavenDB know that we should use the no-op analyzer, which does case-sensitive matches.

The default analyzer that RavenDB uses isn't doing much more than that, merely allowing us to query in a case-insensitive manner. This is done by converting the input string to lowercase (with all the usual casing required for Unicode-aware programs, of course). It's important to note that the analyzer runs during indexing and during query time. In this way, what ends up querying the actual index data structures is a value that's been passed through the analyzer during the query and compared to a value that was passed through the analyzer during the indexing.

Just changing strings to lowercase isn't that interesting, I'll admit, but analyzers can do much more. When you use search(), you'll use something called the "standard analyzer," and that's when things start to get interesting. The analyzer will break the input string into individual terms on a word boundary. So the previous string will be broken up into the following terms:

  • the (3x)
  • white
  • knight
  • is
  • running
  • to (2x)
  • princess's
  • tower
  • slay
  • dragon

Note that we have made the terms lowercase and that the term the appears three times and to appears twice. In many languages, there are certain words that appear so frequently that they're meaningless noise in most cases. In English, those would be words like a, the, to, from, is, are, etc. These are called stop words and are stripped from the terms the analyzers return because they add no semantic value to the search results.

The terms we end up with are

  • white
  • knight
  • running
  • princess
  • tower
  • slay
  • dragon

Note that the possessive "s" in princess's is something that the standard analyzer has removed. We could also reduce words to their stems, such as turning running into run. The standard analyzer doesn't do this, but you can select an analyzer that does. Analyzers are usually specific to the language (and sometimes even the specific business domain) that you're working on. The standard analyzer is a good default for English and most Latin-based languages, but sometimes you'll need more. In that case, you can look at the available analyzers and use them. A full list of the analyzers available by default with RavenDB can be found in the online documentation. Because RavenDB uses Lucene behind the scenes, it's quite easy to find analyzers for most needs that can be readily used by RavenDB. You can also define your own custom analyzers quite easily.
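The steps described above can be imitated with a few lines of plain C#. To be clear, this is a rough sketch and not Lucene's standard analyzer: the `Analyze` function is hypothetical, it splits only on spaces, and the stop word list is a small sample taken from the words mentioned in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A small, hypothetical stop word list.
var stopWords = new HashSet<string> { "a", "the", "to", "from", "is", "are" };

List<string> Analyze(string text) =>
    text.ToLowerInvariant()                                        // make everything lowercase
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries) // break into words
        .Select(t => t.EndsWith("'s", StringComparison.Ordinal)     // strip the possessive
            ? t.Substring(0, t.Length - 2)
            : t)
        .Where(t => stopWords.Contains(t) == false)                 // drop stop words
        .ToList();

var terms = Analyze("The white knight is running to the princess's tower to slay the dragon");
Console.WriteLine(string.Join(", ", terms));
// white, knight, running, princess, tower, slay, dragon
```

Running the same `Analyze` function over the query string at query time is what lets a search for "Princess" find the term princess in the index.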

Let's look at how analyzers modify how RavenDB indexes documents. Figure 10.3 shows a sample document that we'll focus on.

Figure 10.3 Sample document for playing with full text search

Go ahead and create this document, and then create an index named test/search with the index definition in Listing 10.14.

Listing 10.14 This index definition uses the same field three times, to allow different indexing


from n in docs.Notes
select new 
{
    Plain = n.Description,
    Exact = n.Description,
    Search = n.Description
}

After adding the index definition, you'll need to customize the way RavenDB will index these fields. Click the Add field button and enter "Search" as the field name, then click Advanced and select Search as the value for the Indexing dropdown. Add another field and enter "Exact" in the name field. Then click Advanced and select Exact as the value for the Indexing dropdown. You can see how this should look in Figure 10.4.

Figure 10.4 Configuring the test/search index fields with different analyzers

And with that, you can click Save, and we're done. You can now go to the index Terms to see the different indexing methods that were used on each of these fields. The results are shown in Figure 10.5.

Figure 10.5 The indexed terms for the same values using different analyzers

The Plain field was indexed as is, but in lowercase. (Note the first the in the string.) The Exact field is almost the same, but it preserves the casing. (Again, notice the The at the beginning.) And the Search field is probably the most interesting one. There, we can see the whole process of breaking it up into individual terms, filtering out stop words and stripping the possessive "s" out.

Now, let's do some queries on this index.

Full text search queries

Querying a full text search field is an interesting experience because what we think we're doing and what's actually happening are so often drastically different, yet the end result is the same. Consider the query:
from index 'test/search' as n where search(n.Search, "princess flower tower").

If you run this query, you'll find that this actually matches the document, even though the word flower is nowhere to be found. That's because of the way RavenDB processes queries on full text search. A match on any one of the terms is enough for the document to be returned. More advanced options, such as phrase queries, are also available when using Lucene directly, as in the following queries:

  • where lucene(n.Search, ' "princess tower" ')
    Phrase query match (note the " in the query) because there's a princess followed by a tower in the text, even though we don't have the ' in the query.
  • where lucene(n.Search, ' "running princess" ')
    Also a match because the word running is followed by a princess (with the stop words removed).
  • where lucene(n.Search, ' "running knight" ')
    Not a match. There's no running followed by knight in the text.

As you probably remember, the lucene() method allows you to drop down directly into the Lucene syntax and compose complex full text search queries. I'm using it here primarily because it allows me to demonstrate the way full text search matches work. They're not quite so simple. You can read more about the full power of Lucene in the online documentation, but I want to focus on understanding how the queries work. Let's take, for example, the search() method and see how it operates.

The search() method accepts the query string you're looking for and passes it to the analyzer for the specified field. It then compares the terms the analyzer returned with the terms already in the index, and if there's a match on any of them, it's considered to be a match for the query. There's also the ranking of the results to take into account. The more terms that are matched by a particular document from the query, the higher it will be in the results. This is affected by such things as the term frequency, the size of the document and a lot of other things that I'm not going to cover but are quite interesting to read about.2

What happens when we're making a query on a full text field (one with an analyzer defined) without using search()? Let's see:

  • where n.Search = "princess tower"
    No match. There's no term princess tower for this field.
  • where n.Search = "dragon"
    Match. There's a term dragon for this field.

This is really strange, isn't it? But take a closer look at Figure 10.5, and it will be clear what's going on. There's a term dragon there, and when we use equality comparison, we compare against the terms directly, so we find a dragon but we don't find a single term princess tower. When we use search() or lucene(), we're performing more complex operations, which allows us to do more interesting queries.

For the same reason, it's not meaningful to talk about sorting on a full text field. The value you're sorting on can be any of the terms that were generated by the analyzer for this field. If you want to sort on such a field, you need to index it twice: once as a full text field and once as a normal field. You'll search on the full text field and sort on the normal field.
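
The matching rules described in this section can be sketched in a few lines of plain C#. Like Listing 10.1, this has nothing to do with how RavenDB or Lucene are actually implemented. The Analyze method is a toy stand-in for a real analyzer, the stop word list is a made-up sample, and Score simply counts matching terms, ignoring term frequency, document size and everything else that real ranking takes into account.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchingSketch
{
    // Made-up sample; a real analyzer ships with a much longer stop word list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "in", "of", "the", "to" };

    // Toy analyzer: lowercase, split on spaces and punctuation, strip a
    // trailing possessive 's and drop stop words.
    public static List<string> Analyze(string text)
    {
        var separators = new[] { ' ', ',', '.', ';', '!', '?' };
        var terms = new List<string>();
        foreach (var raw in text.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            var term = raw.EndsWith("'s", StringComparison.Ordinal)
                ? raw.Substring(0, raw.Length - 2)
                : raw;
            if (term.Length > 0 && StopWords.Contains(term) == false)
                terms.Add(term);
        }
        return terms;
    }

    // search(): the number of distinct query terms found among the field's
    // terms. Zero means no match; a higher score ranks earlier.
    public static int Score(string fieldValue, string query)
    {
        var indexed = new HashSet<string>(Analyze(fieldValue));
        return Analyze(query).Distinct().Count(term => indexed.Contains(term));
    }

    // Phrase query: the analyzed phrase must appear as a consecutive run of terms.
    public static bool ContainsPhrase(string fieldValue, string phrase)
    {
        var terms = Analyze(fieldValue);
        var wanted = Analyze(phrase);
        if (wanted.Count == 0)
            return false;
        for (var start = 0; start + wanted.Count <= terms.Count; start++)
        {
            if (terms.Skip(start).Take(wanted.Count).SequenceEqual(wanted))
                return true;
        }
        return false;
    }

    // Equality: the value is compared against whole terms directly,
    // without running it through the analyzer.
    public static bool EqualsTerm(string fieldValue, string value)
    {
        return Analyze(fieldValue).Contains(value);
    }
}
```

With a hypothetical value such as "The knight kept running to the princess's tower, away from the dragon", Score gives 2 for "princess flower tower" (a match, even without flower), ContainsPhrase accepts "running princess" but rejects "running knight", and EqualsTerm finds "dragon" but not "princess tower".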

This leads us nicely to an important discussion: how to work with fields in the index.

Full text search fields

RavenDB doesn't require anything to match between the source document and its index entry. We saw that previously in Listing 10.14, when we indexed a single field from the document three different times, with different analyzers for each field. In many cases, you'll use the same field names, but there's no requirement to do that. This behavior is intentional because it gives you a lot of freedom with regard to how you're able to build your index and perform queries.

Listing 10.14 showed an example of how we can index a single field multiple times. Listing 10.15 shows the index definition for Companies/Search, which allows for the reverse scenario, where we're searching over multiple fields in the document using a single field in the index entry.

Listing 10.15 Combining several document fields in a single index field


from c in docs.Companies
select new
{
    Query = new[] {
        c.ExternalId,
        c.Name,
        c.Contact.Name
    }
}

The index definition in Listing 10.15 adds three different fields to the Query field on the index entry. When creating this index definition, you also need to register the Query field as full text search (as we've done with the Search field in the previous example). The question is, what does this give us?

This type of index is typically used to serve search pages directly. For example, we can run the following queries on this index:

  • from index 'Companies/Search' where search(Query, "ALFKI")
    Search companies using the external ID.
  • from index 'Companies/Search' where search(Query, "Alfreds")
    Search companies by full text search on the company name.
  • from index 'Companies/Search' where search(Query, "Anders")
    Search companies by full text search on the contact person's name.

Note that in all cases, we get the same result (companies/1-A). A user can type any of the above into the search text box and get the result they're looking for. Gone are the days of search pages with dozens of fields and extra long waiting times. You can search on any of the interesting fields that a client may remember without any hassle.

This is something that's quite easy to do but can significantly improve the life of our users. Now they have much greater freedom in querying, and they don't need to limit themselves to knowing exactly what value to search in what field. One of the things I recommend you do in such pages is to directly go into the details page, if there's just one result. This gives the users the impression that they can type just enough for your system to recognize who they're talking about and take them to the right place. It may seem like a small thing, but these are the kind of things that make a user really appreciate a system.

Of course, it's not always this easy. What happens if I know what I'm searching for but I can't quite get it right enough for the index to find it? There's a lot more that you can do with indexes in RavenDB.

Getting the most out of your indexes

Indexes are typically used to answer very specific questions, such as "Give me all the documents that match these criteria in this order." But RavenDB indexes are actually capable of doing much more. In this section, I want to highlight three of the more interesting capabilities of the indexes in RavenDB.

  • Suggestions — allowing you to ask RavenDB what the user probably meant to ask about.
  • More like this — suggesting similar documents to an existing one.
  • Facets — slicing and dicing of the data to provide you with detailed insights into what's going on in large result sets.

I'm going to cover them briefly here, mostly to introduce them and explain where and how they should be used. I'll leave the details on the myriad options and advanced features they have to the online documentation.

Suggestions

In the sample data, we have the companies/8-A document. It's for a company in Spain whose owner's name is Martín Sommer. Note the diacritic over the í. It's reasonable for someone to not notice that and search for Martin. In this case, they'd find nothing. This can be frustrating, so we have a few ways in which we can help the user find what they're looking for.

In the same way we'll automatically go to the details page if there's only a single result, we'll also ask RavenDB if it can think of what the user meant to ask for. This can be done using the suggestions feature. Before we can use it, though, we need to enable it in the index, as shown in Figure 10.6.

Figure 10.6 Marking a field as having the suggestions feature

With this change, we let RavenDB know that we'll be asking it to suggest options to queries the user has made. This typically requires RavenDB to spend more time in indexing, preparing all the options that a user can misspell in a search query, but the results can be astounding to users. Consider the query in Listing 10.16.

Listing 10.16 Querying for suggestions for a misspelled term


from index 'Companies/Search' 
select suggest(Query, "Martin")

We're asking RavenDB, "What could the user have meant by 'Martin' in the Query field?" RavenDB will try to look at the data in the index for this field and infer the intent of the user. If you care to know the details, RavenDB breaks the terms into pieces during the indexing process and scrambles them to simulate common errors. Those all go into an index that's used during query. This does increase the cost of indexing, but the cost of querying suggestions is typically very low. I wouldn't suggest3 applying this globally, but for user-facing searches, the dataset is typically pretty stable, so that works out great.
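
To get a feel for the "breaking terms into pieces" idea, here is a minimal sketch in plain C#. It is not RavenDB's actual algorithm. I'm assuming two-character pieces and a simple shared-pieces ratio (Jaccard similarity) as the score, and the list of index terms passed to Suggest is whatever you hand it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuggestSketch
{
    // Break a term into overlapping pieces of the given size (n-grams).
    public static HashSet<string> Pieces(string term, int size = 2)
    {
        var pieces = new HashSet<string>();
        for (var i = 0; i + size <= term.Length; i++)
            pieces.Add(term.Substring(i, size));
        return pieces;
    }

    // Similarity = shared pieces / all pieces, a value between 0 and 1.
    public static double Similarity(string a, string b)
    {
        var pa = Pieces(a.ToLowerInvariant());
        var pb = Pieces(b.ToLowerInvariant());
        var union = pa.Union(pb).Count();
        return union == 0 ? 0 : (double)pa.Intersect(pb).Count() / union;
    }

    // Rank the index terms by how similar they are to what the user typed.
    public static List<string> Suggest(
        IEnumerable<string> indexTerms, string input, int take = 3)
    {
        return indexTerms
            .Select(term => new { term, score = Similarity(term, input) })
            .Where(x => x.score > 0)
            .OrderByDescending(x => x.score)
            .Take(take)
            .Select(x => x.term)
            .ToList();
    }
}
```

In this toy version, a misspelled input still shares most of its pieces with the term the user meant, which is why that term floats to the top. The expensive part (cutting every term into pieces) can be done ahead of time, which is the trade-off described above: more work at indexing time, cheap lookups at query time.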

The result of the query in Listing 10.16 can be seen in Figure 10.7.

Figure 10.7 Suggested alternatives for "Martin"

You can use the results of the suggestion query to show results to the user, or you can ask them what they meant, similar to how Google does it. This feature tends to get enthusiastic responses when users run into it.

I gave an example using a Unicode character because it's easy to see how that would trip up a user, but the same applies to any type of misspelling, such as the one in Listing 10.17.

Listing 10.17 Asking RavenDB to suggest other options for 'Summer' from the terms in the index


from index 'Companies/Search' 
select suggest(Query, "Summer")

The query in Listing 10.17 will give sommer and cramer as the possible suggestions for summer.

I focused heavily on finding what the user meant when they misspelled something and didn't find anything, but suggestions can also be useful when you did find something but want to let the user know there are additional alternatives they might want to explore. Sometimes, this can give the user a reason to go exploring, although the "more like this" feature is more appropriate there.

More like this

My toddler likes playing "what looks like this" and it's a lot of fun. In a more serious setting, there are a lot of use cases for "find me more stuff that looks like this." In bug tracking, it might be finding a previous occurrence of a bug. With a product catalog, that might be finding another item that's roughly the same.

What the "more like this" feature is and isn't

The way "more like this" works beneath the surface is quite simple. You mark a field as full text search and define it to have a term vector. These two things together provide RavenDB the ability to build a query to find similar documents. There are some smarts around how we decide what the query should actually look like, but that's the basis of how it works.

In many cases, this simple approach works well to find similar and related documents, especially if your dataset is large and the documents and data you're indexing are complex. In this case, the more data you have, the more information RavenDB has to tell a genuinely similar document apart from what's just random noise that is common in your domain.

This isn't the basis of a recommendation engine, though. It's a good start and allows you to hit the ground running and demonstrate a feature quickly. But while there's quite a lot of tuning that you can do (see the online documentation for the full details), it's a feature that was developed to find documents based on shared indexed terms and nothing beyond that. True recommendation engines can do much more.

To make true use of the "more like this" feature, we typically use a large dataset and utilize large text fields that give us enough information to distinguish between noise and what's of real value. This is especially true when we're talking about user-generated content, such as comments, posts, emails and the like. It's of somewhat less use for static fields, but there are still quite a few interesting use cases for this feature.

We'll start by defining the Orders/Search index, as shown in Listing 10.18.

Listing 10.18 Index definition to use as more like this target to find similar orders


from o in docs.Orders
select new
{
    Address = new[] { o.ShipTo.City, o.ShipTo.Country },
    Products = o.Lines.Select(x => x.Product)
}

Before we can start performing a "more like this" query, we need to configure a term vector for these fields, as shown in Figure 10.8.

Figure 10.8 Defining term vectors (for use with more like this) in the Orders/Search index

Setting a field to use a term vector tells RavenDB we need to store enough information about each index entry to not only be able to tell from a term what index entries contained it but also to be able to go from an index entry to all the terms that it contained. A term may appear multiple times in an index entry (common when we're using full text search). For example, the word term has appeared multiple times in this paragraph. A term vector for the paragraph will contain a list of all unique words and how often they appeared.
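
As a rough illustration (a sketch only, not how RavenDB stores term vectors internally), a term vector for a single index entry can be thought of as a dictionary from each unique term to the number of times it appears:

```csharp
using System;
using System.Collections.Generic;

public static class TermVectorSketch
{
    // Maps each unique term in one index entry to the number of times it appears.
    public static Dictionary<string, int> Build(IEnumerable<string> terms)
    {
        var vector = new Dictionary<string, int>();
        foreach (var term in terms)
        {
            // count stays 0 when the term hasn't been seen yet
            vector.TryGetValue(term, out var count);
            vector[term] = count + 1;
        }
        return vector;
    }
}
```

The regular index answers "which entries contain this term?", while this structure answers the reverse question: "which terms does this entry contain, and how often?"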

Why didn't we define full text search on the fields in Orders/Search?

You might have noticed in Figure 10.8 that we didn't set the fields to use full text search. Why is that? The values we're going to put into these fields (a city, a country or a product ID) are all terms that we don't want to break up. In this case, using full text search on those fields would break them apart too much. For example, if we have a city named "New York," that is the city name. We don't want to break it into "New" and "York".

Another thing to note is that the Address field in the index is actually using an array to store multiple values in the same field. It's similar to using an analyzer, but for one-off operations, this is often easier. You can look at the index terms to see what got indexed.

With all of the preparations complete, we can now actually make our first "more like this" query, which is shown in Listing 10.19.

Listing 10.19 Finding similar orders to 'orders/535-A' based on products ordered and the city and country it was shipped to


from index 'Orders/Search' 
where morelikethis(id() = 'orders/535-A')

The query asks RavenDB to find similar orders to orders/535-A based on the Orders/Search index. This index has two fields marked with term vectors, which are used for this purpose. The orders/535-A document has a single purchased product (products/31-A) and was shipped to Buenos Aires, Argentina. Based on this, RavenDB will construct a query similar to this one: where Address = 'argentina' or Address = 'buenos aires' or Products = 'products/31-a'

Note that the casing on the query parameters is intentional because the data comes directly from the index terms, which were changed to lowercase by the default analyzer. I'm doing this to explicitly show the source of the data.

The query in Listing 10.19 returns 65 results. It's the same as if we replaced the morelikethis() call with the equivalent query that we figured out manually. Note that issuing the query on orders/830-A, for example, will generate a query with 25 items.

Why go into so much detail about how this is actually implemented? Well, consider the implications. As we know, in such queries, the more matches we have, the higher the rank of a result. So this explains how we get the "more like this" aspect. We query on the matches, and the higher the number of matches, the more we'll consider it similar to the target and return it. It's magic demystified but hopefully still pretty cool and a nice way to give the user something to follow up on.
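
The idea can be sketched in plain C#, in the same spirit as Listing 10.1. This is a simplification under my own assumptions: each index entry is represented as a set of terms keyed by document ID, the score is nothing more than the number of terms shared with the source entry, and ties are broken by ID just to keep the output stable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MoreLikeThisSketch
{
    // entries: document ID -> all the terms stored for that index entry.
    // Returns the IDs of the other entries that share at least one term with
    // the source, with the entries sharing the most terms first.
    public static List<string> FindSimilar(
        Dictionary<string, HashSet<string>> entries, string sourceId)
    {
        var sourceTerms = entries[sourceId];
        return entries
            .Where(e => e.Key != sourceId)
            .Select(e => new
            {
                Id = e.Key,
                Shared = e.Value.Count(term => sourceTerms.Contains(term))
            })
            .Where(x => x.Shared > 0)
            .OrderByDescending(x => x.Shared)
            .ThenBy(x => x.Id, StringComparer.Ordinal)
            .Select(x => x.Id)
            .ToList();
    }
}
```

An entry that shares the city, the country and the product with the source ranks above one that shares only the product, which is exactly the "more matches, higher rank" behavior described above.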

I'm not going to cover all the options of morelikethis() here, but I wanted to point out that you have a great deal of control over exactly how RavenDB is going to match these documents. Take a look at Listing 10.20, where we want to apply the morelikethis() only on the Address field.

Listing 10.20 Querying with morelikethis() on a specific field


from index 'Orders/Search' 
where morelikethis(id() = 'orders/535-A', '{"Fields": ["Address"] }')

The result of the query in Listing 10.20 is just 15 similar orders: those that were sent to the same city and country. Other options allow you to decide what terms will be considered for the morelikethis() query based on their frequency of use and overall popularity. You can find the full details in the online documentation.

We're now going to look at another element of the RavenDB querying capabilities: facets and how we can use them to get all sorts of information about the result set for your queries.

Facets

To query is to ask a question about something. Usually the questions we ask are some variant of "give me all the documents matching this pattern", such as when we ask to get the last 50 orders from a particular customer. This is easy to understand and reason about. We run into problems when the size of the result set is so big that it's not meaningful for the user.

Consider the case of a support engineer fielding a phone call. The customer reports an error, so the engineer searches the knowledge base for all articles with "error" in them. As you can imagine, there are likely going to be quite a few of these. The poor engineer is unlikely to find something meaningful using such a strategy. On the other hand, very often we have a pretty good idea about the general query we want but not a clue how to narrow it down.

Facets are widely used, but they're the sort of feature that you don't really pay attention to. A good example of that would be YouTube. As you can see in Figure 10.9, searching YouTube for "Dancing" is an interesting experience. How would I be able to choose from over 280 million different videos?

Figure 10.9 Faceted search in YouTube

Facets allow me to narrow down the search quite easily by exposing the inner structure of the data. There are two searches shown in Figure 10.9: the first doesn't have any filters applied, and the second filters for 4K and uploaded today. This reduces the results to a far more manageable number. It also gives me additional information, such as the fact that there are no matches with the type "Show" in the results. You can see another good example of facets in commerce. If I want to buy a new phone, I have way too many options. Searching eBay for "phone" gives me over 300,000 results. Figure 10.10 shows how eBay uses facets to help you narrow down the selection to just the right one.

Figure 10.10 Facets can help a customer to narrow down to the exact product they want

In some cases, this is just there to help you feed the system the exact query it needs. In many other cases, facets actively assist the user in figuring out what kind of questions they need to ask. The feedback from the numbers in Figure 10.10, in contrast to the match / no match indication in Figure 10.9, is another factor, giving the user the ability to guide their searches.

Facets are a cool feature indeed. Let's see how you can use them in RavenDB. Facets require that you define an index for the fields you want to query and apply facets on. In this case, we'll use the Product/Search index in the sample data set. We'll start with the simple faceted query shown in Listing 10.21.

Listing 10.21 Range and field facets on product search


from index 'Product/Search'
select 
    facet(
        PricePerUnit < 10, 
        PricePerUnit between 10 and 50, 
        PricePerUnit between 51 and 100, 
        PricePerUnit  > 100
    ) as Price,
    facet(Category),
    facet(Supplier)

The results of the query shown in Listing 10.21 can be seen in Figure 10.11. The query itself is composed of three facets: a range facet on the PricePerUnit field and two field facets on Category and Supplier. As you can see, in the case of the range facet, we group all the matches in each particular range. And in the case of the field facets, we group by each individual value.

Figure 10.11 Faceted query results

The query in Listing 10.21 is simple since it has no where clause. This is where you'll typically start — just giving the user some indication of the options they have for queries. Let's say the user picked suppliers 11 and 12 as the ones they want to drill down into. The query will then look like the one in Listing 10.22.

Listing 10.22 Faceted search over particular suppliers


from index 'Product/Search'
where Supplier in ('suppliers/11-a', 'suppliers/12-a')
select 
    facet(
        PricePerUnit < 10, 
        PricePerUnit between 10 and 50, 
        PricePerUnit between 51 and 100, 
        PricePerUnit  > 100
    ) as Price,
    facet(Category),
    facet(Supplier)

In Listing 10.22, we're querying over the same facets, but we're now also limiting it to just particular suppliers. The output of the Price facet will change, as shown in Listing 10.23.

Listing 10.23 'Price' facet output from the query in Listing 10.22


{
    "Name": "Price",
    "Values": [
        {
            "Count": 1,
            "Range": "PricePerUnit < 10"
        },
        {
            "Count": 6,
            "Range": "PricePerUnit between 10 and 50"
        },
        {
            "Count": 0,
            "Range": "PricePerUnit between 51 and 100"
        },
        {
            "Count": 1,
            "Range": "PricePerUnit > 100"
        }
    ]
}

As you can see, the number of results for each range has changed to reflect the new filtering done on the query.

The facets portion of the query is the very last thing that happens, after the entire query has been processed. This means that you can use any where clause you want and filter the results accordingly. However, any query that uses facet() must return only facet() results and cannot use clauses such as include or load.
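
The order of operations (filter first, then count) is easy to see in a small in-memory sketch. None of this is RavenDB code: SketchProduct is a hypothetical class that mirrors the three indexed fields, and the range predicates are passed in by the caller rather than parsed from a query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an index entry in Product/Search.
public class SketchProduct
{
    public string Supplier;
    public string Category;
    public decimal PricePerUnit;
}

public static class FacetSketch
{
    // Field facet: count the matching products per distinct field value.
    public static Dictionary<string, int> FieldFacet(
        IEnumerable<SketchProduct> matches, Func<SketchProduct, string> field)
    {
        return matches.GroupBy(field).ToDictionary(g => g.Key, g => g.Count());
    }

    // Range facet: count the matching products that fall into each named range.
    // A range with no matches is still reported, with a count of zero.
    public static Dictionary<string, int> RangeFacet(
        IEnumerable<SketchProduct> matches,
        Dictionary<string, Func<decimal, bool>> ranges)
    {
        var list = matches.ToList();
        return ranges.ToDictionary(
            range => range.Key,
            range => list.Count(p => range.Value(p.PricePerUnit)));
    }
}
```

To mimic Listing 10.22, you would first apply the where clause (for example, products.Where(p => p.Supplier == "suppliers/11-a")) and only then hand the filtered sequence to FieldFacet and RangeFacet. The facets never see documents that the filter rejected.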

Faceted queries are typically combined with the same query, sans the facets, to show the user the first page of the results as they keep narrowing down their selections. You'll typically use the Lazy feature to combine such multiple queries, as was discussed in Chapter 4.

Spatial indexes

In Chapter 9, we discussed spatial queries and used them with automatic indexes. Using a static index gives you a high degree of control over the spatial indexing that RavenDB will perform on your data. Let's create a new index called Companies/Spatial, as shown in Listing 10.24.

Listing 10.24 Defining an index with spatial support


from c in docs.Companies
select new
{
    Name = c.Name,
    Coordinates = CreateSpatialField(
            c.Address.Location.Latitude, 
            c.Address.Location.Longitude)
}

The CreateSpatialField method instructs RavenDB to use the provided latitude and longitude to create a spatial field named Coordinates. As usual, even though there are some companies with a null Address.Location, we can safely ignore that. RavenDB will handle null propagation during indexing and save us from all those null checks.

With this index, we can now perform spatial queries. You can see such a query in Listing 10.25, querying for companies within a mile of 605 5th Ave. S., Seattle, Washington.

Listing 10.25 Spatial querying for companies using a static index


from index 'Companies/Spatial' 
where spatial.within(Coordinates, 
    spatial.circle(1, 47.5970, -122.3286, 'miles'))

The query in Listing 10.25 has a single result: the White Clover Markets company. You can also pass CreateSpatialField a WKT string representing any arbitrary shape that will be indexed by RavenDB. So far, this seems to be pretty much the same as what we've previously done with auto indexes. Static indexes, however, allow you to customize the spatial indexing behavior. So let's see how.

Go to the Companies/Spatial edit index page and click on the Add field button. Set the Field name to Coordinates and click on the Spatial toggle. The result should be similar to Figure 10.12.

Figure 10.12 Spatial field indexing options

You can use these options to have fine-grained control over exactly how RavenDB will index your spatial data and process spatial queries. Of particular interest is the Max Tree Level field, which controls how precise the spatial queries are going to be and directly relates to the cost of spatial indexing.

How RavenDB handles spatial indexing and queries

This section isn't going to be an exhaustive study of how spatial indexing works, nor will it dive too deeply into the actual implementation. The first topic is too broad for this book (and there are excellent resources online), and the second topic is unlikely to be of much interest to anyone who isn't actually implementing the querying support. Full details on the spatial behaviors and options are available in the online documentation. What this section is going to do is to give you a good idea of how RavenDB performs spatial queries — enough so that you'll be able to reason about the impact of the decisions you're making.

RavenDB offers three different spatial indexing strategies:

  • Bounding box
  • Geohash prefix tree
  • Quad prefix tree

To demonstrate the difference between these strategies, we'll use the Terms feature to see what's actually going on underneath it all. Go into Companies/Spatial and edit the Coordinates field to the BoundingBox strategy. Click Save and then go into the index terms. You should see something similar to Figure 10.13.

Figure 10.13 Bounding box indexing behind the scenes

The bounding box strategy is the simplest one. Given a spatial shape, such as a point, a circle or a polygon, it computes that shape's bounding box and indexes its location. These are the Coordinates__minX, Coordinates__minY, Coordinates__maxX and Coordinates__maxY fields that you can see in Figure 10.13.

As for the actual values of these fields, these are the spatial coordinates that match the bounding box. Whenever you make a query, RavenDB will translate it to the same bounding box system. You can see a sample of this query translation in Listing 10.26.

Listing 10.26 Translated spatial query using bounding box


// actual query
from index 'Companies/Spatial' 
where spatial.within(Coordinates, spatial.circle(1, 47.5970, -122.3286, 'miles'))

// what actually gets executed
from index 'Companies/Spatial' 
where Coordinates__minX >= -125.3341 and
    Coordinates__maxX <= -119.3230 and 
    Coordinates__minY >= 45.5707 and
    Coordinates__maxY <= 49.623237136915

As you can imagine, this is a pretty cheap way to handle spatial queries, both during the indexing portion and at the time of the query. However, it suffers from a number of issues related to the accuracy of the solution. In many cases, you can wing it and get by with a bounding box, but in truth it's limited in the kind of queries it can perform and how well it can answer them. In particular, the bounding box assumes that the world is flat (or at least that the bounding box is small enough that it can ignore the curvature of the earth).
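
The comparison behind the translated query in Listing 10.26 is simple enough to sketch in plain C#. SketchBox is a hypothetical type, not part of RavenDB. I'm assuming that X holds the longitude and Y holds the latitude, which is what the values in Listing 10.26 suggest, and that a point is indexed as a box whose minimum and maximum coincide.

```csharp
using System;

// Hypothetical stand-in for the four Coordinates__* fields of an index entry.
public struct SketchBox
{
    public double MinX, MinY, MaxX, MaxY;   // assumption: X = longitude, Y = latitude

    // A point is treated as a degenerate box whose min and max coincide.
    public static SketchBox FromPoint(double latitude, double longitude)
    {
        return new SketchBox
        {
            MinX = longitude, MaxX = longitude,
            MinY = latitude, MaxY = latitude
        };
    }

    // True when this (indexed) box lies entirely inside the query box.
    // These are the same four comparisons as in the translated query.
    public bool IsWithin(SketchBox query)
    {
        return MinX >= query.MinX && MaxX <= query.MaxX &&
               MinY >= query.MinY && MaxY <= query.MaxY;
    }
}
```

Four numeric comparisons per entry is all it takes, which is why this strategy is so cheap. It's also why it's imprecise: anything in the corners of the query's bounding box passes, even though it lies outside the circle that the user actually asked for.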

Let's move to the next option and look at the geohash strategy. Go to the index, update the spatial option to GeohashPrefixTree and save the index. Now, go to the index terms, and you'll find something similar to Figure 10.14.

Figure 10.14 Geohash indexing behind the scenes

The pyramids you see are the actual geohashes, but before we can start talking about how RavenDB uses them, we need to explain what they are. The way geohashes work is by dividing the world into a grid with 32 buckets. They then divide each bucket in the grid further into another 32 buckets, and so on. You can play with geohashing in an interactive manner using the following sites:

Figure 10.15 A map showing the top level of geohash

By looking at the map in Figure 10.15 and the terms from Figure 10.14, we can see that the prefix 6 covers most of South America. The next level we have, 69, is mostly Argentina, and 69y is Buenos Aires. In other words, the longer the geohash, the more precise it is.

Figure 10.16 A map showing multiple levels of geohash

Look at the spatial indexing options in Figure 10.12. You can see the Max Tree Level there, which determines the accuracy of the spatial indexing. This, in turn, determines the length of the geohash. A tree level of 9 (the default) gives us a resolution of about 2.5 meters. That somewhat depends on the exact location on the earth that you are searching.
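
To see why a longer geohash means a more precise location, here is a minimal sketch of the standard geohash encoding in plain C#. It isn't RavenDB's implementation, and the length parameter is my stand-in for the tree level discussed above. Each character encodes five bits, which is where the 32 buckets per level come from.

```csharp
using System;
using System.Text;

public static class GeohashSketch
{
    // The standard geohash alphabet: 32 symbols, one per bucket.
    private const string Base32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a point as a geohash of the given length. Each bit halves either
    // the longitude range or the latitude range (alternating, longitude first),
    // so every extra character narrows the cell down by a factor of 32.
    public static string Encode(double latitude, double longitude, int length)
    {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        var hash = new StringBuilder(length);
        var useLongitude = true;
        int bits = 0, bitCount = 0;

        while (hash.Length < length)
        {
            if (useLongitude)
            {
                var mid = (minLon + maxLon) / 2;
                if (longitude >= mid) { bits = (bits << 1) | 1; minLon = mid; }
                else { bits <<= 1; maxLon = mid; }
            }
            else
            {
                var mid = (minLat + maxLat) / 2;
                if (latitude >= mid) { bits = (bits << 1) | 1; minLat = mid; }
                else { bits <<= 1; maxLat = mid; }
            }
            useLongitude = !useLongitude;

            if (++bitCount == 5)
            {
                // Five bits collected: emit one character and start over.
                hash.Append(Base32[bits]);
                bits = 0;
                bitCount = 0;
            }
        }
        return hash.ToString();
    }
}
```

Using approximate coordinates for the center of Buenos Aires (latitude -34.6037, longitude -58.3816), Encode returns 6 for a length of 1, 69 for a length of 2 and 69y for a length of 3, the same prefixes we saw in the index terms. A shorter hash is always a prefix of a longer one for the same point, which is what makes prefix matching on the terms possible.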

When indexing a shape, RavenDB will pixelate it to the required resolution and enter the geohashes that cover it. The more irregular the shape and the higher the precision required, the more work that's needed to generate the terms that match the query. At query time, we do the reverse and find the matches. Note that with the geohash strategy, the geohash indexing creates the relevant shapes in a rough fashion, after which we check whether all the shapes inside the geohash match our actual query.

In other words, with spatial queries using geohash (or quad), there are two stages to the query. First, do a rough match on the shapes. This is what you see in the index terms. Second, have the actual spatial geometry check to see if the shape matches the query.

The quad tree strategy is similar in many respects to the geohash but uses a different coordinate system (a grid of four buckets, hence the name quad). Quad buckets are always squares, while geohash buckets can be rectangular. This might make a difference if you're using heatmaps and want a more predictable zoom in/out.

Figure 10.17 Coordinates terms when indexing using quad prefix tree

Geohash is more widely used, and it's supported by many platforms and tools. The selection of geohash or quad typically doesn't have any major effect on indexing speed or queries; it primarily depends on what you're going to be using this for. If you're going to display spatial data, you'll probably want to select a mode that works best with whatever mapping component you're using.

Bounding box strategy is much cheaper at indexing time and can produce very quick queries, but that's at the expense of the level of accuracy you can get. This is good if what you typically care about is rough matches. For example, if your query is something like "find me the nearest restaurants", you likely don't care too much about the shape. On the other hand, if you're asking for "give me all the schools in this district", that's something quite different, and you'll care about the exact shape the query is processing.

The performance cost of spatial indexing is directly related to the tree level you choose, and a very granular level with complex shapes can cause long indexing times. You'll usually not notice that, because indexing is async and doesn't hold up other tasks, but it can impact operations when you are creating a new index or storing a large number of entities to the database all at once. In particular, if you have an expensive spatial index, it is usually better to avoid indexing documents that have a very high rate of change, since RavenDB will need to re-index them every time, even if the spatial fields didn't change.

In such cases, it might be better to move the spatial data to a dedicated collection containing just that, so only the documents with the spatial data will be indexed if they change. This is typically not required, since even with spatial indexing, the indexing is fast enough for most cases, but it can be a valid approach if you need to speed up indexing times and are getting held up by the spatial indexing level you require.

Splitting the spatial data into a separate collection brings up an interesting issue, though. How do we deal with indexing data that is spread across multiple documents?

Indexing referenced data

The document modeling rules we've reviewed so far call for documents to be independent, isolated and coherent. Documents should be able to stand on their own without referencing other documents. Sometimes, however, you need data from a related document to create a good search experience.

For example, using the sample dataset, you want to search for employees using the name of their manager. In general, I dislike such requirements. I would much rather have the user first search for the appropriate manager and then find all the employees that report to this manager. From many perspectives, from the purity of the model to the user experience you can provide, this is the better choice.

In particular, searching employees by a manager's name leads to confusion about which manager you're talking about if more than one manager has the same name. Even so, sometimes this is something you just have to do. The users may require it to behave in this manner because the old system did it this way, or maybe the model they use actually calls for this. So how would you do it?

One way to handle this requirement is to store the document field on the related entity. In other words, you'd store the manager's name in the employee document. That works, but it will rightly raise eyebrows when you look at your model. This isn't a value we want to freeze in time, such as a product's price or its name in the Orders collection. A manager's name is their own, and they're free to modify it as they wish. If we stored that name in the employee document, it would mean we'd have to update the employee document whenever we updated the manager. That is...not so nice, even if RQL makes it easy, as you can see in the patching done in Listing 10.27.

Listing 10.27 Using RQL to update the manager's name in all the managed employees


from Employees as e
where e.ReportsTo = $managerId
update {
    e.ManagerName = $managerName;
}

The only thing that's of interest in Listing 10.27 is the passing of parameters to both the query and the update clause. Other than that, it's a perfectly ordinary patch. However, having to do such things will likely result in the ghost of "you shoulda normalized that" haunting you. In particular, while this can be a viable solution for certain things, it isn't elegant, and it goes against the grain of the usual thinking in database design.

Another option is to ask RavenDB to handle this explicitly during the indexing process, as you can see in Listing 10.28.

Listing 10.28 Getting values from a related document at indexing time


from e in docs.Employees
let manager = LoadDocument(e.ReportsTo, "Employees")
select new
{
    e.FirstName, 
    e.LastName,
    ManagerFirstName = manager.FirstName,
    ManagerLastName = manager.LastName
}

I've created Listing 10.28 as an index named Employees/Search. And I can query it as shown in Listing 10.29:

Listing 10.29 Querying information from a related document


from index 'Employees/Search'
where ManagerLastName = 'fuller'

As you can see, we're able to query the related information. But how does this work? In Listing 10.28, you can see the use of an unfamiliar function: LoadDocument(e.ReportsTo, "Employees"). This is the key for this feature. At indexing time, RavenDB will load the relevant document and allow you to index its values. Note that we need to specify which collection we're loading the document from.

This is all well and good, but what happens when the manager's name changes? Go ahead and change employees/2-A's LastName to Smith. Then execute the query in Listing 10.29 again. You'll find no results. But if you run it on ManagerLastName = 'Smith', you'll find the missing documents.

In other words, when using LoadDocument in the index, you can index data from a related document, and it's RavenDB's responsibility to keep such related data up to date. In fact, this is why we need to know which collection you're loading the document from: so we can watch it for changes (in a similar way to how we watch the collection that we're indexing). The loaded documents can be from the collection we're already indexing (as is the case in Listing 10.28) or an unrelated collection.

The cost of tracking related documents

The LoadDocument behavior doesn't come without costs. And these costs come in two areas. First, during indexing, we need to load the relevant document. That may require us to read a document that currently resides on the disk, skipping many of the performance optimizations (such as prefetching) that we apply when we need to index a batch of documents. This cost is usually not large and is rarely an issue.

Of more concern is the fact that whenever any document in a collection that's referenced by LoadDocument is changed, the index needs to scan and assess whether any references need to be reindexed. This is usually the more expensive portion of this feature.

When you consider using the LoadDocument feature, first consider whether this is used to paper over modeling issues. In particular, LoadDocument allows you to do "joins" during the indexing process. That can be very useful or lead you down a problematic road, depending on your usage. If a large number of documents reference a single document (or a small set of them), then whenever that referenced document is changed, all the documents referencing it will also need to be reindexed. In other words, the amount of work that an index has to do because of a single document change can be extremely large and may cause delays in indexing.

These delays will only impact the specific index using LoadDocument and will have the effect of making it do more work. But other work on the server will continue normally, and queries (including to this particular index) can run as usual (but may show outdated results). Other indexes or operations on the server will not be impacted.

The whole referenced document vs. referencing documents can be a bit confusing, so an example might be in order. In Listing 10.28, we have the Employee referencing the manager using the ReportsTo field. In other words, the referencing document is the Employee and the referenced document is the manager. (In this case, it's also a document in the Employees collection.) The topology of the references is critical in this case.

If the number of employees that report to the same manager is small, everything will work just fine. Whenever the manager document is updated, the employees managed by this document will be reindexed. However, consider the case where all employees report to the same manager and that manager's document is updated. We'd now need to index everything that referenced it, which can be a huge number of documents.

Remember that the reindexing will happen on any change in the referenced document. In Listing 10.28, we are only using the manager's FirstName and LastName, but RavenDB will reindex the referencing employees even if the only change was an updated phone number.
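As with Listing 10.1, the following code has nothing to do with how RavenDB is actually implemented. It is a minimal sketch of the bookkeeping described above, so you can see why the number of referencing documents matters. The ReferenceTrackingIndex class, its members and the simplified Employee class are all hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public string Id;
    public string LastName;
    public string Phone;
    public string ReportsTo; // id of the manager document, or null
}

// Hypothetical sketch only; this is NOT RavenDB's implementation.
public class ReferenceTrackingIndex
{
    private readonly Dictionary<string, Employee> _docs;

    // referenced document id -> ids of the documents that loaded it
    private readonly Dictionary<string, HashSet<string>> _referencedBy =
        new Dictionary<string, HashSet<string>>();

    // the "index": employee id -> last name of the employee's manager
    public readonly Dictionary<string, string> ManagerLastName =
        new Dictionary<string, string>();

    public ReferenceTrackingIndex(Dictionary<string, Employee> docs)
    {
        _docs = docs;
        foreach (var id in docs.Keys.ToList())
            IndexDocument(id);
    }

    private void IndexDocument(string id)
    {
        var employee = _docs[id];
        Employee manager = null;
        if (employee.ReportsTo != null)
        {
            // Record the reference so a later change to the manager
            // can be traced back to this employee. (Simplification:
            // stale references are never removed.)
            if (_referencedBy.TryGetValue(employee.ReportsTo, out var referencing) == false)
                _referencedBy[employee.ReportsTo] = referencing = new HashSet<string>();
            referencing.Add(id);
            _docs.TryGetValue(employee.ReportsTo, out manager);
        }
        ManagerLastName[id] = manager?.LastName;
    }

    // Returns how many documents had to be indexed because of one change.
    public int OnDocumentChanged(string id)
    {
        IndexDocument(id);
        var reindexed = 1;
        if (_referencedBy.TryGetValue(id, out var referencing))
        {
            // Any change counts, even to a field the index never reads.
            foreach (var referencingId in referencing.ToList())
            {
                IndexDocument(referencingId);
                reindexed++;
            }
        }
        return reindexed;
    }
}
```

In this sketch, changing only the Phone of a manager with two direct reports makes OnDocumentChanged index three documents: the manager and both employees. Changing one of those employees indexes just that single document. Scale the number of direct reports up to the whole company and you can see where the cost comes from.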

LoadDocument is a very powerful feature, and it deserves its place in your toolbox. But you should be aware of its limitations and costs. Unfortunately, it's used all too often as a means to avoid changing the model, which will usually just defer a problem rather than fix the modeling issues. All of that being said, if you aren't going to have a large number of references to a single document and you do want a query based on data from related documents, LoadDocument is a really nice way to do so.

Dynamic data

RavenDB is schemaless. You can store any kind of data in any way you want — to a point. This flexibility doesn't extend to querying. You do need to know what you're querying on, right? Of course, you can query on fields in a dynamic fashion, and it will work because the query optimizer will add the new field you just queried onto the index. But adding a field to an index requires us to reindex it, and that isn't quite what we want. We want to be able to say, "This index will cover any of the fields I'm interested in" and have that done regardless of the shape of the data we index.

We can do that in RavenDB quite easily with a bit of pre-planning. Let's look at Listing 10.30, showing the Employees/DynamicFields index and how RavenDB allows us to index dynamic fields.

Listing 10.30 Employees/DynamicFields index, using dynamic fields


from e in docs.Employees
select new 
{
    _ = e.Address.Select(field => 
            CreateField(field.Key, field.Value)
    ),
    __ = CreateField(e.FirstName, e.LastName),
    Name = e.FirstName + " " + e.LastName
}

There are several things going on with the Employees/DynamicFields index in Listing 10.30. First, we used the _ variable name as an index output to indicate that this index is using dynamic fields. (Otherwise, RavenDB will error with an invalid field name when you query.)4 You can see that you can mix and match dynamic field names with static field names. RavenDB doesn't mind; it just works.

We also used the CreateField method twice. The first time, we used it to index all the fields inside the Address object. In this way, we aren't explicitly listing them one at a time, and if different documents have different fields for the Address object, each document will have different fields indexed.

The second time we called CreateField is much stranger. This created a completely dynamic field whose name is the employee's FirstName and whose value is the employee's LastName. This is an example of dynamic fields that are created explicitly. With the index defined, we can now start querying it, as you can see in Listing 10.31.

Listing 10.31 Querying over dynamic fields


from index 'Employees/DynamicFields'
where City = 'London'

Even though the index doesn't have an explicit field named City, we can still query on it. For that matter, we can also query using where Nancy='Davolio'. You can add any field that you want to the Address object, and it will be indexed. No two documents need to have the same fields; the Employees/DynamicFields index shown in Listing 10.30 can accept any structure you want. Dynamic fields complete the schemaless nature of RavenDB and allow for complete freedom of operations.
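To make it clearer why both of these queries can work, here is a rough sketch, in plain C# and unrelated to RavenDB's actual code, of what the index entry for a single employee might look like once the dynamic fields are expanded. The DynamicFieldsSketch class and its BuildIndexEntry method are hypothetical names, and the address is modeled as a simple dictionary for the sake of the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch only; this is NOT RavenDB's implementation.
public static class DynamicFieldsSketch
{
    public static Dictionary<string, object> BuildIndexEntry(
        string firstName,
        string lastName,
        IDictionary<string, object> address)
    {
        var entry = new Dictionary<string, object>();

        // Mirrors e.Address.Select(field => CreateField(field.Key, field.Value)):
        // every property found on this document's Address becomes a field.
        foreach (var field in address)
            entry[field.Key] = field.Value;

        // Mirrors CreateField(e.FirstName, e.LastName):
        // the field name itself comes from the document's data.
        entry[firstName] = lastName;

        // The statically named field sits alongside the dynamic ones.
        entry["Name"] = firstName + " " + lastName;

        return entry;
    }
}
```

Calling BuildIndexEntry for an employee named Nancy Davolio whose address contains City and Country produces an entry with four fields: City, Country, Nancy and Name. A different employee with a different set of address properties would simply produce a different set of fields.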

It's still recommended that you use statically defined fields in most cases, primarily because they're easier to reason about and work with. While RavenDB doesn't actually care about the field names you have, it's usually easier if you use dynamic fields only when needed. In particular, we've seen users who used dynamic fields to index every single field in their documents. That works, and sometimes you'd want to do this. But in most cases, it's an issue of "we might need this in the future" and is rarely, if ever, used.

Indexing documents has a cost that is proportional to the number of indexed fields, so indexing things that you'll likely not need will end up costing time and disk space for no return.

Summary

We started this chapter by discussing what indexes are in RavenDB. We saw that, like an onion,5 indexes come in layers. There's the index definition, specifying what and how we should index, the index on disk and the actual indexing process. We started working with indexes explicitly by defining indexes using Linq expressions. Such expressions give us the ability to select the fields (and computed values) that we want to be able to index.

We looked into how the indexes actually work, starting from a document being transformed into the index entry by the index definition all the way to the actual processing that happens for the data being indexed. We looked at how the values we index are broken into terms according to the specified analyzer and then are stored in a way that allows quick retrieval of the associated documents. We also saw that we can store information directly in the index, although that's reserved for special cases.

Once we covered what indexes are, we started to look at how we can use static indexes in RavenDB. Multimap indexes allow us to index more than a single collection in one index, giving us the ability to merge results from different sources in a single query. Full text search is a world unto itself, but RavenDB contains everything you can wish for in this regard.

We looked at how full text search analysis processes text and how it's exposed in the index terms in the Studio. RavenDB allows you to utilize full text search using both purpose-built methods (such as StartsWith(), Search(), etc.) and a lower level interface exposed via the Lucene() method that lets you interface directly with the Lucene engine.

A nice trick you can use in indexing is to merge multiple fields from the index into a single field in the index entry, which allows you to query over all these fields at once. This kind of behavior is perfect for building search pages, avoiding the need to have a field for each option and simplifying the user interface to a single textbox. In the area of search, your users will surely appreciate you looking like Google, and RavenDB makes such behavior quite easy to implement.

In fact, as we have seen, indexes aren't limited to just plain searching. Features such as suggestions allow RavenDB to analyze your data and guess what the user actually meant to search for, and "more like this" can be used to find similar documents to the one the user is looking at.

Facets are a way to dissect the result set you get from a query and gather additional information from it. Instead of forcing your users to go through thousands or millions of results, you can have RavenDB slice and dice the results so the users can dig deeper and find exactly what they're looking for. These last three features (suggestions, "more like this" and facets) are primarily used when you need to directly expose search operations to the users. They allow you to provide an intelligent and easy-to-use interface to expose the data from your system in a way that can be easily consumed, without having to work hard to do so.

RavenDB also supports spatial queries, allowing you to find documents based on their location on the globe. We looked at how such indexes are defined and what you can do with them. We also dove into how they're actually implemented and the varying costs of the levels of accuracy you can get from spatial queries.

We peeked into how RavenDB allows you to query related data by calling LoadDocument at indexing time. This moves the responsibility of updating the indexed data from related documents to RavenDB, which may increase the indexing time but has no impact on the cost of querying the information.

Finally, we looked at how RavenDB allows you to define indexes on dynamic data without needing any common structure between the indexed documents. This is useful for user generated data and when you're working on highly dynamic systems. Instead of the usual complexity involved in such systems, with RavenDB, you can just make everything work. The database will allow you to store, retrieve and query the data in any way you want.

In the next chapter, we'll look into RavenDB's MapReduce indexes, what they can do for you and how they actually work.


  1. The general recommendation is that you have a single index per collection with all the fields that you want to search defined on that index. It's better to have fewer and bigger indexes than many smaller indexes.

  2. See the Lucene in Action and Managing Gigabytes books, recommended in the previous chapter.

  3. Pun intended.

  4. We can only use _ once in an index output, and the name of the field doesn't matter when you use CreateField, so we typically just use __, ___, etc. for the dynamic field names.

  5. Or an ogre.