Working with Indexes
We've spent the last three chapters examining how querying works in RavenDB, as well as what indexes are, how they operate and what they can do. We looked at everything from simple map indexes to spatial queries, from performing full text queries to aggregating large amounts of data using MapReduce. What we haven't talked about is how you'll work with indexes in a typical business application.
This chapter will focus on that, discussing how to create and manage indexes from the client side, how to perform queries and what options are available for us on the client API. We explored some basic queries in Chapter 4, but we only touched on queries briefly — just enough to get by while we learn more about RavenDB. Now we're going to dig deep and see everything we can do with indexes and queries in RavenDB.
Creating and managing indexes
RavenDB is schemaless. You can have documents in any shape, way or form that you like. However, indexes are one of the ways to bring back structure to such a system. An index will take the documents as input and then output the index entries in a fixed format. Queries on this index must use the fields defined on the index (unless the index is doing dynamic field generation), and there's typically a strong tie between the structure of the index, the output of queries and the client code using it.
That's interesting because it means that changing the index might cause client code to break, and that strongly brings to mind the usual issues you run into with a fixed schema. This often leads to complexities when developing and working with schemas because versioning, deploying and keeping them in sync with your code is a hassle.
RavenDB allows you to define your indexes directly in your code, which in turn allows you to version the indexes as a single unit with the
rest of your system. In order to see how this works, we'll use a C# application. Open PowerShell
and run the commands shown in
Listing 12.1.
Listing 12.1 Creating a new RavenDB project
dotnet new console -n Northwind
dotnet add .\Northwind\ package RavenDB.Client
dotnet restore .\Northwind\
The commands in Listing 12.1 just create a new console application and add the RavenDB client package to the project. Now, go to RavenDB and create a new database named Northwind. Go to Settings and then Create Sample Data and click the Create button. Click the View C# Classes link, copy the code to a file called Entities.cs and save it in the Northwind app folder.
We're now ready to start working with real indexes from the client side.
Working with indexes from the client
Before we get to defining new indexes, let's start with an easier step: querying on an existing index. Your Program.cs file should be similar to Listing 12.2.
Listing 12.2 This console application queries RavenDB for all London-based employees
using System;
using System.Linq;
using Raven.Client.Documents;
using Orders;
namespace Northwind
{
class Program
{
static void Main(string[] args)
{
var store = new DocumentStore
{
Urls = new []
{
"http://localhost:8080"
},
Database = "Northwind"
};
store.Initialize();
using (var session = store.OpenSession())
{
var londonEmployees =
from emp in session.Query<Employee>()
where emp.Address.City == "London"
select emp;
foreach (var emp in londonEmployees)
{
Console.WriteLine(emp.FirstName);
}
}
}
}
}
The code in Listing 12.2 is the equivalent of "Hello World," but it will serve as our basic structure for the rest of this chapter.
The query we have in Listing 12.2 is a pretty simple dynamic query, the likes of which we already saw in Chapter 4. It is translated to the following RQL: FROM Employees WHERE Address.City = $p0. So far, there are no surprises, and if you check the indexes on the database, you should find that the Auto/Employees/ByAddress.City index was automatically created to satisfy the query. How can we select the index we want to use for a query from the client side? You can see the answer in Listing 12.3.
Listing 12.3 Specifying the index to use for a query (using strings)
var ordersForEmployee1A =
from order in session.Query<Order>("Orders/Totals")
where order.Employee == "employees/1-A"
select order;
foreach (var order in ordersForEmployee1A)
{
Console.WriteLine(order.Id);
}
As you can see, the major difference is that we're now querying on the Orders/Totals index and we pass that as a string to the Query method. Using this method means that we need to define the index somewhere, which leads to the deployment and versioning issues that I already discussed. RavenDB has a better solution.
Defining simple indexes via client code
When using a strongly typed language, we can often do better than just passing strings. We can use the features of the language itself to provide a strongly typed answer. We'll recreate the Orders/Totals index in our C# code, as shown in Listing 12.4. (You'll need to add using Raven.Client.Documents.Indexes; to the file.)
Listing 12.4 The index is defined using a strongly typed class
public class My_Orders_Totals : AbstractIndexCreationTask<Order>
{
public My_Orders_Totals()
{
Map = orders =>
from o in orders
select new
{
o.Employee,
o.Company,
Total = o.Lines.Sum(l =>
(l.Quantity * l.PricePerUnit) * (1 - l.Discount))
};
}
}
We use My/Orders/Totals as the index name in Listing 12.4 to avoid overwriting the existing index. This way, we can compare the new index to the existing one. There are a few interesting features shown in Listing 12.4. First, we have a class definition that inherits from AbstractIndexCreationTask<T>. This is how we let RavenDB know this is actually an index definition and what type it will be working on.
The generic parameter for the My_Orders_Totals class is quite important: it's the source collection for this index. In the class constructor, we set the Map property to a Linq expression, transforming the documents into the index entries. The orders variable is of type IEnumerable<Order>, using the same generic parameter as was passed to the index class.
Now we just need to actually create this index. There are two ways of doing that. Both are shown in Listing 12.5.
Listing 12.5 Creating indexes from the client side
// create a single index
new My_Orders_Totals().Execute(store);
// scan the assembly and create all the indexes in
// the assembly as a single operation
var indexesAssembly = typeof(My_Orders_Totals).Assembly;
IndexCreation.CreateIndexes(indexesAssembly, store);
The first option in Listing 12.5 shows how we can create a single index. The second tells RavenDB to scan the assembly provided and create all the indexes defined there.
Automatically creating indexes
The IndexCreation.CreateIndexes option is a good way to avoid managing indexes manually. You can stick this call somewhere in your application's startup during development and as an admin action in production. This way, you can muck about with the index definitions as you wish, and they'll always match what the code is expecting. In other words, you can check out your code and run the application, and the appropriate indexes for this version of the code will be there for you, without you really having to think about it. For production, you might want to avoid automatic index creation on application startup and put that behind an admin screen or something similar. But you'll still have the option of ensuring the expected indexes are actually there. This makes deployments much easier because you don't have to manage the "schema" outside of your code.
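A minimal sketch of this pattern might look like the following. The DocumentStoreHolder class and the RAVEN_AUTO_INDEXES environment check are assumptions for illustration; use whatever configuration mechanism your application already has.

```csharp
using System;
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;

public static class DocumentStoreHolder
{
    public static IDocumentStore Create()
    {
        var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" },
            Database = "Northwind"
        };
        store.Initialize();

        // During development, deploy all indexes defined in this
        // assembly on startup. In production you may prefer to gate
        // this behind an explicit admin action instead.
        if (Environment.GetEnvironmentVariable("RAVEN_AUTO_INDEXES") != "false")
        {
            IndexCreation.CreateIndexes(
                typeof(DocumentStoreHolder).Assembly, store);
        }

        return store;
    }
}
```

The environment-variable guard is just one way to keep the convenience during development while retaining control over when indexes are deployed in production.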
After running the code in Listing 12.5, you'll see that there is an index named My/Orders/Totals in the database. By convention, we replace _ with / in the index name. Now is the time to try to query this index, in a strongly typed manner, as you can see in Listing 12.6.
Listing 12.6 Specifying the index to use for a query (strongly typed)
var ordersForEmployee1A =
from order in session.Query<Order, My_Orders_Totals>()
where order.Employee == "employees/1-A"
select order;
The second generic parameter to Query is the index we want to use, and the first one is the item we're querying on. Note that in this case, what we query on and what we're getting back is the same thing, so we can use Order as both the item we query on and the return type. But that isn't always the case.
Working with complex indexes using strongly typed code
As we've seen in previous chapters, there isn't any required correlation between the shape of the document being indexed and the output of the index entry. In fact, there can't be if we want to support dynamic data and schemaless documents. That means that when we're talking about indexing, we're actually talking about several models that are usually either the same or very similar, but they don't have to be.
There are the following models to consider:
- The documents to be indexed.
- The index entry that was outputted from the index.
- The actual queryable fields in the index.
- The result of the query.
Consider the case of the following query: from Orders where ShipTo.City = 'London'. In this case, all four models behave as if we're querying on the Orders collection directly. But even in such a simple scenario, that isn't the case.
The documents to be indexed are the documents in the Orders collection, but what is actually being indexed here? In the simplest case, it's an index entry such as {"ShipTo.City": "London", "@id": "orders/42-A"}. When we query, we actually try to find a match for ShipTo.City = 'London', and from there we fetch the document and return it.
Consider the query in Listing 12.7, on the other hand, which adds a couple of interesting wrinkles.
Listing 12.7 Using projections and query on array to show the difference between various models
from Orders as o
where o.Lines[].Product == "products/3-A"
select {
Company: o.Company,
Quantity: o.Lines
.reduce((sum, l) => sum + l.Quantity, 0)
}
The Lines[].Product is a field that's used differently during indexing and querying. In the index entry generated from the documents, Lines[].Product is an array. But during queries, we use it in an equality comparison as if it were a normal value. This is because the array in the index entry was flattened to allow us to query any of the values in it.
The shape of the results of the query in Listing 12.7 is very different than the shape of the documents. That's because of the
projection in the select
. As long as we're working with RQL directly, we don't really notice, but how do we deal with such
different shapes on the client side?
When using a strongly typed language such as C#, we need some way to convey the differences. We can do that using explicit and implicit types. Consider the index My/Orders/Totals that we defined in Listing 12.4. Look at the Total field that we computed in the index. How are we going to be able to query on that?
We need to introduce a type, just for querying, to satisfy the compiler. An example of such a query is shown in Listing 12.8.
Listing 12.8 Using a dedicated type for strongly typed queries
public class My_Orders_Totals :
AbstractIndexCreationTask<Order, My_Orders_Totals.Result>
{
public class Result
{
public string Employee;
public string Company;
public double Total;
}
// class constructor shown in Listing 12.4
}
var bigOrdersForEmployee1A =
(
from o in session.Query<My_Orders_Totals.Result, My_Orders_Totals>()
where o.Employee == "employees/1-A" &&
o.Total > 1000
select o
).OfType<Order>().ToList();
The code in Listing 12.8 shows a common usage pattern in RavenDB. First, we define a nested type inside the index class to represent the result of the index. This is commonly called Result, IndexEntry or Entry. There's no real requirement for this to be a nested class, by the way. It can be any type that simply has the required fields. The idea here is that we just need the compiler to be happy with us.
The problem with using the My_Orders_Totals.Result class is that, while we can now use it in the where clause, we aren't actually going to get this class in the results. We'll get the full Order document. We can tell the compiler that we'll be getting a list of Order by calling OfType<Order>(). This is a client-side-only behavior, which only converts the type being used in the query and has no effect on the server-side query that will be generated.
Calling OfType doesn't close the query. We can still continue to add behavior to the query to project the relevant data or to select the sort order for the results, as you can see in Listing 12.9.
Listing 12.9 Adding projection and sorting after calling OfType
var bigOrdersForEmployee1A =
(
from o in session.Query<My_Orders_Totals.Result, My_Orders_Totals>()
where o.Employee == "employees/1-A" &&
o.Total > 1000
orderby o.Total descending
select o
).OfType<Order>();
var results =
from o in bigOrdersForEmployee1A
orderby o.Employee
select new
{
o.Company,
Total = o.Lines.Sum(x => x.Quantity)
};
The RQL generated by the query in Listing 12.9 is shown in Listing 12.10.
Listing 12.10 The RQL generated by the query in Listing 12.9
from index 'My/Orders/Totals' as o
where o.Employee = $p0 and o.Total > $p1
order by Total as double desc, Employee
select {
Company : o.Company,
Total : o.Lines.map(function(x) { return x.Quantity; })
.reduce(function(a, b) { return a + b; }, 0)
}
As you can see, even though Listing 12.9 has the orderby clauses in different locations and operating on different types, the RQL generated doesn't care about that and has the proper sorting.
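A handy trick during development: the RavenDB Linq provider lets you see the RQL a query will produce without executing it by calling ToString() on the query. This is a sketch; the exact text printed may vary slightly between client versions.

```csharp
var query =
    from o in session.Query<My_Orders_Totals.Result, My_Orders_Totals>()
    where o.Employee == "employees/1-A" && o.Total > 1000
    orderby o.Total descending
    select o;

// ToString() on a RavenDB Linq query returns the RQL text that
// would be sent to the server, without running the query.
Console.WriteLine(query.ToString());
```

Comparing this output against what you expect (such as the RQL in Listing 12.10) is a quick way to confirm the Linq provider is generating the query you intended.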
The last part is important. It's easy to get caught up with the code you have sitting in front of you while forgetting that, underneath it all, what's sent to the server is RQL. In many respects, we torture the type system in these cases to get it to both agree on the right types and to allow us to generate the right queries to the server.
Listing 12.4 shows how we can create a simple index from the client. But we're still missing a few bits. This kind of approach only lets us create very simple indexes. How are we going to handle the creation of a MapReduce index?
Defining MapReduce indexes via client code
On the client side, a MapReduce index is very similar to the simple indexes that we've already seen. The only difference is that we have an issue with the strongly typed nature of the language. In Listing 12.4, we defined the index My_Orders_Totals and used a generic parameter to indicate that the source collection (and the type this index is operating on) is Order.
However, with a MapReduce index, we have two types. One is the type that the Map
will operate on, just the same as we had before. But there's another type, which is the type that the Reduce
is going to work on. As you probably expected, we'll also pass
the second type as a generic argument to the index. Listing 12.11 shows such a MapReduce index using strongly typed code.
Listing 12.11 Defining a map-reduce index from code
public class My_Products_Sales :
AbstractIndexCreationTask<Order, My_Products_Sales.Result>
{
public class Result
{
public string Product;
public int Count;
public double Total;
}
public My_Products_Sales()
{
Map = orders =>
from order in orders
from line in order.Lines
select new
{
Product = line.Product,
Count = 1,
Total = (line.Quantity * line.PricePerUnit)
};
Reduce = results =>
from result in results
group result by result.Product
into g
select new
{
Product = g.Key,
Count = g.Sum(x => x.Count),
Total = g.Sum(x => x.Total)
};
}
}
The index My_Products_Sales in Listing 12.11 defines the Map as we have previously seen in the My_Orders_Totals index. We also have another nested class called Result. (Again, using a nested class is a mere convention; it keeps the Result class near the index using it.) However, we're also using this nested class as the generic argument for the base type.
This might look strange at first, but it's actually quite a natural way to specify a few things: first, that this index's Map source collection is Order, and second, that the output of the Map and the input (and output) of the Reduce are in the shape of the Result class.
Note that I'm using the phrase "in the shape of" and not "of type." This is because, as you can see in the select new clauses, we aren't actually returning those types there. We're returning an anonymous type.
As long as the shape matches (and the server will verify that), RavenDB doesn't care. The actual execution of the index is done on the server side and is not subject to any of the type rules that you saw in the code listing so far. It's important to remember that the purpose of all the index classes and Linq queries is to generate code that will be sent to the server. And as long as the server understands what's expected, it doesn't matter what's actually being sent.
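To build some intuition for what the server does with these expressions, here's a plain in-memory sketch (not RavenDB code) that runs the same Map and Reduce logic over a couple of hand-written orders. The sample data is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class OrderLine { public string Product; public int Quantity; public double PricePerUnit; }
class Order { public List<OrderLine> Lines; }

static class MapReduceDemo
{
    static void Main()
    {
        var orders = new[]
        {
            new Order { Lines = new List<OrderLine> {
                new OrderLine { Product = "products/1-A", Quantity = 2, PricePerUnit = 10 },
                new OrderLine { Product = "products/2-A", Quantity = 1, PricePerUnit = 5 } } },
            new Order { Lines = new List<OrderLine> {
                new OrderLine { Product = "products/1-A", Quantity = 3, PricePerUnit = 10 } } }
        };

        // Map: one entry per order line, in the shape of Result
        var mapped =
            from order in orders
            from line in order.Lines
            select new { line.Product, Count = 1, Total = line.Quantity * line.PricePerUnit };

        // Reduce: group the mapped entries by product and aggregate
        var reduced =
            from result in mapped
            group result by result.Product into g
            select new { Product = g.Key, Count = g.Sum(x => x.Count), Total = g.Sum(x => x.Total) };

        foreach (var r in reduced)
            Console.WriteLine($"{r.Product}: Count={r.Count}, Total={r.Total}");
        // products/1-A aggregates to Count=2, Total=50
    }
}
```

The real index runs incrementally on the server, of course, but the aggregation semantics are the same: the Reduce must be able to consume its own output, which is why the Map and Reduce share the Result shape.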
You can create the index in Listing 12.11 using either new My_Products_Sales().Execute(store) or by running IndexCreation.CreateIndexes(assembly, store) again. Go ahead and inspect the new index in the Studio. Remember, the index name in the RavenDB Studio is My/Products/Sales.
With the index on the server, we can now see how we can query a MapReduce index from the client side. This turns out to be pretty much the same as we've already seen. Listing 12.12 has the full details.
Listing 12.12 Querying a map-reduce index using Linq
var salesSummaryForProduct1A =
from s in session.Query<My_Products_Sales.Result, My_Products_Sales>()
where s.Product == "products/1-A"
select s;
The Query method in Listing 12.12 takes two generic parameters. The first is the type of the query — in this case, the Result class, which we also used in the index itself as the input (and output) of the Reduce function. The second generic parameter is the index that we'll use. In the case of the query in Listing 12.12, the output of the query is also the value emitted by the MapReduce index, so we don't have to play any more games with types.
Strong types and weak lies
RavenDB goes to great lengths to pretend that it actually cares about types when you're writing code in a strongly typed language. The idea is that from the client code, you'll gain the benefits of strongly typed languages, including IntelliSense, compiler checks of types, etc.
That isn't what's actually being sent to the server, though. And while the vast majority of the cases are covered with a strongly typed API, there are some things that either cannot be done or are awkward to do. For such scenarios, you can drop down a level in the API and use the string-based APIs that give you maximum flexibility.
We've seen how to create simple indexes and MapReduce indexes, but we also have multimap indexes in RavenDB. How are we going to work with those from the client side?
Multimap indexes from the client
An index can have any number of Map functions defined on it, but the code we explored so far in Listing 12.4 and Listing 12.11 only shows how to define a single map. This is because the AbstractIndexCreationTask<T> base class is meant for the common case where you have only a single Map in your index. If you want to define a multimap index from the client, you need to use the appropriate base class, AbstractMultiMapIndexCreationTask<T>, as you can see in Listing 12.13.
Listing 12.13 Defining a multimap index from the client side
public class People_Search :
AbstractMultiMapIndexCreationTask<People_Search.Result>
{
public class Result
{
public string Name;
}
public People_Search()
{
AddMap<Employee>(employees =>
from e in employees
select new
{
Name = e.FirstName + " " + e.LastName
}
);
AddMap<Company>(companies =>
from c in companies
select new
{
c.Contact.Name
}
);
AddMap<Supplier>(suppliers =>
from s in suppliers
select new
{
s.Contact.Name
}
);
}
}
The index in Listing 12.13 uses multimap to index Employees, Companies and Suppliers. We already ran into this index before, in Listing 10.12. At the time, I commented that dealing with a heterogeneous result set can be challenging — not for RavenDB or the client API, but for your code.
You can see that, in Listing 12.13, we also have a Result class that's used as a generic parameter. Technically, since we don't have a Reduce function in this index, we don't actually need it. But it's useful to have because it makes the shape of the index entry explicit. We call AddMap<T> for each collection that we want to index, and all of the AddMap<T> calls must have output in the same shape.
How about actually using such an index? Before we look at the client code, let's first consider a use case for it. The index allows me to query across multiple collections and fetch results from any of the matches.
Consider the case of querying for all the results where the name starts with Mar. You can see a mockup of how this will look in the UI in Figure 12.1.
To query this successfully from the client, we need to specify both the type of the index and the shape we're querying on. Luckily for us, we already defined that shape: the People_Search.Result nested class. You can see the query in Listing 12.14.
Listing 12.14 Querying a multimap index with heterogeneous results from the client
var results = session.Query<People_Search.Result, People_Search>()
.Where(item => item.Name.StartsWith("Mar"))
.OfType<object>();
foreach (var result in results)
{
switch(result)
{
case Employee e:
RenderEmployee(e);
break;
case Supplier s:
RenderSupplier(s);
break;
case Company c:
RenderCompany(c);
break;
}
}
In Listing 12.14, we're issuing the query on results in the shape of People_Search.Result and then telling the compiler that the result can be of any type. If we had a shared interface or base class, we could have used that as the common type for the query. The rest of the code just does an in-memory type check and routes each result to the relevant rendering code.
Linq isn't the only game in town
The RavenDB query API is built in layers. At the top of the stack, you have Linq, which gives you strongly typed queries with full support from the compiler. Below Linq, you have the DocumentQuery API, which is a bit lower level and gives the user a programmatic way to build queries. You can access the DocumentQuery API through session.Advanced.DocumentQuery<T>, as shown in the following query:

var results = session.Advanced
    .DocumentQuery<object>("People/Search")
    .WhereStartsWith("Name", "Mar")
    .ToList();

This query is functionally identical to the one in Listing 12.14, except that we're weakly typed here. This kind of API is meant for programmatically building queries, working with users' input, and the like. It's often easier to build such scenarios without the constraints of the type system. The DocumentQuery API is capable of any query that Linq can perform. Indeed, since Linq is implemented on top of DocumentQuery, that's fairly obvious. You can read more about the options available to you with DocumentQuery in the online documentation.
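As a sketch of where DocumentQuery shines, consider composing a query from optional user input, such as a search form. The form fields here are invented for the example; only conditions for values the user actually supplied are added to the query.

```csharp
// Hypothetical search-form input; either field may be empty.
string name = "Mar";          // e.g. from a "name starts with" box
string city = null;           // e.g. from an optional city filter

var q = session.Advanced.DocumentQuery<object>("People/Search");

// Chain conditions only for the inputs that were provided.
if (!string.IsNullOrEmpty(name))
    q = q.WhereStartsWith("Name", name);

if (!string.IsNullOrEmpty(city))
    q = q.WhereEquals("City", city);

var results = q.ToList();
```

Doing the same thing with Linq would mean juggling IQueryable variables and expression composition; with DocumentQuery, each condition is just another method call.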
We could have also projected fields from the index and gotten all the results in the same shape from the server. Writing such a query using C# is possible, but it's awkward and full of type trickery. In most cases, it's better to use RQL directly for such a scenario.
Using RQL from the client
Within RavenDB, RQL queries give you the most flexibility and power. Any other API ends up being translated to RQL, after all. Features such as Language Integrated Query make most queries a joy to build, and the DocumentQuery API gives us much better control over programmatically building queries. But at some point, you'll want to just write raw RQL and get things done.
There are several levels at which you can use RQL from your code. You can write the full query in RQL, you can add RQL snippets to a query and you can define the projection manually. Let's look at each of these in turn. All of the queries we'll use in this section will use the SearchResult class defined in Listing 12.15.
Listing 12.15 A simple data class to hold the results of queries
public class SearchResult
{
public string ContactName;
public string Collection;
}
Listing 12.16 shows how we can work directly with RQL. This is very similar to the query we used in Listing 10.13 a few chapters ago.
Listing 12.16 Querying using raw RQL
List<SearchResult> results = session.Advanced
.RawQuery<SearchResult>(@"
from index 'People/Search' as p
where StartsWith(Name, $name)
select
{
Collection: p['@metadata']['@collection'],
ContactName: (
p.Contact || { Name: p.FirstName + ' ' + p.LastName }
).Name
}
")
.AddParameter("$name", "Mar")
.ToList();
There are a few items of interest in the query in Listing 12.16. First, you can see that we specify the entire query using a single string. The generic parameter that's used in the RawQuery method is the type of the results for the query. Because we're specifying the query as a string, we don't need to play hard to get with the type system and can just specify what we want in an upfront manner.
The query itself is something we've already encountered before. The only surprising part is the projection, which checks whether there's a Contact property on the object or creates a new object for the Employees documents (which don't have this property).
Query parameters and RQL
In Listing 12.16, there's something that's both obvious and important to call out: the use of query parameters. We use the $name parameter and add it to the query using the AddParameter method. It's strongly recommended that you only use parameters and don't build queries using string concatenation (especially when it involves users' input). If you need to dynamically build queries, using the DocumentQuery API is preferred. And users' input should always be sent using AddParameter so it can be properly processed and not be part of the query. See also SQL Injection Attacks in your favorite search engine.
Listing 12.16 required us to write the full query as a string, which means that it's opaque to the compiler. We don't have to go full bore with RQL strings; we can ask the RavenDB Linq provider to do most of the heavy lifting and just plug in our custom extension when it's needed.
Consider the code in Listing 12.17, which uses RavenQuery.Raw to carefully inject an RQL snippet into the Linq query.
Listing 12.17 Using Linq queries with a bit of RQL sprinkled in
List<SearchResult> results =
(
    from item in session.Query<People_Search.Result, People_Search>()
    where item.Name.StartsWith("Mar")
    select new SearchResult
    {
        Collection = RavenQuery.Raw("item['@metadata']['@collection']"),
        ContactName = RavenQuery.Raw(@"(
            item.Contact || { Name: item.FirstName + ' ' + item.LastName }
        ).Name")
    }
).ToList();
Listing 12.17 isn't a representative example, mostly because it's probably easier to write it as a RQL query directly. But it serves as a good example of a non-trivial query and how you can utilize advanced techniques in your queries.
It's more likely that you'll want to use a variant of this technique with the DocumentQuery API. This is because you'll typically compose queries programmatically using this API and then want to do complex projections from the query.
This is easy to do, as you can see in Listing 12.18.
Listing 12.18 Using a custom projection with the 'DocumentQuery' API
List<SearchResult> results =
session.Advanced.DocumentQuery<People_Search.Result, People_Search>()
.WhereStartsWith(x => x.Name, "Mar")
.SelectFields<SearchResult>(QueryData.CustomFunction(
alias: "item",
func: @"{
Collection: item['@metadata']['@collection'],
ContactName: (
item.Contact || { Name: item.FirstName + ' ' + item.LastName }
).Name
}")
).ToList();
The queries in Listing 12.16, 12.17 and 12.18 produce the exact same query and the same results, so it's your choice when to use either option. Myself, I tend to use RQL for complex queries where I need the full power of RQL behind me and when I can't express the query that I want to write in a natural manner using Linq.
I use the DocumentQuery API mostly when I want to build queries programmatically, such as for search pages or queries that are composed dynamically.
Controlling advanced indexing options from the client side
In the previous section, we explored a lot of ways to project data from the People/Search index, but our query was a simple StartsWith(Name, $name). So if $name is equal to "Mar", we'll find an employee named Margaret Peacock. However, what would happen if we tried to search for "Pea"?
If you try it, you'll find there are no results. You can check the index's terms to explore why this is the case, as shown in Figure 12.2.
When you look at the terms in Figure 12.2, it's obvious why we haven't been able to find anything when searching for the "Pea" prefix. There's no term that starts with it. Our index is simply indexing the terms as is, with minimal work done on them. (It's just lowercasing them so we can run a case-insensitive search.)
We already looked at this in Chapter 10, in the section about full text indexes, so this shouldn't come as a great surprise. We need to mark the Name field as a full text search field. But how can we do that from the client side? Listing 12.19 shows the way to do it.
Listing 12.19 Configuring index fields options via code
public class People_Search :
AbstractMultiMapIndexCreationTask<People_Search.Result>
{
public People_Search()
{
// AddMap calls from Listing 12.13
// removed for brevity
Index(x => x.Name, FieldIndexing.Search);
Suggestion(x => x.Name);
}
}
In Listing 12.19, you can see the Index method. It configures the indexing option for the Name field to full text search mode. And the Suggestion method is used, unsurprisingly enough, to indicate that this field should have suggestions applied to it.
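With the field marked for full text search, a prefix search on a term inside the name can now succeed, and the Suggestion definition enables suggestion queries. The following is a sketch; the exact API shape may differ slightly between client versions.

```csharp
// Full text search on Name now matches individual terms, so a
// wildcard search for "Pea*" can find "Margaret Peacock".
var matches = session.Query<People_Search.Result, People_Search>()
    .Search(x => x.Name, "Pea*")
    .OfType<object>()
    .ToList();

// Ask the server for suggested terms for a misspelled name.
var suggestions = session.Query<People_Search.Result, People_Search>()
    .SuggestUsing(x => x.ByField(y => y.Name, "Peacok"))
    .Execute();

foreach (var term in suggestions["Name"].Suggestions)
    Console.WriteLine(term);
```

Both calls require the index configuration from Listing 12.19 to already be deployed; without the Search marking, the wildcard query would still find nothing, and without the Suggestion call, the suggestion query would fail.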
Creating weakly typed indexes
In addition to the strongly typed API exposed by AbstractMultiMapIndexCreationTask and AbstractIndexCreationTask, you can also use the weakly typed API to control every aspect of the index creation, such as with the following code:

public class People_Search : AbstractIndexCreationTask
{
    public override IndexDefinition CreateIndexDefinition()
    {
        return new IndexDefinition
        {
            Maps =
            {
                @"from e in docs.Employees
                  select new
                  {
                      Name = e.FirstName + ' ' + e.LastName
                  }",
                @"from c in docs.Companies
                  select new
                  {
                      c.Contact.Name
                  }",
                @"from s in docs.Suppliers
                  select new
                  {
                      s.Contact.Name
                  }"
            },
            Fields =
            {
                ["Name"] = new IndexFieldOptions
                {
                    Indexing = FieldIndexing.Search
                }
            }
        };
    }
}

You're probably sick of the People/Search index by now, with all its permutations. The index definition above behaves just the same as all the other People/Search indexes we looked at, including being picked up by IndexCreation automatically. It just gives us the maximum amount of flexibility in all aspects of the index.
There are other options available, such as using Store to store the fields, Spatial for geographical indexing and a few more advanced options that you can read about in the online documentation. Anything that can be configured through the Studio can also be configured from code.
MultimapReduce indexes from the client
The last task we have for building indexes from client code is to build a MultimapReduce index. This is pretty straightforward, given what we've done so far. We need to define an index class inheriting from AbstractMultiMapIndexCreationTask, define the Maps using the AddMap methods and finally define the Reduce function. Listing 12.20 shows how this is done.
Listing 12.20 MultimapReduce index to compute details about each city
public class Cities_Details :
AbstractMultiMapIndexCreationTask<Cities_Details.Result>
{
public class Result
{
public string City;
public int Companies, Employees, Suppliers;
}
public Cities_Details()
{
AddMap<Employee>(emps =>
from e in emps
select new Result
{
City = e.Address.City,
Companies = 0,
Suppliers = 0,
Employees = 1
}
);
AddMap<Company>(companies =>
from c in companies
select new Result
{
City = c.Address.City,
Companies = 1,
Suppliers = 0,
Employees = 0
}
);
AddMap<Supplier>(suppliers =>
from s in suppliers
select new Result
{
City = s.Address.City,
Companies = 0,
Suppliers = 1,
Employees = 0
}
);
Reduce = results =>
from result in results
group result by result.City
into g
select new Result
{
City = g.Key,
Companies = g.Sum(x => x.Companies),
Suppliers = g.Sum(x => x.Suppliers),
Employees = g.Sum(x => x.Employees)
};
}
}
Listing 12.20 is a bit long, but it matches up to the index we defined in the previous chapter, in Listing 11.13. And the only new thing in the Cities_Details index class is the use of select new Result instead of using select new to create an anonymous class. This can be helpful when you want to ensure that all the Maps and the Reduce are using the same output. RavenDB strips the Result class when it creates the index, so the server doesn't care about it. This is simply here to make our lives easier.
Deploying indexes
I briefly mentioned earlier that it's typical to deploy indexes using IndexCreation.CreateIndexes or its async equivalent, IndexCreation.CreateIndexesAsync. These methods take an assembly and the document store you're using and scan the assembly for all the index classes. Then they create all the indexes they found in the database.
During development, it's often best to call one of these methods in your application startup. This way, you can modify an index and run the application, and the index definition is automatically updated for you. It also works great when you pull changes from another developer. You don't have to do anything to get the right environment set up.
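A minimal startup sketch of this pattern might look like the following. This assumes the standard RavenDB C# client API; the URL, database name and DocumentStoreHolder class are illustrative placeholders, not part of the book's sample code.

```csharp
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;

public static class DocumentStoreHolder
{
    public static IDocumentStore Store { get; } = CreateStore();

    private static IDocumentStore CreateStore()
    {
        var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" }, // assumption: local dev server
            Database = "Northwind"
        };
        store.Initialize();

        // Scan this assembly for all AbstractIndexCreationTask classes
        // and create (or update) their definitions on the server.
        IndexCreation.CreateIndexes(
            typeof(DocumentStoreHolder).Assembly, store);

        return store;
    }
}
```

Calling this once at application startup is enough; as discussed below, re-sending an unchanged index definition is a no-op on the server.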
Attempting to create an index that already exists on the server (same name and same index definition) is ignored and has no effect on the server or the cluster. So if nothing has changed in your indexes, the entire IndexCreation.CreateIndexes call does nothing at all. Only when there are changes to the indexes will it actually take effect.
Locking indexes
Sometimes you need to make a change to your index definition directly on your server. That's possible, of course, but you have to be aware that if you're using IndexCreation to automatically generate your indexes, the next time your application starts, it will reset the index definition to the original. That can be somewhat annoying because changing the index definition on the server can be a hotfix to solve a problem or introduce a new behavior, and the index reset will just make it go away, seemingly at random.
In order to handle this, RavenDB allows the option of locking an index. An index can be unlocked, locked or locked (error). In the unlocked mode, any change to the index is accepted; if the new index definition is different from the one stored on the server, the index is updated and the data is re-indexed using the new definition. In the locked mode, deploying a new index definition returns successfully but doesn't actually change anything on the server. And in the locked (error) mode, trying to change the index will raise an error.
Usually you'll just mark the index as locked, which will make the server ignore any changes to the index. The idea is that we don't want to break your calls to IndexCreation by throwing an error. Note that this is not a security measure. It's a way for the operations team to make a change to the index and prevent the application from mindlessly setting it back. Any user who can create an index can also modify the lock mode on the index.
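The lock mode can also be set from code, as a maintenance operation. The sketch below uses the RavenDB C# client's SetIndexesLockOperation; the index name and the IndexLocking wrapper class are illustrative.

```csharp
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Operations.Indexes;

public static class IndexLocking
{
    public static void LockIndex(IDocumentStore store, string indexName)
    {
        // LockedIgnore: further definition changes are silently ignored,
        // so IndexCreation.CreateIndexes won't undo an operational hotfix.
        store.Maintenance.Send(
            new SetIndexesLockOperation(indexName, IndexLockMode.LockedIgnore));

        // Other modes: IndexLockMode.Unlock (the default) and
        // IndexLockMode.LockedError (attempted changes raise an error).
    }
}
```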
Index creation is a cluster operation, which means that you can run the command against any node in the database group and RavenDB will make sure that it's created in all the database's nodes. The same also applies for automatic indexes. If the query optimizer decides that a query requires an index to be created, that index is going to be created in all the database instances, not just the one that processed this query.
This means that the experience of each node can be shared among all of them and you don't have to worry about a failover from a node that has already created the indexes you're using to one that didn't accept any queries yet. All of the nodes in a database group will have the same indexes.
Failure modes and external replication
Being a cluster operation means that index creation is reliable; it goes through the Raft protocol and a majority of the nodes must agree to accept it before it's acknowledged to the client. If, however, a majority of the nodes in the cluster are not reachable, the index creation will fail. This applies to both manual and automatic index creation in the case of network partition or majority failure. Index creation is rare, though, so even if there's a failure of this magnitude, it will not typically affect day-to-day operations.
External replication allows us to replicate data (documents, attachments, revisions, etc.) to another node that may or may not be in the same cluster. This is often used as a separate hot spare, offsite backup, etc. It's important to remember that external replication does not replicate indexes. Indexes are only sent as a cluster operation for the database group. This allows you to have the data replicated to different databases and potentially run different indexes on the documents.
There are other considerations to deploying indexes, especially in production. In the next section, we'll explore another side of indexing: how indexes actually work.
How do indexes do their work?
This section is the equivalent of popping the hood on a car and examining the engine. For the most part, you shouldn't have to do that, but it can be helpful to understand what is actually going on.
An index in RavenDB is composed of:
- The index definition and configuration options (Maps and Reduce, fields, spatial, full text, etc.).
- Data on disk (where we store the results of the indexing operation).
- Various caches for portions of the data, to make it faster to process queries.
- A dedicated index thread that does all the work for the index.
What's probably the most important from a user perspective is to understand how this all plays together. An index in RavenDB has a dedicated thread for all indexing work. This allows us to isolate any work being done to this thread and give the admin better accountability and control. In the Studio, you can go to Manage Server and then click Advanced, and you'll see the Threads Runtime Info. You can see a sample of that in Figure 12.3.
In Figure 12.3, you can see how much processing time is taken by the Orders/ByCompany indexing thread.
A dedicated thread per index greatly simplifies operational behaviors and allows us to apply several important optimizations. It means that no index can interfere with any other index. A slow index can only affect itself, instead of having a global effect on the system. It also simplifies the code and algorithms required for indexing because there's no need to write thread-safe code.
This design decision also allows RavenDB to prioritize tasks more easily. RavenDB uses thread priorities at the operating system level to hint what should be done first. Setting the index priority will affect the indexing thread priority at the operating system level. You can see how to change the index priority in Figure 12.4.
By default, RavenDB prioritizes request processing over indexing, so indexing threads start with a lower priority than request-processing threads. That means requests complete faster and indexes get to run when there's capacity for them (with the usual starvation prevention mechanisms). You can increase or lower the index priority and RavenDB will update the indexing thread accordingly.
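Besides the Studio, the priority can be changed from code with a maintenance operation. This sketch uses the RavenDB C# client's SetIndexesPriorityOperation; the wrapper class is illustrative.

```csharp
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Operations.Indexes;

public static class IndexPriorities
{
    public static void Prioritize(IDocumentStore store, string indexName)
    {
        // IndexPriority.High maps to a higher thread priority at the
        // operating system level; IndexPriority.Low makes the index
        // yield even more readily to request processing.
        store.Maintenance.Send(
            new SetIndexesPriorityOperation(indexName, IndexPriority.High));
    }
}
```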
RavenDB also uses this to set the I/O priority for requests generated by indexing. In this way, we can be sure that indexing will not overwhelm the system by taking too much CPU or saturating the I/O bandwidth we have available.
The last point is important because RavenDB's indexes are always built online, in conjunction with normal operations on the server. And this CPU and I/O priority scheme applies both to initial index creation and to the updates that follow each change to the data. We don't make distinctions between the two modes.
What keeps the indexing thread up at night?
When you create a new index, RavenDB will spawn a thread that will start indexing all the documents covered by the index. It will go through the documents in batches, trying to index as many of them in one go as it can, until all are indexed. What happens then?
The index will then go to sleep, waiting for a new or updated document in one of the collections that this index cares about. In other words, until there's such a document, the thread is not going to be runnable. It isn't going to compete for CPU time and takes very few resources from the system.
If the indexing thread detects that it's been idle for a while, it will actively work to release any resources it currently holds and then go back to sleep until it's woken by a new document showing up.
Indexing in batches
RavenDB typically needs to balance throughput vs. freshness when it comes to indexing. The bigger the batch, the faster documents get indexed. But we only see the updates to the index when we complete the batch. During initial creation, RavenDB typically favors bigger batches (as much as the available resources allow) and will attempt to index as many documents as it can at once.
After the index completes indexing all the documents it covers, it will watch for any new or updated documents and index them as soon as possible, without waiting for more updates to come. The typical indexing latency (the time between when a document updates and when the index has committed the batch including this document) is measured in milliseconds on most systems.
The query optimizer's capability to create new indexes on the fly depends on making sure the new index isn't breaking things while it's being built. Because of this requirement, RavenDB is very careful about resource allocations to indexing. We talked about CPU and I/O priorities, but there's also a memory budget applied. All in all, this has been tested in production for many years and has proven to be an extremely valuable feature.
The ability to deploy, in production, a new index (or a set of indexes) is key for operational agility. Otherwise, you'll have to schedule downtime whenever your application changes even the most minor of queries. This kind of flexibility is offered not just for new indexes but also for when you're updating existing ones.
Side by side
During development, you'll likely not notice the indexing update times. The amount of data you have in a development database is typically quite small, and the machine is not usually too busy in handling production traffic. In production, the opposite is true. There's a lot of data, and your machines are already busy doing their normal routine. This means that an index deploy can take a while.
Let's assume we have an index deploy duration (from the time it's created to the time it's done indexing all the relevant documents) of five minutes. An updated index definition can't just pick up from where the old index definition left off. For example, we might have added new fields to the index, so in addition to indexing new documents, we need to re-index all the documents that are already indexed. But if we have a five-minute period in which we're busy indexing, what will happen to queries made to the index during that time frame?
All index updates in RavenDB are done using the side-by-side strategy. Go to the Studio and update the Orders/Totals index by changing the Total field computation and save the document. Then immediately go to the indexes page. You should see something similar to what's shown in Figure 12.5.
Figure 12.5 shows an index midway through an update. But instead of deleting the old index and starting the indexing from scratch (which will impact queries), RavenDB keeps the old index around (for answering queries and indexing new documents) until the new version of the index has caught up and indexed everything.
This way, you can minimize the effects of updating an index in production. Once the updated version of the index has completed its work, it will automatically replace the old version. Of course, you can also force an immediate replacement if you really need to. (Swap now will do it.)
Auto indexes and the query optimizer
We talked about the query optimizer creating indexes on the fly several times, but now I want to shine a light on the kind of heuristics that the query optimizer uses and the logic that guides it.
At the most basic level, the query optimizer analyzes all the queries that don't specify an explicit index to use (anything that doesn't start with from index ... is fair game for the query optimizer). The query optimizer will attempt to find an index that can answer the query being asked, but if it fails to find any appropriate indexes, it will go ahead and create a new one.
One very important aspect is that the query optimizer isn't going to create an index blindly. Instead of only considering the current query when it's time to create a new index, the query optimizer is also going to weigh the history of the queries that were made against the database.
In other words, the logic that guides the query optimizer looks something like this:
- Is there an index that can match this query? If so, use that.
- If there's no such index, we need to create one.
- Let's take a look at all the queries that were made against the same collection as the one that's now being queried and see what would be the optimal index to answer all of these queries, including the new query.
- We need to create this new optimal index and wait for it to complete indexing.
- We should retire all the automatic indexes that have been created so far that are now covered by the new index.
The idea here is that RavenDB uses your queries as a learning opportunity to figure out more about the operational environment, and the query optimizer is able to use that knowledge when it needs to create a new index.
Over time, this means that we'll generate the optimal set of indexes to answer any query that doesn't use an explicit index. Furthermore, it means that operational changes, such as deploying a new version of your application with slightly different queries, will be met with equanimity by RavenDB. The query optimizer will recognize the new queries and figure out if they can use the existing indexes. If they can't, the optimizer will create a new index for them. All existing queries will continue to use the existing indexes until the new indexes are ready. Then they'll switch.
All of this will be done for you, without anyone needing to tell RavenDB what needs to be done or babysit it. The fact that index modifications are cluster-wide also means that all the nodes in the cluster will be able to benefit from this knowledge.
Importing and exporting indexes
RavenDB's ability to learn as it goes is valuable, but even so, you don't always want to do that kind of operation directly in production. If you have a large amount of data, you don't want to wait until you've already deployed your application to production for RavenDB to start learning about the kind of queries it's going to generate. During the learning process, there might be several paths taken that you want to skip.
You can run your application in a test environment, running a set of load tests and making the application issue all its queries to your test RavenDB instance. That instance will apply the same logic and create the optimal set of indexes to answer the kind of queries it saw.
You can now export that knowledge from the test machine and import it into the production cluster. The new indexes will be built, and by the time you're ready to actually deploy your application to production, all the work has already been done and the indexes are ready for the new queries in the updated version of your application.
Let's see how that can work, shall we? In the Studio, go to Settings and then to Export Database. Ensure that only the Include Indexes option is selected and click the Export Database button. You can see what this looks like in Figure 12.6.
You can then take the resulting file and import it into the production instance (Settings and then Import Database), and the new indexes will be created. The query optimizer will then take them into account when it needs to decide which index is going to handle which query.
Indexing and querying performance
When it comes time to understand what's going on with your indexes, you won't face a black box. RavenDB tracks and externalizes a lot of information about the indexing processes and makes it available to you, mostly via the Studio in Indexes and then the Indexing Performance page. You can see a sample of what it looks like when the system is indexing in Figure 12.7.
The timeline view in Figure 12.7 shows several indexes running concurrently (and independently). (Solid colors mean the index batch is complete, stripes mean this is an actively executing index.) And you can hover over each of the steps to get more information, such as the number of documents indexed or the indexing rate, as shown in Figure 12.8.
This graph can be very useful for investigating what exactly is going on inside RavenDB without having to look through a pile of log files. For example, look at the thread details that we previously discussed (see Figure 12.3 for what this looks like in the Studio) and notice that a particular indexing thread is using a lot of CPU time.
You can go into the Indexing Performance window and simply look at what's taking so much time. For example, you may be using the "Suggestions" feature, which can be fairly compute-intensive with high update rates. An example of this is shown in Figure 12.8, where you can see the exact costs of suggestions during indexing.
Figure 12.8 shows a fairly simple example, but the kind of details exposed in the timeline can give you a better idea of what exactly is going on inside RavenDB. As part of ongoing efforts to be a database that's actively trying to help the operations team, RavenDB is externalizing all such decisions explicitly. I encourage you to look at each of these boxes. The tooltips reveal a lot of what's going on, and this kind of view should quickly give you a feeling about how much things should cost. That way, you can recognize when things are out of whack if you are exploring some issue.
Having a good idea of what's going on during indexing is just half the job. We also need to be able to monitor the other side: what's going on when we query the database. RavenDB actively monitors such actions and will bring it to the operator's attention when there are issues, as shown in Figure 12.9.
Figure 12.9 shows the large result set alert, generated when a query returns a very large number of results while not using streaming. (Streaming queries were discussed in Chapter 4.) This can lead to higher memory utilization on both client and server and is considered bad practice. RavenDB will alert you to this issue and provide the exact time and the query that caused it so you can fix the problem.
In the same vein, very slow queries are also made explicitly visible to the operators because they're something they probably need to investigate. There are other operational conditions that RavenDB monitors and will bring to your attention — anything from slow disk I/O to running out of disk space to network latency issues. We'll discuss alerts and monitoring in RavenDB in much more depth in the next part of the book, so I'll save it till then.
Error handling in indexing
Sometimes, your index runs into an error. RavenDB actually goes to great lengths to avoid that. Property access inside the index will propagate nulls transitively. In other words, you can write the index shown in Listing 12.21 and you won't get a NullReferenceException.
Listing 12.21 Accessing a null 'manager' instance will not throw an exception
public class Employees_Managers
: AbstractIndexCreationTask<Employee>
{
public Employees_Managers()
{
Map = emps =>
from e in emps
let manager = LoadDocument<Employee>(e.ReportsTo)
select new
{
Name = e.FirstName + " " + e.LastName,
HasManager = manager != null,
Manager = manager.FirstName + " " + manager.LastName
};
}
}
The employees/2-A document has null as the value of ReportsTo. What do you think will happen when the index shown in Listing 12.21 is busy indexing this document? LoadDocument will return null (because the document ID it got was null), and the value of HasManager is going to be false because there's no manager for employees/2-A.
However, just one line below, we access the manager instance, which we know is null.
Usually, such an operation will throw a NullReferenceException. RavenDB, however, rewrites all references so they use null propagation. The actual mechanism by which this is done is a bit complex and out of scope for this topic, but you can imagine that RavenDB actually uses Manager = manager?.FirstName + " " + manager?.LastName everywhere. Did you notice the ?. usage? It means "if the value is null, return null; otherwise, access the property."
In this way, a whole class of common issues is simply averted. On the other hand, the index will contain a name for a manager for employees/2-A. It will be " " because the space is always concatenated with the values, and null concatenated with a string is the string.
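That concatenation behavior is plain C# semantics, which you can verify outside of RavenDB with a tiny standalone program (the class name here is just for illustration):

```csharp
using System;

public static class NullConcatDemo
{
    public static void Main()
    {
        string first = null, last = null;

        // String concatenation treats a null operand as the empty string,
        // so the result is just the literal space in the middle.
        string name = first + " " + last;

        Console.WriteLine(name == " "); // True
    }
}
```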
Some kinds of errors don't really let us recover. Consider the code in Listing 12.22. The index itself isn't very interesting, but we have an int.Parse call there on the PostalCode property.
Listing 12.22 Parsing a UK PostalCode as int will throw an exception
public class Employees_PostalCode
: AbstractIndexCreationTask<Employee>
{
public Employees_PostalCode()
{
Map = emps =>
from e in emps
select new
{
Name = e.FirstName + " " + e.LastName,
Postal = int.Parse(e.Address.PostalCode)
};
}
}
The PostalCode property in the sample data set is numeric for employees from Seattle and alphanumeric for employees in London. This means that for about half of the documents in the relevant collection, this index is going to fail. Let's see how RavenDB behaves in such a case. Figure 12.10 shows how this looks in the Studio.
We can see that the index as a whole is marked as errored. We'll ignore that for the moment and focus on the Index Errors page. If you click on it, you'll see a list of the errors that happened during indexing. You can click on the eye icon to see the full details. In this case, the error is "Failed to execute mapping function on employees/5-A. Exception: System.FormatException: Input string was not in a correct format. ... System.Number.ParseInt32 ..."
There are two important details in that error message: we know what document caused this error and we know what the issue is. These details make it easy to figure out what the problem is. Indeed, looking at employees/5-A, we can see that the value of the PostalCode property is "SW1 8JR". It's not really something that int.Parse can deal with.
So the indexing errors give us enough information to figure out what happened. That's great. But what about the state of the index? Why is it marked as errored? The easiest way to answer that question is to query the index and see what kind of error RavenDB returns. Executing the query from index 'Employees/PostalCode' will give us this error: "Index 'Employees/PostalCode' is marked as errored. Index Employees/PostalCode is invalid, out of 9 map attempts, 4 has failed. Error rate of 44.44% exceeds allowed 15% error rate."
Now things become much clearer. An index is allowed to fail processing only some documents. Because of the dynamic nature of documents in RavenDB, you may get such failures. However, allowing such failures to go unattended is dangerous. An error in indexing a document means that this particular document is not indexed. That may seem like a tautology, but it has important operational implications. If the document isn't indexed, you aren't going to see it in the results. It is "gone".
While the indexing error is intentionally very visible, if you're running in an unattended mode, which is common, it may be a while before your users' complaints of "I can't find that record" make you check the database. What would be worse is if you had some change in the application or behavior that caused all new documents to fail to index. Because of that, an index is only allowed a certain failure rate. We'll mark the entire index as errored when that happens.
An index in an error state cannot be queried and will return an immediate error (similar to the error text above) with an explanation of what's going on. With an explicit error, it's much easier to figure out what's wrong and then fix it.
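From the client, that failure surfaces as an exception when the query executes. A sketch of catching and logging it (the exact exception type can vary by client version, so this catches the general RavenException from the C# client; the wrapper class is illustrative):

```csharp
using System;
using System.Collections.Generic;
using Raven.Client.Documents;
using Raven.Client.Exceptions;

public static class ErroredIndexQuery
{
    public static void TryQuery(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            try
            {
                // Querying an errored index fails immediately.
                List<object> results = session.Advanced
                    .RawQuery<object>("from index 'Employees/PostalCode'")
                    .ToList();
            }
            catch (RavenException e)
            {
                // The message includes the failure rate that tripped
                // the index into the errored state.
                Console.WriteLine(e.Message);
            }
        }
    }
}
```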
Summary
We started this chapter by discussing index deployments, from the baseline of defining indexes using strongly typed classes to the ease of use of IndexCreation.CreateIndexes to create all those indexes on the database. We re-implemented many features and scenarios that we already encountered, but this time we implemented them from the client's code perspective. Building indexes using Linq queries is an interesting experience. We started from simple indexes and MapReduce indexes with AbstractIndexCreationTask<T> and then moved to multimap and MultimapReduce indexes with the AbstractMultiMapIndexCreationTask<T> base class.
We explored how to query RavenDB from the client side, starting with the simplest of Linq queries and building toward more flexibility with some more complex queries. With both Linq queries and the strong typed indexes, we talked about the fact that RavenDB isn't actually aware of your client-side types, nor does it really care about them.
All the work done to make strongly typed indexes and queries on the client side is purely there so you'll have good compiler, IntelliSense and refactoring support inside your application. In the end, both queries and indexes are turned into RQL strings and sent to the server.
We looked at how we can directly control the IndexDefinition sent to the server, giving us absolute power to play with and modify any option that we wish. This can be done by using the non-generic AbstractIndexCreationTask class and implementing the CreateIndexDefinition() method.
In a similar sense, all the queries we run are just fancy ways to generate RQL queries. We looked into all sorts of different ways of using RQL queries in your applications: from using RQL directly by calling RawQuery (and remembering to pass parameters only through AddParameter), to poking holes in Linq queries using the RavenQuery.Raw method, to using a CustomFunction to take complete control over the projection when using DocumentQuery.
Following the discussion on managing the indexes, we looked into how indexes are deployed on the cluster (as a reliable cluster operation, with a majority consensus using Raft) and what this means (they're not available for external replication and they require a majority of the nodes to be reachable to create/modify an index).
We dived into the execution environment of an index and the dedicated thread that RavenDB assigns to it. Such a thread makes managing an index simpler because it gives us a scope for prioritizing CPU and I/O operations, as well as defines a memory budget to control how much RAM this index will use. This is one of the key ways that RavenDB is able to implement online index building. Being able to limit the amount of system resources that an index is using is crucial to ensure that we aren't overwhelming the system and hurting ongoing operations.
The process of updating an index definition got particular attention since this can be of critical importance in production systems. RavenDB updates indexes in a side-by-side manner. The old index is retained (and can even index new updates) while the new index is being built. Once the building process is done, the old index is removed in favor of the new one in an atomic fashion.
We briefly looked at the query optimizer, not so much to understand what it's doing but to understand what it means. The query optimizer routinely analyzes all queries and is able to create indexes on the fly, but the key aspect of that is that it uses that information to continuously optimize the set of indexes you have. After a while, the query optimizer will produce the optimal set of indexes for the queries your application generates.
You can even run a test instance of your application to teach a RavenDB node about the kind of queries it should expect and then export that knowledge to production ahead of your application deployment. In this way, RavenDB will prepare itself ahead of time for the new version and any changes in behavior that it might have.
We then moved to performance and monitoring. RavenDB exposes a lot of details about how it indexes documents in an easy-to-consume manner, using the Indexing Performance page, and it actively monitors queries for bad practices, such as queries that return too many results or are very slow. The result of this level of monitoring is that the operations team is made aware of issues that they might want to take into account and resolve, even if they aren't currently critical.
We want to head things off as soon as possible, after all, and not wait until the sky has fallen to start figuring out that there were warning signs all along the way. At the same time, these alerts aren't going to spam your operations team. That kind of behavior builds tolerance to any kind of alerts because they effectively become noise.
We closed the chapter with a discussion of error handling. How does RavenDB handle indexing errors? How are they made visible to the operators, and what kind of behavior can you expect from the system? RavenDB will tolerate some level of errors from the index, but if there are too many indexing issues, it will decide that's not acceptable and mark the whole index as failing, resulting in any query against this index throwing an exception.
This chapter has marked the end of theory and high-level discussion and moved toward a more practical discussion on how to operate RavenDB. In the next part of the book, we're going to focus on exactly that: the care and feeding of a RavenDB cluster in production. In other words, operations, here we come.