Indexes: Analyzers
- RavenDB uses indexes to facilitate fast queries powered by Lucene, the full-text search engine.
- The indexing of a single document starts with creating a Lucene Document according to an index definition. Lucene processes it by breaking it into fields and splitting the text of each field into tokens (or terms) in a process called tokenization. These tokens are kept in the index and can later be searched.
  The tokenization process uses an object called an Analyzer.
- The indexing process and its results can be controlled by various field options and by the Analyzers.
Understanding Analyzers
Lucene offers several Analyzers out of the box.
New customized analyzers can also be created.
Various Analyzers differ in the way they split the text stream ("tokenize"),
and in the way they process those tokens in post-tokenization.
The examples below use the following text:
The quick brown fox jumped over the lazy dogs, Bob@hotmail.com 123432.
Analyzers that remove common "Stop Words":
Stop words (e.g. the, it, a, is, this, who, that...) are often removed to narrow search results by including only words that are used less frequently.
If you want to include words such as IT (Information Technology), be aware that these analyzers will recognize IT as one of the stop words and remove it from searches. This can affect other acronyms such as WHO (World Health Organization) or names such as "The Who" or "The IT Crowd".
To prevent excluding acronyms, you can either spell out the entire title instead of abbreviating it or use an analyzer that doesn't remove stop words.
- StandardAnalyzer, which is Lucene's default, will produce the following tokens:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]
Removes common "stop words".
Separates on whitespace and punctuation that is followed by whitespace - a dot that is not followed by whitespace is considered part of the token.
Converts to lower-case letters so that searches aren't case-sensitive.
Email addresses and internet hostnames are one token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- StopAnalyzer will work similarly, but will not perform light stemming and will tokenize on all non-letter characters:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs] [bob] [hotmail] [com]
Removes numbers and symbols, and treats them as token separators.
This means that email and web addresses are split into several tokens.
Removes common "stop words".
Separates on white spaces.
Converts to lower-case letters so that searches aren't case sensitive.
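The acronym caveat described above can be illustrated with a minimal sketch in plain C#. It does not use Lucene; it only mimics the lower-casing and stop-word removal steps. The class name StopWordSketch and the abbreviated stop-word list are assumptions made for this illustration and are not Lucene's actual list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordSketch
{
    // Hypothetical, abbreviated stop-word list used only for this illustration.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "it", "a", "is", "this", "who", "that" };

    // Split on spaces, lower-case each token, then drop the stop words.
    public static List<string> Analyze(string text)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.ToLowerInvariant())
            .Where(token => !StopWords.Contains(token))
            .ToList();
    }
}
```

With this sketch, StopWordSketch.Analyze("The IT Crowd") keeps only the token crowd, and StopWordSketch.Analyze("The Who") yields no tokens at all, which is why such titles cannot be found through an analyzer that removes stop words.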
Analyzers that do not remove common "Stop Words"
- SimpleAnalyzer will tokenize on all non-alpha characters and will make all the tokens lowercase:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] [bob] [hotmail] [com]
Includes common "stop words".
Removes numbers and symbols, and treats them as token separators.
This means that email and web addresses are split into several tokens.
Separates on white spaces.
Converts to lower-case letters so that searches aren't case sensitive.
- WhitespaceAnalyzer will just tokenize on white spaces:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs,] [Bob@hotmail.com] [123432.]
Only separates on whitespaces.
Preserves upper/lower cases in text, which means that searches will be case-sensitive.
Email and web addresses, phone numbers, and other such forms of ID are kept whole.
- KeywordAnalyzer will perform no tokenization, and will consider the whole text as one token:
[The quick brown fox jumped over the lazy dogs, Bob@hotmail.com 123432.]
Preserves upper/lower cases in text for case-sensitive searches.
Useful in situations like IDs and codes where you do not want to separate into multiple tokens.
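The differences between these three analyzers can be approximated in plain C# without Lucene. The sketch below is only an imitation of the splitting rules described above, not the analyzers' actual implementation, and the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizationSketch
{
    // Approximates SimpleAnalyzer: split on every run of non-letter characters,
    // then lower-case each token.
    public static string[] Simple(string text) =>
        Regex.Split(text, @"[^\p{L}]+")
             .Where(token => token.Length > 0)
             .Select(token => token.ToLowerInvariant())
             .ToArray();

    // Approximates WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    public static string[] Whitespace(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Approximates KeywordAnalyzer: the whole text is a single, unchanged token.
    public static string[] Keyword(string text) => new[] { text };
}
```

Applied to the sample text, Simple returns the twelve lower-cased word tokens listed for SimpleAnalyzer, Whitespace returns eleven tokens including dogs, and 123432. with their punctuation, and Keyword returns the text as one token.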
Analyzers that tokenize according to the defined number of characters
- NGramAnalyzer will tokenize on predefined token lengths, 2-6 chars long, which are defined by the Indexing.Analyzers.NGram.MinGram and Indexing.Analyzers.NGram.MaxGram configuration options:
[.c] [.co] [.com] [12] [123] [1234] [12343] [123432] [23] [234] [2343] [23432] [32] [34] [343] [3432] [43] [432] [@h] [@ho] [@hot] [@hotm] [@hotma] [ai] [ail] [ail.] [ail.c] [ail.co] [az] [azy] [b@] [b@h] [b@ho] [b@hot] [b@hotm] [bo] [bob] [bob@] [bob@h] [bob@ho] [br] [bro] [brow] [brown] [ck] [co] [com] [do] [dog] [dogs] [ed] [er] [fo] [fox] [gs] [ho] [hot] [hotm] [hotma] [hotmai] [ic] [ick] [il] [il.] [il.c] [il.co] [il.com] [ju] [jum] [jump] [jumpe] [jumped] [l.] [l.c] [l.co] [l.com] [la] [laz] [lazy] [ma] [mai] [mail] [mail.] [mail.c] [mp] [mpe] [mped] [ob] [ob@] [ob@h] [ob@ho] [ob@hot] [og] [ogs] [om] [ot] [otm] [otma] [otmai] [otmail] [ov] [ove] [over] [ow] [own] [ox] [pe] [ped] [qu] [qui] [quic] [quick] [ro] [row] [rown] [tm] [tma] [tmai] [tmail] [tmail.] [ui] [uic] [uick] [um] [ump] [umpe] [umped] [ve] [ver] [wn] [zy]
You can override the NGram analyzer's default token lengths by configuring Indexing.Analyzers.NGram.MinGram and Indexing.Analyzers.NGram.MaxGram per index. For example, setting them to 3 and 4 respectively will generate:
[.co] [.com] [123] [1234] [234] [2343] [343] [3432] [432] [@ho] [@hot] [ail] [ail.] [azy] [b@h] [b@ho] [bob] [bob@] [bro] [brow] [com] [dog] [dogs] [fox] [hot] [hotm] [ick] [il.] [il.c] [jum] [jump] [l.c] [l.co] [laz] [lazy] [mai] [mail] [mpe] [mped] [ob@] [ob@h] [ogs] [otm] [otma] [ove] [over] [own] [ped] [qui] [quic] [row] [rown] [tma] [tmai] [uic] [uick] [ump] [umpe] [ver]
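To see how the minimum and maximum lengths shape the output, here is a minimal plain C# sketch that produces the n-grams of a single token. It is not RavenDB's NGramAnalyzer: the class name NGramSketch is hypothetical, the minGram and maxGram parameters merely mirror the two configuration options, and the tokenization that happens before n-grams are produced is not modelled.

```csharp
using System;
using System.Collections.Generic;

public static class NGramSketch
{
    // Returns every substring of the token whose length is between minGram and maxGram,
    // ordered by start position and then by length.
    public static List<string> Grams(string token, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (int start = 0; start < token.Length; start++)
        {
            for (int length = minGram;
                 length <= maxGram && start + length <= token.Length;
                 length++)
            {
                grams.Add(token.Substring(start, length));
            }
        }
        return grams;
    }
}
```

For the token lazy, Grams("lazy", 2, 6) yields la, laz, lazy, az, azy, zy, while Grams("lazy", 3, 4) yields only laz, lazy, azy. The same terms appear in the two listings above, which are shown in sorted order.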
Full-Text Search
To allow full-text search on the text fields, you can use the analyzers provided out of the box with Lucene.
These are available as part of the Lucene library which ships with RavenDB.
For most cases, Lucene's StandardAnalyzer would be your analyzer of choice. As shown above, this analyzer is aware of e-mail and network addresses when tokenizing. It normalizes case, filters out common English words, and does some basic English stemming as well.
For languages other than English, or if you need a custom analysis process, you can roll your own Analyzer. It is quite simple, and a suitable one may already be available as a contrib package for Lucene.
There are also Collation analyzers available.
Selecting an Analyzer for a Field
To index a document field using a specific analyzer, all you need to do is to match it with the field's name:
public class BlogPosts_ByTagsAndContent : AbstractIndexCreationTask<BlogPost>
{
    public BlogPosts_ByTagsAndContent()
    {
        Map = posts => from post in posts
                       select new
                       {
                           post.Tags,
                           post.Content
                       };

        // Field Tags will be tokenized by the SimpleAnalyzer
        Analyzers.Add(x => x.Tags, "SimpleAnalyzer");

        // Field Content will be tokenized by the custom analyzer SnowballAnalyzer
        Analyzers.Add(x => x.Content, typeof(SnowballAnalyzer).AssemblyQualifiedName);
    }
}
store.Maintenance.Send(new PutIndexesOperation(new IndexDefinitionBuilder<BlogPost>("BlogPosts/ByTagsAndContent")
{
    Map = posts => from post in posts
                   select new
                   {
                       post.Tags,
                       post.Content
                   },
    Analyzers =
    {
        {x => x.Tags, "SimpleAnalyzer"},
        {x => x.Content, typeof(SnowballAnalyzer).AssemblyQualifiedName}
    }
}.ToIndexDefinition(store.Conventions)));
Customized analyzer availability
The analyzer you are referencing has to be available to the RavenDB server instance. When using analyzers that do not come with the default Lucene.NET distribution, you need to drop all the necessary DLLs into the RavenDB working directory (where the Raven.Server executable is located), and use their fully qualified type name (including the assembly name).
Creating Custom Analyzers
You can create a custom analyzer on your own and deploy it to the RavenDB server. To do that, perform the following steps:
- create a class that inherits from the abstract Lucene.Net.Analysis.Analyzer class (you need to reference the Lucene.Net.dll supplied with the RavenDB Server package),
- place your DLL next to RavenDB's binaries (note that it needs to be compatible with .NET Core 2.0, e.g. a .NET Standard 2.0 assembly),
- specify the analyzer's fully qualified name for each indexing field that is going to be tokenized by it.
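using System;
using System.IO;
using Lucene.Net.Analysis;

public class MyAnalyzer : Lucene.Net.Analysis.Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Implementation omitted: build and return your tokenizer and filter chain here.
        throw new NotImplementedException();
    }
}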
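As an illustration of what such a class might contain, the following sketch chains a whitespace tokenizer with a lower-casing filter. It assumes the Lucene.Net.dll shipped with RavenDB exposes WhitespaceTokenizer and LowerCaseFilter as in Lucene.Net 3.0.3. The class name LowerCaseWhitespaceAnalyzer is hypothetical, and this is one possible analyzer, not a recommended one.

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical analyzer used only for illustration:
// splits on whitespace and lower-cases every token.
public class LowerCaseWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize on whitespace only, so e-mail addresses and IDs stay whole.
        TokenStream stream = new WhitespaceTokenizer(reader);

        // Lower-case the tokens so that searches are not case-sensitive.
        return new LowerCaseFilter(stream);
    }
}
```

Once compiled and placed next to the server binaries, such an analyzer would be referenced from an index by its assembly-qualified name, in the same way SnowballAnalyzer is referenced in the earlier example.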
RavenDB's Default Analyzers
RavenDB has three default analyzers that it uses to index text when no other analyzer was specified:
- Default Analyzer - LowerCaseKeywordAnalyzer
- Default Exact Analyzer - KeywordAnalyzer
- Default Search Analyzer - RavenStandardAnalyzer
You can choose other analyzers to serve as your default analyzers by modifying the indexing configuration.
Default Analyzer
For regular text fields, RavenDB uses a custom analyzer called LowerCaseKeywordAnalyzer. Its implementation behaves like Lucene's KeywordAnalyzer, but it also performs case normalization by converting all characters to lower case. That is, RavenDB stores the entire text field as a single token, in a lowercased form. Given the same sample text above, LowerCaseKeywordAnalyzer will produce a single token:
[the quick brown fox jumped over the lazy dogs, bob@hotmail.com 123432.]
Default Exact Analyzer
For 'exact case' text fields, RavenDB uses Lucene's KeywordAnalyzer, which treats the entire text field as one token and does not change the case of the original text. To make an index store text with the exact case, see the section on changing field indexing behavior below.
Default Search Analyzer
For full-text search text fields, RavenDB uses RavenStandardAnalyzer, which is just an optimized version of Lucene's StandardAnalyzer. To make an index that allows full-text search, see the section on changing field indexing behavior below.
Manipulating Field Indexing Behavior
By default, each indexed field is analyzed using the LowerCaseKeywordAnalyzer, which indexes a field as a single, lowercased term.
This behavior can be changed by setting the FieldIndexing option for a particular field. The possible values are:
FieldIndexing.Exact
FieldIndexing.Search
FieldIndexing.No
Setting the FieldIndexing option for a field to Exact turns off field analysis. The entire field value is then treated as a single token and matches must be exact (case-sensitive), using the KeywordAnalyzer behind the scenes.
public class Employees_ByFirstAndLastName : AbstractIndexCreationTask<Employee>
{
    public Employees_ByFirstAndLastName()
    {
        Map = employees => from employee in employees
                           select new
                           {
                               LastName = employee.LastName,
                               FirstName = employee.FirstName
                           };

        Indexes.Add(x => x.FirstName, FieldIndexing.Exact);
    }
}
FieldIndexing.Search allows performing full-text search operations against the field, using the RavenStandardAnalyzer (RavenDB's optimized version of the StandardAnalyzer) by default:
public class BlogPosts_ByContent : AbstractIndexCreationTask<BlogPost>
{
    public BlogPosts_ByContent()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title,
                           Content = post.Content
                       };

        Indexes.Add(x => x.Content, FieldIndexing.Search);
    }
}
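Once such an index is deployed, the field can be queried with the client's Search method. The sketch below assumes the BlogPost class and the BlogPosts_ByContent index from the example above are available. The class name BlogSearchSketch and the method name FindPostsMentioning are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Session;

public static class BlogSearchSketch
{
    // Returns the posts whose Content field contains any of the given search terms.
    // The terms are analyzed with the same analyzer that was used to index the field.
    public static List<BlogPost> FindPostsMentioning(IDocumentStore store, string terms)
    {
        using (IDocumentSession session = store.OpenSession())
        {
            return session
                .Query<BlogPost, BlogPosts_ByContent>()
                .Search(post => post.Content, terms)
                .ToList();
        }
    }
}
```

Because the search terms pass through the same analysis as the indexed text, a call such as FindPostsMentioning(store, "Quick FOX") would be expected to match the sample text regardless of case.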
If you want to disable indexing on a particular field, use the FieldIndexing.No option. This can be useful when you want to store field data in the index, but don't want to make it available for querying. However, it will still be available for extraction by projections:
public class BlogPosts_ByTitle : AbstractIndexCreationTask<BlogPost>
{
    public BlogPosts_ByTitle()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title,
                           Content = post.Content
                       };

        Indexes.Add(x => x.Content, FieldIndexing.No);
        Stores.Add(x => x.Content, FieldStorage.Yes);
    }
}
Ordering When a Field is Searchable
When a field is marked as Search, it is tokenized into multiple terms, so sorting must be done using an additional field that holds the same value without full-text analysis.
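One way to provide such an additional field is to index the same property twice, once as a searchable field and once with the default analyzer for sorting. The sketch below is an assumption-laden illustration: the minimal BlogPost class, the index name BlogPosts_ByTitleSearchable, the TitleForSorting field and the FindSorted helper are all hypothetical names introduced here.

```csharp
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Session;

// Hypothetical minimal document class for this illustration.
public class BlogPost
{
    public string Id { get; set; }
    public string Title { get; set; }
}

public class BlogPosts_ByTitleSearchable
    : AbstractIndexCreationTask<BlogPost, BlogPosts_ByTitleSearchable.Result>
{
    public class Result
    {
        public string Title { get; set; }
        public string TitleForSorting { get; set; }
    }

    public BlogPosts_ByTitleSearchable()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title,           // full-text searchable
                           TitleForSorting = post.Title  // single term, used for ordering
                       };

        Indexes.Add(x => x.Title, FieldIndexing.Search);
    }
}

public static class SortedSearchSketch
{
    // Searches on the analyzed field and orders by the non-analyzed copy.
    public static List<BlogPost> FindSorted(IDocumentStore store, string terms)
    {
        using (IDocumentSession session = store.OpenSession())
        {
            return session
                .Query<BlogPosts_ByTitleSearchable.Result, BlogPosts_ByTitleSearchable>()
                .Search(x => x.Title, terms)
                .OrderBy(x => x.TitleForSorting)
                .OfType<BlogPost>()
                .ToList();
        }
    }
}
```

TitleForSorting keeps the default analyzer, so each title is indexed as one lowercased term and ordering follows the whole title rather than its individual words.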