Table of contents
Introduction
Diacritical marks change the sound value of letters in many Latin-alphabet languages. This can make searching more difficult: users often omit diacritics when typing a query, yet the accented word is still relevant. It is not uncommon to replace a diacritic with the closest plain Latin letter, e.g. Ó => O or Ü => U. You may want a query term like Krakow to match not only the exact term Krakow but also Kraków. The methods described below work for both Lucene and Corax indexes.
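To make the folding idea concrete, here is a minimal sketch in plain .NET (not RavenDB-specific, and the helper name RemoveDiacritics is our own): decompose the string with string.Normalize and drop the combining accent marks.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

static string RemoveDiacritics(string text)
{
    // FormD decomposes characters like 'ó' into 'o' + a combining accent mark.
    string decomposed = text.Normalize(NormalizationForm.FormD);

    // Drop the combining marks, keep the base letters, then recompose.
    var chars = decomposed.Where(c =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

    return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
}

// RemoveDiacritics("Kraków") => "Krakow"
```

Note that this handles decomposable characters (ó, ü, ę, …) but not letters like ø or ß, which have no combining-mark decomposition; that is one reason a dedicated filter such as Lucene's ASCIIFoldingFilter (shown later) is more robust.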
Approaches
Diacritics normalization
Queries in RavenDB are always performed on an index, by comparing search terms against index entries. Hence, the first thing you might try is to populate the index with normalized entries:
from doc in docs.Cities
select new
{
    Name = doc.Name.Replace("Ü", "U").Replace("Ó", "O").[...]
}
Index(x => x.Name, FieldIndexing.Search);
Code that replaces diacritics can also be defined as an additional source, which allows for more complex logic and increases the readability of the index definition.
This approach becomes problematic when dealing with multiple languages and requires a lot of effort. Since the normalization is done manually in the index definition, you also have to take care of it when querying:
session.Query<Cities, SearchIndex>()
.Where(x => x.Name == searchedPhrase.Replace("Ü", "U").Replace("Ó", "O"));
As you can see, this approach requires a lot of effort to get it right.
Custom analyzer
During indexing, RavenDB uses analyzers to transform your data into a format useful for querying. Overall, we can treat an analyzer as a text pipeline that splits your sentences (the data) into tokens (this is called tokenization), then processes each token with filters (e.g. removing stop words) and transformations (e.g. lowercasing the term) before storing it in the index.
RavenDB ships with a couple of built-in analyzers, such as the Full-Text-Search analyzer (also known as RavenStandardAnalyzer) and LowercaseAnalyzer. However, we also give users the possibility to define their own analyzers to deal with non-standard requirements; these are called custom analyzers. To learn more, you can visit our documentation page about using analyzers.
Let's start at the very beginning. To define an analyzer in your C# project, you need a reference to Lucene.Net.dll (version 3.0.3), since all analyzers inherit from Lucene.Net.Analysis.Analyzer. If you cannot add the library as a reference, you can always define the class as a string for deployment purposes (or do it via RavenDB Studio).
public class LanguageAnalyzer : Lucene.Net.Analysis.Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // [...]
        return tokenStream;
    }
}
Since we're dealing with search, we may want to take advantage of the standard analyzer rather than implement everything from scratch, so our analyzer can inherit from StandardAnalyzer instead of Lucene.Net.Analysis.Analyzer. StandardAnalyzer requires a Lucene version passed to its constructor to match the underlying behavior. Currently, we recommend using Lucene.Net.Util.Version.LUCENE_30, since RavenDB's default full-text search also uses it.
The StandardAnalyzer performs tokenization (splitting your sentence into words) and also removes irrelevant words (stop words from the English dictionary).
The Lucene library contains a filter that converts non-ASCII diacritical characters into their nearest ASCII equivalents, available as Lucene.Net.Analysis.ASCIIFoldingFilter. Since we only want to fold the analyzer's stream, the TokenStream method will look like this:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace Analyzer;

public class LanguageAnalyzer : StandardAnalyzer
{
    public LanguageAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_30)
    {
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Wrap the standard token stream so every term is folded to ASCII.
        return new ASCIIFoldingFilter(base.TokenStream(fieldName, reader));
    }
}
The final step is to attach the analyzer to a field in your index. Since there are multiple ways of declaring it, the examples below go in the constructor of the index (AbstractIndexCreationTask&lt;T&gt;).
// Analyzer as a class:
Analyzers.Add(x => x.Name, typeof(LanguageAnalyzer).AssemblyQualifiedName);

// Or by name:
string analyzerName = "LanguageAnalyzer";
Analyzers.Add(x => x.Name, analyzerName);
To learn more about how to attach an analyzer, visit our docs.
Deployment via operation
Analyzer deployment via operation has to happen before deploying any index that uses the analyzer. Otherwise, the index deployment will fail with compilation errors in the database.
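As a sketch of that deployment (assuming the RavenDB 5.2+ client API, and that store is your already-initialized IDocumentStore), the operation sends the analyzer's C# source code to the database:

```csharp
using System.IO;
using Raven.Client.Documents.Operations.Analyzers;

// Read the analyzer's source code; the Name must match the class name
// referenced in the index definition ("LanguageAnalyzer").
string analyzerSource = File.ReadAllText("LanguageAnalyzer.cs");

store.Maintenance.Send(new PutAnalyzersOperation(new AnalyzerDefinition
{
    Name = "LanguageAnalyzer",
    Code = analyzerSource
}));
```

Only after this operation succeeds should the index that references LanguageAnalyzer be deployed.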
Finally, after following all the steps above, we get our term normalized.

As you can see, in the Name field the characters with diacritics have been folded to plain ASCII.
Complete index definition:
public class MyIndex : AbstractIndexCreationTask&lt;Dto&gt;
{
    public MyIndex()
    {
        Map = dtos => dtos.Select(x => new
        {
            OriginalName = x.Name,
            Name = x.Name
        });

        Analyze(x => x.Name, "LanguageAnalyzer");
        Index("OriginalName", FieldIndexing.Search);
    }
}
Lowercased term without diacritics
In the example above I used StandardAnalyzer, which performs whitespace-based tokenization: it splits your text into words and creates an array of terms. If you want to treat the whole sentence as a single term, we suggest using KeywordAnalyzer as the base class instead.
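A sketch of that variant (the class name LanguageKeywordAnalyzer is our own choice): KeywordAnalyzer emits the entire input as a single token, and we wrap its stream in the same ASCIIFoldingFilter as before.

```csharp
using System.IO;
using Lucene.Net.Analysis;

namespace Analyzer;

// Treats the whole input as one term, then folds diacritics to ASCII,
// so "Stare Miasto, Kraków" becomes the single term "Stare Miasto, Krakow".
public class LanguageKeywordAnalyzer : KeywordAnalyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new ASCIIFoldingFilter(base.TokenStream(fieldName, reader));
    }
}
```

Note that KeywordAnalyzer does not lowercase, so exact-term matching stays case-sensitive unless you add a lowercase filter yourself.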
Querying
To have your custom analyzer applied to the query text, you have to use the search() method. The search engine applies the same analyzer that was used during indexing to the text passed in search(FieldName, queried_text). For example, when queried_text is Kraków, the searched term (after transformation) will be krakow.
For the other methods in the WHERE clause, RavenDB uses the default analyzer (lowercasing only). To learn more about the possibilities of this method, visit the search documentation.
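Putting it together, a query against the index defined above could look like the following sketch (assuming an open IDocumentSession named session, and the Dto and MyIndex types from earlier):

```csharp
using System.Linq;
using Raven.Client.Documents.Linq;

// Search() runs the field's analyzer over the query text, so "Kraków"
// is folded to "krakow" and matches documents stored with either spelling.
var results = session.Query<Dto, MyIndex>()
    .Search(x => x.Name, "Kraków")
    .ToList();
```

A plain Where(x => x.Name == "Kraków") would not go through the custom analyzer, which is why search() is required here.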
It is worth mentioning that since RavenDB 6.0, we have changed the way we handle search queries with wildcards. Queries that contain prefix and/or suffix wildcards are transformed with custom analyzers as well. That means a query like "mün*" will be analyzed before being sent to the search engine: the analyzer will remove all non-alphabetic characters ("mün*" => "mun") and will not perform a prefix search. You can learn more about it here.
Summary
Dealing with diacritics may be done manually in the application, as we saw in the first approach. That gives the user full flexibility over what to handle and how, but it requires care whenever the application is extended, because the normalization has to be repeated manually everywhere. Alternatively, users can define a custom analyzer and move the transformation layer into the database itself, without changing any data in the client application, and then forget about it during application development.
Woah, already finished? 🤯
If you found the article interesting, don’t miss a chance to try our database solution – totally for free!