Table of contents
Introduction
Diacritical marks change the sound value of letters in many Latin-alphabet languages. This can make searching more difficult: users often omit diacritics when typing a query, yet the accented word is still relevant. It is not uncommon to replace a diacritic with the closest plain Latin letter, e.g. Ó => O or Ü => U. You may want a query term like Krakow to match not only the exact term Krakow but also Kraków. The methods described below work for both Lucene and Corax indexes.
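To make the folding idea concrete, here is a minimal sketch in plain .NET (not RavenDB-specific, and the helper name RemoveDiacritics is our own): decompose the string with string.Normalize and drop the combining accent marks.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

static string RemoveDiacritics(string text)
{
    // FormD decomposes characters like 'ó' into 'o' + a combining accent mark.
    string decomposed = text.Normalize(NormalizationForm.FormD);

    // Drop the combining marks, keep the base letters, then recompose.
    var chars = decomposed.Where(c =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

    return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
}

// RemoveDiacritics("Kraków") => "Krakow"
```

Note that this handles decomposable characters (ó, ü, ę, …) but not letters like ø or ß, which have no combining-mark decomposition; that is one reason a dedicated filter such as Lucene's ASCIIFoldingFilter (shown later) is more robust.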
Approaches
Diacritics normalization
Queries in RavenDB are always performed on an index, by comparing search terms against index entries. Hence, the first thing you might try is to populate the index with normalized entries:
from doc in docs.Cities
select new
{
    Name = doc.Name.Replace("Ü", "U").Replace("Ó", "O").[...]
}
Index(x => x.Name, FieldIndexing.Search);
Code that replaces diacritics can also be defined as an additional source, which allows for more complex logic and increases the readability of the index definition.
This approach becomes problematic when dealing with multiple languages and requires a lot of effort. Since the normalization is done manually in the index definition, you also have to take care of it when querying:
session.Query<Cities, SearchIndex>()
.Where(x => x.Name == searchedPhrase.Replace("Ü", "U").Replace("Ó", "O"));
As you can see, this approach requires a lot of effort to get it right.
Custom analyzer
During indexing, RavenDB uses analyzers to transform your data into a format useful for querying. Overall, we can treat an analyzer as a text pipeline that splits your sentences (the data) into tokens (this is called tokenization), then processes each token with filters (e.g. removing stop words) and transformations (e.g. lowercasing the term) before storing it in the index.
RavenDB ships with a couple of built-in analyzers, such as the Full-Text-Search analyzer (also known as RavenStandardAnalyzer) and LowercaseAnalyzer. However, we also give users the possibility to define their own analyzers to deal with non-standard requirements; these are called custom analyzers. To learn more, you can visit our documentation page about using analyzers.
Let's start at the very beginning. To define an analyzer in your C# project, you need a reference to Lucene.Net.dll (version 3.0.3), since all analyzers inherit from Lucene.Net.Analysis.Analyzer. If you cannot add the library as a reference, you can always define the class as a string for deployment purposes (or do it via RavenDB Studio).
public class LanguageAnalyzer : Lucene.Net.Analysis.Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // [...]
        return tokenStream;
    }
}
Since we're dealing with search, we may want to take advantage of the standard analyzer rather than implement everything from scratch, so our analyzer can inherit from StandardAnalyzer instead of Lucene.Net.Analysis.Analyzer. StandardAnalyzer requires a Lucene version passed to its constructor to match the underlying behavior. Currently, we recommend using Lucene.Net.Util.Version.LUCENE_30, since RavenDB's default full-text search also uses it.
The StandardAnalyzer performs tokenization (splitting your sentence into words) and also removes irrelevant words (stop words from the English dictionary).
The Lucene library contains a filter that converts non-ASCII diacritical characters into their nearest ASCII equivalents, available as Lucene.Net.Analysis.ASCIIFoldingFilter. Since we only want to fold the analyzer's stream, the TokenStream method will look like this:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace Analyzer;

public class LanguageAnalyzer : StandardAnalyzer
{
    public LanguageAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_30)
    {
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Wrap the standard token stream so every term is folded to ASCII.
        return new ASCIIFoldingFilter(base.TokenStream(fieldName, reader));
    }
}
The final step is to attach the analyzer to a field in your index. Since there are multiple ways of declaring it, the examples below go in the constructor of the index (AbstractIndexCreationTask&lt;T&gt;).
// Analyzer as a class:
Analyzers.Add(x => x.Name, typeof(LanguageAnalyzer).AssemblyQualifiedName);

// Or by name:
string analyzerName = "LanguageAnalyzer";
Analyzers.Add(x => x.Name, analyzerName);
To learn more about how to attach an analyzer, visit our docs.
Deployment via operation
Analyzer deployment via operation has to happen before deploying any index that uses the analyzer. Otherwise, the index deployment will fail with compilation errors in the database.
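As a sketch of that deployment (assuming the RavenDB 5.2+ client API, and that store is your already-initialized IDocumentStore), the operation sends the analyzer's C# source code to the database:

```csharp
using System.IO;
using Raven.Client.Documents.Operations.Analyzers;

// Read the analyzer's source code; the Name must match the class name
// referenced in the index definition ("LanguageAnalyzer").
string analyzerSource = File.ReadAllText("LanguageAnalyzer.cs");

store.Maintenance.Send(new PutAnalyzersOperation(new AnalyzerDefinition
{
    Name = "LanguageAnalyzer",
    Code = analyzerSource
}));
```

Only after this operation succeeds should the index that references LanguageAnalyzer be deployed.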
Finally, after following all the steps above, we get our term normalized.

As you can see, in the Name field the characters with diacritics have been folded to plain ASCII.
Complete index definition:
public class MyIndex : AbstractIndexCreationTask&lt;Dto&gt;
{
    public MyIndex()
    {
        Map = dtos => dtos.Select(x => new
        {
            OriginalName = x.Name,
            Name = x.Name
        });

        Analyze(x => x.Name, "LanguageAnalyzer");
        Index("OriginalName", FieldIndexing.Search);
    }
}
Lowercased term without diacritics
In the example above I used StandardAnalyzer, which performs whitespace-based tokenization: it splits your text into words and creates an array of terms. If you want to treat the whole sentence as a single term, we suggest using KeywordAnalyzer as the base class instead.
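A sketch of that variant (the class name LanguageKeywordAnalyzer is our own choice): KeywordAnalyzer emits the entire input as a single token, and we wrap its stream in the same ASCIIFoldingFilter as before.

```csharp
using System.IO;
using Lucene.Net.Analysis;

namespace Analyzer;

// Treats the whole input as one term, then folds diacritics to ASCII,
// so "Stare Miasto, Kraków" becomes the single term "Stare Miasto, Krakow".
public class LanguageKeywordAnalyzer : KeywordAnalyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new ASCIIFoldingFilter(base.TokenStream(fieldName, reader));
    }
}
```

Note that KeywordAnalyzer does not lowercase, so exact-term matching stays case-sensitive unless you add a lowercase filter yourself.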
Querying
To have your custom analyzer applied to the query text, you have to use the search() method. The search engine applies the same analyzer that was used during indexing to the text passed in search(FieldName, queried_text). For example, when queried_text is Kraków, the searched term (after transformation) will be krakow.
For the other methods in the WHERE clause, RavenDB uses the default analyzer (lowercasing only). To learn more about the possibilities of this method, visit the search documentation.
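Putting it together, a query against the index defined above could look like the following sketch (assuming an open IDocumentSession named session, and the Dto and MyIndex types from earlier):

```csharp
using System.Linq;
using Raven.Client.Documents.Linq;

// Search() runs the field's analyzer over the query text, so "Kraków"
// is folded to "krakow" and matches documents stored with either spelling.
var results = session.Query<Dto, MyIndex>()
    .Search(x => x.Name, "Kraków")
    .ToList();
```

A plain Where(x => x.Name == "Kraków") would not go through the custom analyzer, which is why search() is required here.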
It is worth mentioning that since RavenDB 6.0, we have changed the way we handle search queries with wildcards. Queries that contain prefix and/or suffix wildcards are transformed with custom analyzers as well. That means a query like "mün*" will be analyzed before being sent to the search engine: the analyzer will remove all non-alphabetic characters ("mün*" => "mun") and will not perform a prefix search. You can learn more about it here.
Summary
Dealing with diacritics may be done manually in the application, as we saw in the first approach. That gives the user full flexibility over what to handle and how, but it requires care whenever the application is extended, because the normalization has to be repeated manually everywhere. Alternatively, users can define a custom analyzer and move the transformation layer into the database itself, without changing any data in the client application, and then forget about it during application development.
Woah, already finished? 🤯
If you found the article interesting, don’t miss a chance to try our database solution – totally for free!