RavenDB 5.1 Features: Searching in Office documents

For a long time, whenever I tried to explain how RavenDB is a document database, people immediately assumed that I’m talking about Office documents (Word, Excel, etc) and that I’m building a SharePoint clone. Explaining that documents are a different way to model data has been a repeated chore, and we still get prospects asking about RavenDB’s office integration.

RavenDB 5.1 has a new feature, Nuget integration, which allows you to integrate Nuget packages into RavenDB’s indexes. Turns out, it takes very little code to allow RavenDB to search inside Office documents. Let’s consider a legal case system, where we track the progression of legal cases, the work done on them, billing, etc. As you can imagine, the amount of Word and Excel documents that are involved is… massive. Making sense of all of that information can be pretty hard. Here is how you can help your users with the use of RavenDB.

Here is the Filing/Search index definition:

As you can see, we are using two new features in RavenDB 5.1:

The LoadAttachment() / GetContentAsStream() methods, which expose the attachments to the indexing engine.
The Office.GetWordText() / Office.GetExcelText() methods, which extract the text from the relevant documents to be indexed by RavenDB.

Aside from that, this is a fairly standard index, we mark the Documents field as full text search (in red in the image below). There is also the yellow markers in the image, what are they for?

No, RavenDB didn’t integrate directly with Office, instead, we make use of the new Additional Assemblies (and the existing Additional Sources) to bring you this functionality. Let’s see how that works, shall we?

We tell RavenDB that for this index, we want to pull the NuGet package DocumentsFormat.OpenXml. And it will just happen, which means that we have the full power of this package in your indexes. In fact, this is exactly what we do. Here is the content of the Additional Sources:

What this code does is use the DocumetnsFormat.OpenXml package to read the data inside the provided attachments. We extract the text from them and then provide it to the RavenDB indexing engine, which enable us to do full text search on the content of attachments.

In effect, within the space of a single blog post, you can turn your RavenDB instance to a document indexing system.

Here is how we can query the data:

And the result is here:

And here is the relevant term inside the Office documents:

As you can imagine, this is a very exciting capability to add to RavenDB. There is much more that you can do with the ability to integrate such capabilities directly into your database.

RavenDB

RavenDB Cloud

Try

Experience interactive demos and playground server

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Unlock your business potential

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events

What’s New

Roadmap

On-premise Pricing

Cloud Pricing

Support

Proof of Concept Program

Academic Program

RavenDB 5.1 Features: Searching in Office documents

Woah, already finished? 🤯

Related Articles

RavenDB’s storage engine: Voron–unlocking the secret

Certificates from the Ground Up

Using analyzers in RavenDB indexing

Watch Live Demo

RavenDB

RavenDB Cloud

Try

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events