Expanding Indexes with NuGet Packages
Using Additional Assemblies, index capabilities can now be significantly expanded with libraries imported from NuGet and other sources. This makes it possible to integrate a variety of existing technologies into the logic of your RavenDB index, such as:
- Machine Learning (ML) – see this blog post by Oren Eini for a step-by-step example of using image recognition to index pictures with descriptive tags.
- Optical Character Recognition (OCR) Image Processing – scan flat images to see if text can be extracted.
- File-type Conversion – see this blog post for an example of how to scan Word and Excel files (.docx and .xlsx) and extract text which can then be indexed.
These capabilities and many others can now be integrated directly into your indexes. Besides NuGet, Additional Assemblies also allows you to import libraries from the runtime or from a local folder.
For the first time, attachment content can be indexed using static (or ‘custom’) indexes. Attachments are a kind of data that is associated with a document, but either it can’t be expressed as JSON, or we’d prefer to load and modify it separately from the document itself. Examples of attachments might be images, audio, or just pure binary data. Previously you could only index documents according to the names of their attachments. Now the attachment content itself can be indexed, as well as the attachment metadata.
Integrating Machine Learning into RavenDB
In this blog post, Oren Eini demonstrates how machine learning can be integrated into a RavenDB attachment index, producing an index that uses image recognition to classify and tag images. This takes advantage of the existing index extension feature.
Indexing Other Document Extensions
5.1 allows indexing of the other two kinds of document extensions as well: Time Series and Counters.
RavenDB 5.0 has the ability to automatically and seamlessly share data between servers. In 5.1 we offer our most advanced replication capabilities yet, granting more flexibility and control.
Revamped Pull Replication
The Pull Replication feature in 5.0 allowed you to configure servers to serve as pull replication ‘hubs’ and pull replication ‘sinks’, such that the replication was always initiated by the sink, which then received information from the hub. In 5.1 we’ve expanded this feature to allow information to be replicated from the sink to the hub as well. We’ve renamed the feature to simply ‘Replication’, which is assigned to ‘replication hubs’ and ‘replication sinks’.
In 5.1 Filtered Replication grants you fine-grained control over the replication process. This is especially useful when you want to protect sensitive data inside a network, or grant a user only partial access to information. For example, suppose there is a network of health clinics with one central database, and additional local servers at each clinic. While lots of information needs to pass between the central database and the clinics, we want to limit who can read a given patient’s information – and who can modify it.
Filtered replication uses document IDs to determine which servers have access to read and write which data. When you first establish replication between two RavenDB servers, you can configure them to only be able to send and receive documents with certain IDs, or with certain ID prefixes. Our network of clinics might want to allow Doctor Alice Smith to read only documents with these prefixes:
And write only to documents with these prefixes:
Time Series were introduced in 5.0 as a powerful and efficient way of tracking data over time. 5.1 introduces several improvements to Time Series that make them even more useful and easier to work with.
Gap Filling is an entirely new feature that enables you to fill gaps in a Time Series by interpolating the missing values. There are two interpolation modes: linear and nearest. ‘Linear’ interpolation calculates the values of the new entries by assuming a straight line between the existing data points on either side. ‘Nearest’ gives each new entry the same value as the existing entry nearest to it.
Gap filling data is added to the results of queries. A Time Series query with an aggregation can include the clause `with interpolation()`, which takes the interpolation mode. This is great for data processing techniques that rely on there always being a value at each point. Interpolation can also be the first step in analyzing the data or predicting missing values.
This image shows the original data in blue, which had a resolution of one entry per minute. The red line is the result of creating additional entries for each second, using interpolation mode `nearest`. The value at each second is equal to that of the nearest existing value.
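The two modes described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not RavenDB's implementation: the function name `fill_gaps`, the use of integer seconds as timestamps, and the rule that ties in ‘nearest’ mode go to the earlier entry are all assumptions made for the example.

```python
from bisect import bisect_left

def fill_gaps(points, step, mode="linear"):
    """Return one (time, value) entry per `step` between the first and last point.

    `points` is a list of (time, value) pairs sorted by time (hypothetical layout).
    """
    times = [t for t, _ in points]
    filled = []
    t = times[0]
    while t <= times[-1]:
        i = bisect_left(times, t)
        if times[i] == t:
            # An original entry exists at this time, keep its value.
            value = points[i][1]
        else:
            (t0, v0), (t1, v1) = points[i - 1], points[i]
            if mode == "linear":
                # Straight line between the neighbours on either side.
                value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            elif mode == "nearest":
                # Copy the closer neighbour; ties go to the earlier one (assumption).
                value = v0 if (t - t0) <= (t1 - t) else v1
            else:
                raise ValueError(f"unknown interpolation mode: {mode}")
        filled.append((t, value))
        t += step
    return filled

# Two original entries one minute apart, filled in at 20-second resolution.
original = [(0, 10.0), (60, 40.0)]
linear = fill_gaps(original, 20, "linear")
nearest = fill_gaps(original, 20, "nearest")
```

With linear interpolation the new entries climb evenly between the two originals, while with nearest they repeat whichever original value is closer in time.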
You can also apply a scaling factor to the results of a Time Series query, so that all the values are multiplied by a specified value. In RQL this is done with the `scale` keyword. This can be useful for processing and graphing data that is related but at different scales. Stocks, for instance, are more interesting for their percent rise and fall in value than for the absolute price of each stock. So you could scale stock prices or currencies to make them display on the same graph.
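As a rough illustration of the stock example, the sketch below multiplies each series by a single factor so that both start at 100 and can share one axis. The helper `scale` and the price data are hypothetical, and picking the factor as 100 divided by the first price is just one possible choice, not something RavenDB does for you.

```python
def scale(values, factor):
    """Multiply every value in a series by the same factor."""
    return [v * factor for v in values]

# Hypothetical closing prices for two stocks at very different price levels.
stock_a = [50.0, 55.0, 52.5]
stock_b = [400.0, 440.0, 380.0]

# Scale each series so that its first entry becomes 100.
scaled_a = scale(stock_a, 100 / stock_a[0])
scaled_b = scale(stock_b, 100 / stock_b[0])
```

After scaling, both series read as a percentage of their starting price, which makes their relative movement directly comparable.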
Time Series can now be included in RavenDB ETL.
Streaming Time Series data with queries is now supported.
By default, queries are satisfied with the values stored in the appropriate index, and if the index doesn’t contain them, they are retrieved from the documents themselves. Queries now have a configurable projection behavior with these five modes:
These determine what the server does when a field exists under the same name both in a document and in the index entry for that document.
The advantage of storing values in an index is that they are retrieved faster than data read directly from the documents, with the disadvantage that this increases the index’s storage size.
Projection behavior grants you more fine-tuned control over queries and provides more ways of taking advantage of fields stored in the index.
Introduced in RavenDB 5.0, Documents Compression now leaves its experimental phase and is ready for your project.
Documents Compression uses the top-of-the-line Zstd compression algorithm to learn your data model and create dictionaries that represent data repeated across documents. This includes repetitions in document structure as well as in the data itself.
In most datasets, this reduces storage size by more than 50% and in some cases more than 80%.
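To see why a dictionary shared across documents helps, here is a minimal sketch using Python's standard `zlib` module as a stand-in for Zstd. It is not how RavenDB trains or stores its dictionaries: the sample documents are made up, and using one existing document as the whole dictionary is a deliberate simplification of real dictionary training.

```python
import json
import zlib

def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Compress `data`, optionally priming the compressor with a shared dictionary."""
    if dictionary:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()

def decompress(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Reverse `compress`; the same dictionary must be supplied if one was used."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(blob) + decompressor.flush()

def encode(doc: dict) -> bytes:
    return json.dumps(doc, sort_keys=True).encode("utf-8")

# Hypothetical documents that share structure and some values.
sample_doc = {"Name": "Alice Smith", "Role": "Doctor",
              "Address": {"City": "Seattle", "Country": "USA"},
              "@metadata": {"@collection": "Employees"}}
new_doc = {"Name": "Bob Jones", "Role": "Nurse",
           "Address": {"City": "Seattle", "Country": "USA"},
           "@metadata": {"@collection": "Employees"}}

dictionary = encode(sample_doc)   # naive stand-in for a trained dictionary
raw = encode(new_doc)

without_dictionary = compress(raw)
with_dictionary = compress(raw, dictionary)
```

Because the field names and repeated values are already present in the dictionary, the second document compresses to fewer bytes with the dictionary than on its own, which is the effect that matters for collections of small, similarly shaped documents.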