You are currently browsing legacy 5.3 version of documentation. Click here to switch to the newest 6.0 version.

We can help you with migration to the latest RavenDB

see on GitHub

Ongoing Tasks: ETL Basics

ETL (Extract, Transform & Load) is a three-stage RavenDB process that transfers data from a RavenDB database to an external target. The data can be filtered and transformed along the way.
The external target can be:
- Another RavenDB database instance (outside of the Database Group)
- A relational database
In this page:

Why use ETL

Share relevant data
Data that needs to be shared can be sent in a well-defined format matching your requirements so that only relevant data is sent.
Protect your data - Share partial data
Limit access to sensitive data. Details that should remain private can be filtered out as you can share partial data.
Reduce system calls
Distribute data to related services in your system architecture so that they have their own copy of the data and can access it without making a cross-service call.
i.e. A product catalog can be shared among multiple stores where each can modify the products or add new ones.
Transform the data
- Modify content sent as needed with JavaScript code.
- Multiple documents can be sent from a single source document.
- Data can be transformed to match a rational model used in the target destination.
Aggregate your data
Data sent from multiple locations can be aggregated in a central server.

For example:

Send data to an already existing reporting solution.
Point of sales systems can send sales data to a central place for calculations.

Defining ETL Tasks

The following ETL tasks can be defined:
- RavenDB ETL - send data to another RavenDB database
- SQL ETL - send data to an SQL database
- OLAP ETL - send data to an OLAP destination
- Elasticsearch ETL - send data to an Elasticsearch destination
The destination URL address is set by using a pre-defined named connection string.
This makes deployment between environments easier.
For RavenDB ETL, multiple URLs can be configured in the connection string as the target database can reside on multiple nodes within the Database Group in the destination cluster. If one of the destination nodes is down, RavenDB will run the ETL process against another node in the Database Group topology.
See more in Connection Strings
The tasks can be defined from code or from the Studio

ETL Stages

ETL's three stages are:

Extract - Extract the documents from the database
Transform - Transform & filter the documents data according to the supplied script (optional)
Load - Load (write) the transformed data into the target destination

Extract

The ETL process starts with retrieving the documents from the database.
You can choose which documents will be processed by next two stages (Transform and Load).

The possible options are:

Documents from a single collection
Documents from multiple collections
All documents (RavenDB ETL & OLAP ETL only)

Transform

This stage transforms and filters the extracted documents according to a provided script.
Any transformation can be done so that only relevant data is shared.
The script is written in JavaScript and its input is a document.

A task can be provided with multiple transformation scripts.
Different scripts run in separate processes, allowing multiple scripts to run in parallel.

In addition to the ECMAScript 5.1 API, RavenDB introduces the following functions and members:

`this`	object	The current document (with metadata)
`id(document)`	function	Returns the document ID
`load(id)`	function	Load another document. This will increase the maximum number of allowed steps in a script. Note: Changes made to the other loaded document will not trigger the ETL process.

Specific ETL functions:

`loadTo`	function	Load an object to a specified target. There are several possible flavors to the syntax of this command, find the general details below and more in the documentation for each ETL type. Note: An object will be sent to the destination only if the `loadTo` method was called.
Attachments:
`loadAttachment(name)`	function	Load an attachment of the current document
`hasAttachment(name)`	function	Check if an attachment with a given name exists for the current document
`getAttachments()`	function	Get a collection of attachments details for the current document. Each item has the following properties `Name`, `Hash`, `ContentType`, `Size`
`<doc>.addAttachment([name,] attachmentRef)`	function	Add an attachment to a transformed document that will be sent to a target (`<doc>`). Specific for Raven ETL only, see here for details

Batch processing

Documents are extracted and transformed by the ETL process in a batch manner.
The number of documents processed depends on the following configuration limits:

ETL.ExtractAndTransformTimeoutInSec (default: 30 sec)
Time-frame for the extraction and transformation stages (in seconds), after which the loading stage will start.
ETL.MaxNumberOfExtractedDocuments (default: 8192)
Maximum number of extracted documents in an ETL batch.
ETL.MaxNumberOfExtractedItems (default: 8192)
Maximum number of extracted items (documents, counters) in an ETL batch.
ETL.MaxBatchSizeInMb (default: 64 MB)
Maximum size of an ETL batch in MB.

Load

Loading the results to the target destination is the last stage.

Syntax

An object can generally be loaded to a specified target using one of the below templates:

The target is specified as a part of the loadTo command: loadToTarget(obj)
E.g., loadToOrders(obj)
The target is specified as an argument of the loadTo command: loadTo('Target', obj)
E.g., loadTo('Orders', obj)
- The target name in this syntax is not a variable and cannot be used as one: it is simply a string literal of the target's name.
- Separating the target name from the loadTo command makes it possible to include symbols like - and . in target names. This is not possible when the standard loadToOrders syntax is used because special characters are invalid in the name of a JS function.
- Note that the general syntax specified above, loadTo('Target', obj), changes for some ETL types. Find the accurate syntax for each ETL type in the type's specific documentation.

The target must be:

RavenDB ETL: a collection name
SQL ETL: a table name
OLAP ETL: a Folder name
Elasticsearch ETL: an Index name
Kafka ETL: a topic name

Updates are implemented by executing consecutive DELETEs and INSERTs. When a document is modified, the delete command is sent before the new data is inserted and both are processed under the same transaction on the destination side. This applies to both ETL types.

There are two exceptions to this behavior:

In RavenDB ETL, when documents are loaded to the same collection there is no need to send DELETE because the document on the other side has the same identifier and will just update it.
in SQL ETL you can configure to use inserts only, which is a viable option for append-only systems.

Note

In contrast to Replication, ETL is a push-only process that writes data to the destination whenever documents from the relevant collections were changed. Existing entries on the target will always be overwritten.

Loading data from encrypted database

If a database is encrypted then you must not send data in ETL process using a non encrypted channel by default. It means that a connection to a target must be secured:

In Raven ETL, a URL of a destination server has to use HTTPS (a server certificate of the source server needs to be registered as a client certificate on the destination server)
in SQL ETL, a connection string to an SQL database must specify encrypted connection (specific per SQL engine provided)

This validation can be turned off by selecting the Allow ETL on a non-encrypted communication channel option in the Studio (or setting AllowEtlOnNonEncryptedChannel if a task is defined using the client API). Please note that your data encrypted at rest won't be protected in transit then.

Troubleshooting

ETL errors and warnings are logged to the files and displayed in the notification center panel. You will be notified if any of the following events happen:

connection error to the target
JS script is invalid
transformation error
load error
slow SQL was detected

Fallback Mode

If the ETL cannot proceed the load stage (e.g. it can't connect to the destination) then it enters the fallback mode. The fallback mode means suspending the process and retrying it periodically. The fallback time starts from 5 seconds and it's doubled on every consecutive error according to the time passed since the last error, but it never crosses ETL.MaxFallbackTimeInSec configuration (default: 900 sec)

Once the process is in the fallback mode, then the Reconnect state is shown in the Studio.

Details and examples for type specific ETL scripts can be found in the following articles:

see on GitHub

RavenDB

RavenDB Cloud

Try

Experience interactive demos and playground server

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Unlock your business potential

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events

What’s New

Roadmap

On-premise Pricing

Cloud Pricing

Support

Proof of Concept Program

Academic Program

Ongoing Tasks: ETL Basics

Why use ETL

Defining ETL Tasks

ETL Stages

Extract

Transform

Batch processing

Load

Syntax

Note

Loading data from encrypted database

Troubleshooting

Fallback Mode

Related Articles

ETL

Studio

RavenDB

RavenDB Cloud

Try

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Use Cases