see on GitHub

Ongoing Tasks: RavenDB ETL


  • RavenDB ETL Task creates an ETL process for a given database when the destination is another RavenDB database.

  • It can be defined in code or using the Studio by creating a RavenDB ETL task under Settings -> Manage Ongoing Tasks.

  • One RavenDB ETL task can have multiple transformation scripts.

  • Each script can be defined on a single collection, multiple selected collections or be applied to all documents regardless of the associated collection.

  • The script is executed per document once the document is created or modified.

  • In this page:

Figure 1. Configure RavenDB ETL task

RavenDB ETL in Studio

Transformation Script Options

Loading Documents

  • In order to load data to the destination database you must call the loadTo<CollectionName>() method and pass a JS object.

  • Indicating the collection name in the load method is a convention.

  • The objects passed to the loadTo method will be sent to the target database in the last Load stage.

  • All results created in a single ETL run will be sent in a single batch and processed transactionally on the destination.

  • For example, if you want to write data to Employees collection you need to call the following method in the script body:

loadToEmployees({ ... });

  • The method parameter must be a JS object. You can create it as follows:

loadToEmployees({
    Name: this.FirstName + " " + this.LastName
});

  • Or simply transform the current document object and pass it:

this.Name = this.FirstName + " " + this.LastName;

delete this.Address;
delete this.FirstName;
delete this.LastName;

loadToEmployees(this);

Documents Identifiers

  • The documents generated in the destination database are given an id according to the collection name specified in the loadTo method.

  • If the specified collection is the same as the original one then the document is loaded to the same collection and the original identifier is preserved.

  • For example, the following ETL script defined on Employees collection will keep the same identifiers in the target database:

// original identifier will be preserved
loadToEmployees({ ... });

  • If the 'loadTo' method indicates a different target collection, e.g. People,
    then the employee documents will get new identifiers that combine the original ID and the new collection name in the destination database.

// new identifier will be generated
loadToPeople({ ... });

  • In addition, ETL appends the symbol / to the requested id so that the target database will generate identifiers on its side.
    As a result, documents in the People collection in the target database will have identifiers such as: employees/1-A/people/0000000000000000001-A.

Filtering

  • Documents can be filtered from the ETL by calling the loadTo method only for documents that match some condition:

if (this.Active) {
    // load only active users
    loadToEmployees({ ... });
}

Loading Data from Other Documents

  • The load method loads a document with a specified ID during script execution.

// this.ReportsTo has some document ID
var manager = load(this.ReportsTo);

Accessing Metadata

  • The metadata can be accessed in the following way:

var value = this['@metadata']['custom-metadata-key'];

Creating Multiple Documents from a Single Document

  • The loadTo method can be called multiple times in a single script.
    That allows you to split a single source document into multiple documents on the destination database:

// documents will be created in `Addresses` collection
loadToAddresses({
    City: this.Address.City ,
    Country: this.Address.Country ,
    Address: this.Address.Line1
});

delete this.Address;

// documents will be created in the `Employees` collection
loadToEmployees(this);

Empty Script

  • An ETL task can be created with an empty script.
  • The documents will be transferred without any modifications to the same collection as the source document.

Attachments

  • Attachments are sent automatically when you send a full collection to the destination using an empty script.
  • If you use a script you can indicate that an attachment should also be sent by using dedicated functions:

    • loadAttachment(name) returns a reference to an attachment that is meant be passed to addAttachment()
    • <doc>.addAttachment([name,] attachmentRef) adds an attachment to a document that will be sent in the process, <doc> is a reference returned by loadTo<CollectionName>()

Sending attachments together with documents

  • Attachment is sent along with a transformed document if it's explicitly defined in the script by using addAttachment() method. By default attachment name is preserved.
  • The below script sends all attachments of a current document by taking advantage of getAttachments() function, loads each of them during transformation and adds to a document that will be sent to 'Users' collection on destination side

var doc = loadToUsers(this);

var attachments = getAttachments();

for (var i = 0; i < attachments.length; i++) {
    doc.addAttachment(loadAttachment(attachments[i].Name));
}

Changing attachment name

  • If addAttachment() is called with two arguments, the first one can indicate a new name for an attachment. In the below example attachment photo will be sent and stored under picture name.
  • In order to check the existence of an attachment hasAttachment() function is used

var employee = loadToEmployees({
    Name: this.FirstName + " " + this.LastName
});

if (hasAttachment('photo')) {
  employee.addAttachment('picture', loadAttachment('photo'));
}

Loading non existing attachment

  • Function loadAttachment() returns null if a document doesn't have an attachment with a given name. Passing such reference to addAttachment() will be no-op and no error will be thrown.

Accessing attachments from metadata

  • The collection of attachments of the currently transformed document can be accessed either by getAttachments() helper function or directly from document metadata:

var attachments = this['@metadata']['@attachments'];

Counters

  • Counters are sent automatically when you send a full collection to the destination using an empty script.
  • If a script is defined RavenDB doesn't send counters by default.
  • In order to indicate that a counter should also be sent, the behavior function needs to be defined in the script which decides if the counter should be sent if it's modified (e.g. by increment operation). It the relevant function doesn't exist a counter isn't loaded.
  • The reason that counters require special functions is that incrementing a counter doesn't modify the change vector of a related document so the document isn't processed by ETL on a counter change.
  • Another option of sending a counter is to explicitly add it in a script to a loaded document.

Counter behavior function

  • Every time a counter of a document from a collection that ETL script is defined on is modified then the behavior function is called to check if the counter should be loaded to a destination database.

Important

The counter behavior function can be defined only for counters of documents from collections that are ETLed to the same collections e.g.: a script is defined on Products collection and it loads documents to Products collection in a destination database using loadToProducts() method.

  • The function is defined in the script and should have the following signature:

function loadCountersOf<CollectionName>Behavior(docId, counterName) {
   return [true | false]; 
}

  • <CollectionName> needs to be substituted by a real collection name that ETL script is working on (same convention as for loadTo method)

  • The first parameter is the identifier of a document, the second one is the name of a modified counter for that doc.

  • If function returns true then a change value is propagated to a destination.

Example

  • The following script is defined on Products collection:

if (this.Category == 'software') {
    loadToProducts({
        ProductName: this.Name
    });
}

function loadCountersOfProductsBehavior(docId, counterName) {
   var doc = load(docId);

   if (doc.Category == 'software' && counterName = 'downloads')
        return true;
}

Adding counter explicitly in a script

  • The usage of counter behavior functions is limited to dealing with counters of documents that are loaded to the same collection. If a transformation script for Employees collection specifies that they are loaded to People collection in a target database then due to document ID generation strategy by ETL process (see above 'Documents Identifiers' paragraph) the counters won't be sent as the final ID of loaded document isn't known on the source side. Although you can use special functions in the script code to deal with counters then:

var person = loadToPeople({ Name: this.Name + ' ' + this.LastName });

person.addCounter(loadCounter('likes'));

  • The above example indicates that likes counter will be sent together with a document. It uses the following functions to accomplish that:
    • loadCounter(name) returns a reference to a counter that is meant be passed to addCounter()
    • <doc>.addCounter(counterRef) adds a counter to a document that will be sent in the process, <doc> is a reference returned by loadTo<CollectionName>()

Important

As the transformation script is run on a document update then counters added explicitly (addCounter()) will be loaded along with documents only if document is changed. It means that incremented counter value won't be sent until a document is modified and the ETL process will run the transformation for it.

Counter value override by ETL

Counters sent by ETL process always override the existing value on the destination. ETL doesn't send increment counter command but it sets the value using put command.

Revisions

  • Revisions are not sent by the ETL process.
    But, if revisions are configured on the destination database, then when the target document is overwritten by the ETL process a revision will be created as expected.

Deletions

  • As described above, the identifiers created for the sent documents can be different from the original source documents identifiers.
    The source isn't aware of the new IDs created so documents deletion requires a special approach.

  • In order to remove the matching documents on the destination side, the deletion of a document sends a command to remove documents that have an ID with a well-known prefix.

  • For example:

    1. Document employees/1-A is processed by ETL and put into People collection with ID:
      employees/1-A/people/0000000000000000001-A.
    2. Deletion of the employees/1-A document on the source side triggers sending a command that deletes documents having the following prefix in their ID: employees/1-A/people/.
  • If you output multiple documents from a single document then multiple delete commands will be sent, one for each prefix containing the destination collection name.

  • When documents are sent to the same collection and IDs don't change then deletion on the source results in sending a single delete command for a given ID.

  • Deletions can be filtered by defining deletion behavior functions in the script.

Collection specific function deletion handling

In order to define deletion handling for a specific collection use the following signature:

function deleteDocumentsOf<CollectionName>Behavior(docId) {
   return [true | false]; 
}

  • <CollectionName> needs to be substituted by a real collection name that ETL script is working on (same convention as for loadTo method)
  • The first parameter is the identifier of a deleted document.

Generic function for deletion handling

Another option is the usage of generic function for deletion handling:

function deleteDocumentsBehavior(docId, collection) {
   return [true | false]; 
}

  • The first parameter is the identifier of a deleted document.
  • The second parameter is the name of a collection.

  • A document deletion is propagated to a destination only if the function returns true.

  • By the time that ETL process runs a delete behavior function, a document is already deleted. If you want to filter deletions, you need some way to store that information in order to be able to determine if a document should be deleted in the delete behavior function.

Example - filtering out all deletions

loadToUsers(this);

function deleteDocumentsOfUsersBehavior(docId) {
    return false;
}

Example - storing deletion info in additional document

  • When you delete a document you can store a deletion marker document that will prevent from propagating the deletion by ETL. In the below example if LocalOnlyDeletions/{docId} exists then we skip this deletion during ETL. You can add @expires tag to the metadata when storing the marker document, so it would be automatically cleanup after a certain time by the expiration extension.

loadToUsers(this);

function deleteDocumentsOfUsersBehavior(docId) {
    var localOnlyDeletion = load('LocalOnlyDeletions/' + docId);

    return !localOnlyDeletion;
}

Example - filtering deletions using generic function

  • If you define ETL for all documents, regardless a collection they belong to, then the generic function might become very handy to filter out deletions using a collection name

function deleteDocumentsBehavior(docId, collection) {
    return 'Users' != collection;
}

Example

  • The following is an example of a RavenDB ETL script processing documents from Employees collection:

var managerName = null;

if (this.ReportsTo !== null)
{
    var manager = load(this.ReportsTo);
    managerName = manager.FirstName + " " + manager.LastName;
}

// load documents to `Employees` collection in the destination
loadToEmployees({
    // the loaded documents will have these fields:
    Name: this.FirstName + " " + this.LastName,
    Title: this.Title,
    BornOn: new Date(this.Birthday).getFullYear(),
    Manager: managerName
});