RavenDB ETL Task



RavenDB ETL Task - Definition

Figure 1. RavenDB ETL Task Definition

Create New RavenDB ETL Task

  1. Task Name (Optional)

    • Enter a name for the task
    • If no name is given, the RavenDB server will create one for you, based on the defined connection string
  2. Preferred Node (Optional)

    • Select a preferred mentor node from the Database Group to be the responsible node for this RavenDB ETL Task
    • If not selected, then the cluster will assign a responsible node (see Members Duties)
  3. Connection String

    • Select an existing connection string from the list or create a new one
    • The connection string defines the destination database and the URLs of the server nodes in its database group
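
The three settings above can be pictured as a small data structure. The following is only an illustrative sketch in Python, not the RavenDB client API: the class names, field names, connection string name and URLs are all hypothetical. Only the destination database name db3 is taken from the example later on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConnectionString:
    # Hypothetical model: a named pointer to the destination database
    # and the URLs of the nodes in its database group.
    name: str
    database: str
    topology_discovery_urls: List[str]


@dataclass
class EtlTaskDefinition:
    # Hypothetical model of the task definition form.
    connection_string: ConnectionString
    name: Optional[str] = None            # None: the server generates a name
    preferred_node: Optional[str] = None  # None: the cluster picks the responsible node


task = EtlTaskDefinition(
    connection_string=ConnectionString(
        name="to-db3",  # illustrative name
        database="db3",
        topology_discovery_urls=["http://dest-a:8080", "http://dest-b:8080"],
    )
)
```

Only the connection string is supplied here. The task name and preferred node are left unset, which corresponds to leaving the two optional fields empty in the form.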

RavenDB ETL Task - Transform Scripts

Figure 2. RavenDB ETL Task - Transform Scripts

  1. Click to add a new script

  2. Edit or Delete an existing script

  3. Enter the script to use.
    In the above example, each source document from the 'Products' collection is sent to the 'ProductsInfo' collection in the destination database db3 (which is external to the cluster).
    Each new document will have two fields: 'ProductName' and 'SupplierName'.
    For detailed script options see Transformation Script Options.

  4. By default, updates to the ETL script are not applied to documents that were already sent.
    When this option is checked, RavenDB restarts the ETL process for this script from scratch ("beginning of time"),
    rather than applying the update only to new or updated documents.

  5. Select the collections that the script applies to, or apply the script to all collections
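
The example script, the reset option and the collection selection can be modelled together. The sketch below is a plain Python model under stated assumptions, not RavenDB code: the real script is written in JavaScript in the Studio. The sample documents, the "@etag" counter used to track progress, and the function names are hypothetical. For simplicity, output documents are keyed by the source document id.

```python
# Hypothetical in-memory source database; names and values are illustrative.
SOURCE = {
    "suppliers/1": {"Name": "Exotic Liquids", "@collection": "Suppliers", "@etag": 1},
    "products/1": {"Name": "Chai", "Supplier": "suppliers/1",
                   "@collection": "Products", "@etag": 2},
    "products/2": {"Name": "Chang", "Supplier": "suppliers/1",
                   "@collection": "Products", "@etag": 3},
}


def transform(doc, source):
    """Build the two-field document destined for 'ProductsInfo'."""
    supplier = source[doc["Supplier"]]  # look up the related supplier document
    return {"ProductName": doc["Name"], "SupplierName": supplier["Name"]}


def run_script(source, collections, last_processed_etag, reset=False):
    """Return (documents to send, new last processed etag).

    reset=True models restarting the script from the "beginning of time".
    """
    start = 0 if reset else last_processed_etag
    to_send = {}
    newest = start
    for doc_id, doc in sorted(source.items(), key=lambda kv: kv[1]["@etag"]):
        if doc["@etag"] <= start:
            continue  # already handled in an earlier run
        newest = max(newest, doc["@etag"])
        if doc["@collection"] not in collections:
            continue  # collection not selected for this script
        to_send[doc_id] = transform(doc, source)
    return to_send, newest


# Default behaviour: only documents newer than the saved progress are sent.
sent, etag = run_script(SOURCE, {"Products"}, last_processed_etag=2)

# With the reset option, every 'Products' document is processed again.
sent_all, _ = run_script(SOURCE, {"Products"}, last_processed_etag=2, reset=True)
```

In this model, `sent` contains only products/2, while `sent_all` contains both product documents. The supplier document is never sent because its collection is not selected.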

RavenDB ETL Task - Details in Tasks List View

Figure 3. RavenDB ETL Task - Task List View

Tasks List View Details

  1. RavenDB ETL Task Details:

    • Task Status - Active / Not Active / Not on Node / Reconnect
    • Connection String - The connection string used
    • Destination Database - The destination database to which the data is being sent
    • Actual Destination URL - The server URL to which the data is actually being sent,
      that is, the one currently in use out of the available Topology Discovery URLs
    • Topology Discovery URLs - The list of available server URLs in the destination Database Group
  2. Graph view:
    Graph view of the responsible node for the RavenDB ETL Task

RavenDB ETL Task - Offline Behaviour

  • When the source cluster is down (and there is no leader):

    • Creating a new Ongoing Task is a cluster-wide operation,
      so a new Ongoing RavenDB ETL Task cannot be scheduled.

    • If a RavenDB ETL Task was already defined and active when the cluster went down,
      then the task will not be active and no data will be ETL'ed.

  • When the node responsible for the ETL task is down:

    • Another node from the Database Group will take ownership of the task,
      so that the ETL process continues executing.
  • When the destination node is down:

    • The ETL process will wait until the destination is reachable again and proceed from where it left off.

    • If the destination is a cluster and the URLs of the destination database group nodes are listed in the connection string, then when the destination node is down, RavenDB ETL will simply start transferring data to one of the other listed nodes.
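
The destination-side behaviour described above (wait while nothing is reachable, fail over to another listed node, resume from where the process left off) can be sketched as follows. This is a simplified Python model, not RavenDB's implementation: the function names, the URLs, the numeric etag used to record progress, and the choice of the first reachable URL are all assumptions.

```python
from typing import Callable, List, Optional, Tuple


def pick_destination(urls: List[str],
                     is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable URL, or None if every node is down."""
    for url in urls:
        if is_reachable(url):
            return url
    return None


def send_pending(docs: List[Tuple[int, str]], last_sent_etag: int,
                 urls: List[str], is_reachable: Callable[[str], bool]):
    """docs is a list of (etag, document id) pairs.

    Returns (target URL or None, ids sent, new last sent etag).
    """
    target = pick_destination(urls, is_reachable)
    if target is None:
        # Nothing reachable: wait, and keep the saved progress unchanged.
        return None, [], last_sent_etag
    pending = [(etag, doc_id) for etag, doc_id in sorted(docs)
               if etag > last_sent_etag]
    new_last = pending[-1][0] if pending else last_sent_etag
    return target, [doc_id for _, doc_id in pending], new_last


urls = ["http://dest-a:8080", "http://dest-b:8080"]  # hypothetical node URLs
docs = [(1, "products/1"), (2, "products/2"), (3, "products/3")]

# Every destination node is down: nothing is sent and progress stays at etag 1.
waiting = send_pending(docs, 1, urls, lambda url: False)

# Only the second node is up: the remaining documents go there.
resumed = send_pending(docs, 1, urls, lambda url: url.endswith("dest-b:8080"))
```

Because progress is only advanced after a reachable destination is found, the second call picks up exactly the documents that were not sent before the outage.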

RavenDB ETL Task -vs- Replication Task

  1. Data ownership:

    • When a RavenDB node performs ETL to another node, it is not replicating the data; it is writing it.
      In other words, ETL always overwrites whatever exists on the other side, and there is no conflict handling.

    • The source database for the ETL process is the owner of the data.
      This means that any modifications done to the ETL'ed data on the destination database side are lost when overwriting occurs.

    • If you need to modify the ETL'ed data on the destination side, you should create a companion document on the destination database instead of modifying the ETL'ed data directly.
      The rule is: for ETL'ed data, you can look but not touch.

    • In contrast, data that is replicated with RavenDB's External Replication Task does not overwrite existing documents.
      Conflicts are created and handled according to the conflict policy defined on the destination database.
      This means that you can change the replicated data on the destination database, and any resulting conflicts will be resolved.

  2. Data content:

    • With a Replication Task, all documents contained in the database are replicated to the destination database without any content modification.

    • In ETL, by contrast, the content of each document sent can be filtered and modified by the supplied transformation script.
      In addition, only part of the data can be sent, since specific collections can be selected.
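
The data ownership rule from item 1 can be illustrated with a small model of the destination database. This is a minimal Python sketch using a plain dictionary as a stand-in for the destination. The document ids, the field values and the `etl_write` helper are hypothetical. It shows why a direct edit to an ETL'ed document is lost on the next write, while a companion document survives.

```python
def etl_write(destination: dict, doc_id: str, doc: dict) -> None:
    """Model of an ETL write: always overwrite, with no conflict check."""
    destination[doc_id] = dict(doc)


destination = {}
etl_write(destination, "productsinfo/1",
          {"ProductName": "Chai", "SupplierName": "Exotic Liquids"})

# Discouraged: editing the ETL'ed document directly on the destination side.
destination["productsinfo/1"]["Note"] = "checked by QA"

# Recommended: keep destination-side data in a separate companion document.
destination["productsinfo/1/notes"] = {"Note": "checked by QA"}

# The source document changes and ETL writes it again.
etl_write(destination, "productsinfo/1",
          {"ProductName": "Chai Tea", "SupplierName": "Exotic Liquids"})
```

After the second write, the "Note" field added directly to productsinfo/1 is gone, while the companion document productsinfo/1/notes is untouched.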