Ongoing Tasks: Snowflake ETL



Creating a Task

A Snowflake ETL task can be created using Code or via Studio.

  • To create the task using Code:

    A code sketch is provided below this list.

  • To create the task using Studio:

    Find here a detailed look at the creation of a Snowflake ETL task using Studio.

    In short:
    • Open the Studio Settings -> Ongoing Tasks view and click the Snowflake ETL option.

      Add New Snowflake Task

    • Use the New Snowflake ETL view to define and save the new task.

      Define Snowflake Task
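
A minimal sketch of creating the task from client code is shown below. It assumes a Snowflake connection string named "snowflake-cs" was already defined; the task, table, and script names are placeholders.

var configuration = new SnowflakeEtlConfiguration
{
    Name = "orders-to-snowflake",           // placeholder task name
    ConnectionStringName = "snowflake-cs",  // name of a pre-defined connection string
    SnowflakeTables =
    {
        new SnowflakeEtlTable { TableName = "Orders", DocumentIdColumn = "Id" },
        new SnowflakeEtlTable { TableName = "OrderLines", DocumentIdColumn = "OrderId" }
    },
    Transforms =
    {
        new Transformation
        {
            Name = "orders-script",         // placeholder script name
            Collections = { "Orders" },     // source collection(s)
            Script = "loadToOrders({ Company: this.Company });"
        }
    }
};

// Deploy the task to the server
store.Maintenance.Send(new AddEtlOperation<SnowflakeConnectionString>(configuration));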

Snowflake warehouse setup

Before running the ETL task, create in the warehouse database the destination tables that RavenDB records will be transferred to.
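
E.g., tables matching the transformation script shown later in this article could be created with Snowflake DDL like the following (the table and column definitions are illustrative, not required names):

-- Illustrative destination tables; column names match the example script below
CREATE TABLE Orders (
    Id VARCHAR(50) NOT NULL PRIMARY KEY,
    OrderLinesCount INT,
    TotalCost NUMBER(18,2)
);

CREATE TABLE OrderLines (
    Id INT IDENTITY PRIMARY KEY,
    OrderId VARCHAR(50) NOT NULL,
    Qty INT,
    Product VARCHAR(255),
    Cost NUMBER(18,2)
);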

Snowflake Tables

Use the Snowflake Tables section to select which tables the ETL process is to transfer data to.
For each table, specify a column that will be used as the document ID column.
RavenDB will fill the document ID column with the source document IDs and use them to handle updates and deletes.

The column does not have to be a table's primary key.

Define Snowflake Tables

Performance improvement suggestions

  • Insert data directly:
    The ETL process performs document updates by issuing DELETE and INSERT statements to the warehouse database. If your system is append-only, you can boost ETL performance by setting Insert Only Mode to insert the data directly, without running a set of DELETE statements first.

    • Using code (take a look here to see the Snowflake table definition in its context):
      new SnowflakeEtlTable 
      {
          TableName = "Orders", 
          DocumentIdColumn = "Id", 
          
          // Set Insert Only Mode
          InsertOnlyMode = true
      },
    • Using Studio: see Define a Snowflake ETL Task.
  • Table indexes:
    To improve performance, you can define indexes on the destination tables, at least for the column used to hold the document ID.

Transformation Scripts

An ETL task can apply multiple transformation scripts.
Each script is defined per collection and cannot be empty.

loadTo Method

A transformation script is executed per document once the document is created or modified.
To load data to the destination table, call the loadTo<TableName>() method and pass it a user-defined JavaScript object.

The name of the table that the document is inserted into is indicated as part of the loadTo method name.
E.g., to write data into the OrderLines table, use the following call in the script:

loadToOrderLines({ ... });

You must pass the method a JavaScript object, e.g. -

// 'line' here stands for a single entry of the document's Lines array
// (see the complete script in "Loading to Multiple Tables" below)
loadToOrderLines({
    OrderId: id(this),
    Qty: line.Quantity,
    Product: line.Product,
    Cost: line.PricePerUnit
});

Alternative Syntax

The target table name can be passed to the loadTo command separately as a string argument, using the following syntax:
loadTo('Target', obj)

E.g., the following two calls to loadTo are equivalent.
loadToEmployees(this);
loadTo('Employees', this);

  • The target name 'Employees' in this syntax is not a variable and cannot be used as one; it is simply a string literal of the target's name.
  • Separating the target name from the loadTo command makes it possible to include symbols like - and . in target names. This is not possible with the standard loadToEmployees syntax, because special characters in the name of a JS function make it invalid.
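
E.g., a hypothetical table whose name contains a dot can only be addressed with the string-argument syntax:

// 'Order.Lines' is a hypothetical table name;
// a loadToOrder.Lines function name would be invalid JS
loadTo('Order.Lines', { OrderId: id(this) });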

Filtering

To filter documents out of the transfer or include them in it, apply logic of your choosing when calling loadTo.
E.g.,

if (this.ShipTo.Country === 'USA') {
    // load only orders shipped to USA
    loadToOrders({ ... });
}

Loading Other Documents

Use the load method to load a document with a specified ID during script execution.

var company = load(this.Company);
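
The loaded document can then be used in the transformation script. E.g., a sketch that enriches the target record with data from a referenced document, assuming Order documents hold a Company reference and Company documents have a Name property:

// Enrich the loaded record with data from the referenced Company document
var company = load(this.Company);
loadToOrders({
    OrderId: id(this),
    CompanyName: company.Name
});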

Accessing Metadata

Access metadata as follows:

var value = this['@metadata']['custom-metadata-key'];

Loading to Multiple Tables

The loadTo method can be called as many times as needed by a single script.
The following script demonstrates this by looping through the Lines array of an Order document, using consecutive loadTo calls to store each line's properties in the OrderLines database table.
The process is concluded with a final loadTo call, storing the lines' total cost in the Orders table.

var orderData = {
    Id: id(this),
    OrderLinesCount: this.Lines.length,
    TotalCost: 0
};

for (var i = 0; i < this.Lines.length; i++) {
    var line = this.Lines[i];
    orderData.TotalCost += line.PricePerUnit * line.Quantity;
    loadToOrderLines({
        OrderId: id(this),
        Qty: line.Quantity,
        Product: line.Product,
        Cost: line.PricePerUnit
    });
}
// Round the total cost to 2 decimal places
orderData.TotalCost = Math.round(orderData.TotalCost * 100) / 100;

loadToOrders(orderData);

Document Extensions

Counters, Time series, and Revisions are not supported by Snowflake ETL.

Loading Attachments

To store binary data that is kept as RavenDB attachments, use the loadAttachment() method.

E.g., given a destination table with a binary column:

CREATE TABLE Attachments
(
    Id INT IDENTITY PRIMARY KEY,
    OrderId VARCHAR(50) NOT NULL,
    AttachmentName VARCHAR(50),
    Data BINARY
);

you can define a script that loads the document's attachments:

var attachments = this['@metadata']['@attachments'];

for (var i = 0; i < attachments.length; i++) {
    var attachment = {
        OrderId: id(this),
        AttachmentName: attachments[i].Name,
        Data: loadAttachment(attachments[i].Name)
    };
    loadToAttachments(attachment);
}

Attachments can also be accessed using the getAttachments() helper function rather than grabbing them from metadata.
To check whether an attachment exists, use the hasAttachment(name) function.
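
E.g., the attachments loop above can be rewritten with these helpers; 'invoice.pdf' below is a hypothetical attachment name:

// Same as the metadata-based loop, using the getAttachments() helper
var attachments = getAttachments();

for (var i = 0; i < attachments.length; i++) {
    loadToAttachments({
        OrderId: id(this),
        AttachmentName: attachments[i].Name,
        Data: loadAttachment(attachments[i].Name)
    });
}

// Load a single attachment only if it is present
if (hasAttachment('invoice.pdf')) {
    loadToAttachments({
        OrderId: id(this),
        AttachmentName: 'invoice.pdf',
        Data: loadAttachment('invoice.pdf')
    });
}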

Transaction Processing

  • All records created in a single ETL run, one per loadTo call, are sent in a single batch and processed as part of the same transaction.

  • The ETL task issues an SQL INSERT statement for each document it loads to the warehouse database.
    The transaction is handled as an atomic unit: the inserted records are stored in the warehouse database by a COMMIT statement only when the transaction completes.

    If the ETL task is interrupted while the transaction is underway, e.g. due to a server failover or because the task was restarted for some other reason, the transaction is rolled back and the ETL task processes the interrupted batch from scratch, starting right after the last COMMIT.
    This may cause a delay for users of the destination database, who will see incoming records only after the transaction completes and is committed.

Advanced Options

Command timeout:

You can set the number of seconds after which an SQL command will time out.
Default: null (use provider default)

  • Set using Code:

    var snowflakeEtlConfig = new SnowflakeEtlConfiguration
    {
        Name = "task-name",
        ConnectionStringName = "snowflake-connection-string-name",

        // Set the command timeout (in seconds)
        CommandTimeout = 5
    };
  • Set via Studio:

    Advanced options

    Command timeout