Ongoing Tasks: OLAP ETL
The OLAP ETL task creates an ETL process from a RavenDB database to a variety of destinations that are especially useful for OLAP (Online Analytical Processing). These destinations currently include:
- Amazon S3
- Amazon Glacier
- Microsoft Azure
- Google Cloud Platform
- File Transfer Protocol
- Local storage
The data is encoded in the Apache Parquet format, an alternative to CSV that is much faster to query. Unlike CSV, Parquet groups the data according to its column (by field) instead of by row (by document).
Client API
Creating an OLAP ETL task through the client is very similar to creating a RavenDB or SQL ETL task. All cases use the AddEtlOperation. For OLAP you will need an OlapEtlConfiguration, which itself needs an OlapConnectionString. Their configuration options are listed below.
This is an example of a basic OLAP ETL creation operation:
AddEtlOperation<OlapConnectionString> operation = new AddEtlOperation<OlapConnectionString>(
    new OlapEtlConfiguration
    {
        ConnectionStringName = "olap-connection-string-name",
        Name = "Orders ETL",
        Transforms =
        {
            new Transformation
            {
                Name = "Script #1",
                Collections =
                {
                    "Orders"
                },
                Script = @"var orderDate = new Date(this.OrderedAt);
                           var year = orderDate.getFullYear();
                           var month = orderDate.getMonth();
                           var key = new Date(year, month);

                           loadToOrders(key, {
                               Company : this.Company,
                               ShipVia : this.ShipVia
                           })"
            }
        }
    });

AddEtlOperationResult result = store.Maintenance.Send(operation);
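
The connection string named by ConnectionStringName must be defined on the database before the task can run. The following is a minimal sketch of registering an OlapConnectionString with PutConnectionStringOperation, assuming a local folder destination; the folder path is a placeholder.

// A minimal sketch (placeholder names and path): register the OLAP connection string
// that the configuration above refers to, using a local folder as the destination.
var connectionString = new OlapConnectionString
{
    Name = "olap-connection-string-name",
    LocalSettings = new LocalSettings
    {
        FolderPath = @"C:\olap-output" // placeholder path
    }
};

PutConnectionStringResult connectionStringResult = store.Maintenance.Send(
    new PutConnectionStringOperation<OlapConnectionString>(connectionString));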
OlapEtlConfiguration

| Property | Type | Description |
| --- | --- | --- |
| RunFrequency | string | Takes a cron expression which determines how often the server will execute the ETL process. |
| CustomPartitionValue | string | A value that can be used as a partition name in multiple scripts. See below. |
| OlapTables | List<OlapEtlTable> | List of naming configurations for individual tables. See more details below. |
OlapConnectionString
The OLAP connection string contains the configurations for each destination of the ETL task.

| Property | Description |
| --- | --- |
| LocalSettings | Settings for storing the data locally. |
| S3Settings | Settings for an AWS S3 bucket. |
| GlacierSettings | Settings for an AWS Glacier vault. |
| AzureSettings | Settings for Azure. |
| GoogleCloudSettings | Settings for Google Cloud Platform. |
| FTPSettings | Settings for File Transfer Protocol. |
ETL destination settings
This is the list of the different settings objects that the OlapConnectionString object can contain.
LocalSettings

| Property | Type | Description |
| --- | --- | --- |
| FolderPath | string | Path to the local folder. If this property is not set, the data is saved to the location specified in the setting Storage.TempPath. If that setting has no value, the data is instead saved to the location specified in the setting CoreConfiguration.DataDirectory. |
FtpSettings

| Property | Type | Description |
| --- | --- | --- |
| Url | string | The FTP URL |
| Port | int | The FTP port |
| UserName | string | The username used for authentication |
| Password | string | The authentication password |
| CertificateFileName | string | The name of your local certificate file |
| CertificateAsBase64 | string | The certificate in base64 format |
S3Settings

| Property | Type | Description |
| --- | --- | --- |
| AwsAccessKey | string | The AWS access key |
| AwsSecretKey | string | The AWS secret key |
| AwsSessionToken | string | The AWS session token |
| AwsRegionName | string | The AWS server region |
| BucketName | string | The name of the S3 bucket that is the destination for this ETL |
| CustomServerUrl | string | The custom URL to the S3 bucket, if you have one |
| RemoteFolderName | string | Name of the destination folder within the S3 bucket |
| ForcePathStyle | bool | Change the default S3 bucket path convention on a custom S3 server |
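
As an illustration, here is a hedged sketch of an OlapConnectionString targeting S3, using only the properties listed above; the credentials, region, bucket, and folder values are placeholders. It is registered with PutConnectionStringOperation in the same way as the local-folder example shown earlier.

// A minimal sketch (placeholder values): an OLAP connection string with an S3 destination.
var s3ConnectionString = new OlapConnectionString
{
    Name = "olap-s3-connection-string",
    S3Settings = new S3Settings
    {
        AwsAccessKey = "<aws access key>",
        AwsSecretKey = "<aws secret key>",
        AwsRegionName = "us-east-1",     // placeholder region
        BucketName = "my-olap-bucket",   // placeholder bucket name
        RemoteFolderName = "olap/orders" // placeholder folder within the bucket
    }
};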
GlacierSettings

| Property | Type | Description |
| --- | --- | --- |
| AwsAccessKey | string | The AWS access key |
| AwsSecretKey | string | The AWS secret key |
| AwsSessionToken | string | The AWS session token |
| AwsRegionName | string | The AWS server region |
| VaultName | string | The name of your AWS Glacier vault |
| RemoteFolderName | string | Name of the destination folder within the Glacier vault |
AzureSettings

| Property | Type | Description |
| --- | --- | --- |
| StorageContainer | string | Microsoft Azure Storage container name |
| RemoteFolderName | string | Path to the remote Azure folder |
| AccountName | string | The name of your Azure account |
| AccountKey | string | Your Azure account key |
| SasToken | string | Your SAS token for authentication |
GoogleCloudSettings

| Property | Type | Description |
| --- | --- | --- |
| BucketName | string | Google Cloud Storage bucket name |
| RemoteFolderName | string | Path to the remote bucket folder |
| GoogleCredentialsJson | string | Authentication credentials for your Google Cloud Storage account |
OlapEtlTable
Optional, more detailed naming configuration.

| Property | Type | Description |
| --- | --- | --- |
| TableName | string | The name of the table. This should usually be the name of the source collection. |
| DocumentIdColumn | string | A name for the ID column of the table. Default: "_id" |
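
For example, the following sketch (with a hypothetical column name) renames the ID column of the Orders table through the OlapTables property of OlapEtlConfiguration:

// A minimal sketch: give the "Orders" table a custom document-ID column name.
var ordersConfiguration = new OlapEtlConfiguration
{
    ConnectionStringName = "olap-connection-string-name",
    Name = "Orders ETL",
    OlapTables = new List<OlapEtlTable>
    {
        new OlapEtlTable
        {
            TableName = "Orders",        // usually the name of the source collection
            DocumentIdColumn = "OrderId" // hypothetical name; the default is "_id"
        }
    }
    // Transforms omitted for brevity - see the creation example above
};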
ETL Run Frequency
Unlike other ETL tasks, OLAP ETL operates only in batches at regular intervals, rather than triggering a new round every time a document is updated.
Documents that were modified after the last ETL run (even if the updated data has not yet been loaded) can be distinguished by _lastModifiedTime, the value of the last-modified field in the document's metadata, expressed in Unix time. This field appears as an additional column in the destination tables.
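
The interval itself is set with the RunFrequency property described above. As a hedged sketch, the following cron expression (a placeholder, adjust as needed) would run the process at the start of every hour:

// A minimal sketch: run the OLAP ETL process once an hour (minute 0 of every hour).
var hourlyConfiguration = new OlapEtlConfiguration
{
    ConnectionStringName = "olap-connection-string-name",
    Name = "Orders ETL",
    RunFrequency = "0 * * * *" // standard cron expression
    // Transforms omitted for brevity
};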
Transform Script
Transformation scripts are similar to those in the RavenDB ETL and SQL ETL tasks - see more about them in ETL Basics. The major difference is that the data output by an OLAP ETL task can be divided into folders and child folders, called partitions. Querying the data usually involves scanning an entire folder, so dividing the data into more specific folders is more efficient: queries only need to scan the relevant partitions.
The key Parameter
As with other ETL tasks, the method that loads an entry to its destination is loadTo<folder name>(), but unlike the other ETL tasks, this method takes two parameters: the entry itself, and an additional 'key'. This key determines how many partitions there are and what their names are.
loadTo<folder name>(key, object)
The method's name determines the name of the parent folder that the method outputs to. If you want to output data to a folder called "Sales", use the method loadToSales(). The key parameter determines the names of one or more layers of child folders that contain the actual destination table.
The actual value that you pass as the key for loadTo<folder name>() is one of two methods:
- partitionBy() - creates one or more child folders (one inside the other).
- noPartition() - creates no child folders.
The child folders created by OLAP ETL are considered a sort of 'virtual column' of the destination table. This just means that all child folder names have the format [virtual column name]=[partition value], i.e. two strings separated by a =. The default virtual column name is _partition.
partitionBy() can take one or more folder names in the following ways:
- partitionBy(key) - takes a partition value and uses the default virtual column name _partition. The partition value can be a string, number, date, etc.
- partitionBy(['name', key]) - takes a virtual column name and a partition value as an array of size two.
- partitionBy(['name1', key1], ['name2', key2], ...) - takes multiple arrays of size two, each with a virtual column name and a partition value. Each pair represents a child folder of the preceding pair.
Here are examples of possible values for partitionBy(), and the resulting folder names:

loadToMyFolder(
    partitionBy('one'),
    object
)
// Loads the data to /MyFolder/_partition=one/

loadToMyFolder(
    partitionBy(['month', 'August']),
    object
)
// Loads the data to /MyFolder/month=August/

loadToMyFolder(
    partitionBy(['month', 'August'], ['day', '22'], ['hour', '17']),
    object
)
// Loads the data to /MyFolder/month=August/day=22/hour=17/

loadToMyFolder(
    partitionBy(this.Company),
    object
)
// Loads the data to e.g. /MyFolder/_partition=Apple/

loadToMyFolder(
    partitionBy(['month', new Date(this.OrderedAt).getMonth()]),
    object
)
// Loads the data to e.g. /MyFolder/month=8/
Alternative Syntax
The target folder name can be passed to the loadTo command separately, as a string argument, using this syntax: loadTo('folder_name', key, object)
- Example: The following two calls to loadTo are equivalent:

loadToOrders(key, object)
loadTo('Orders', key, object)

- The target name 'Orders' in this syntax is not a variable and cannot be used as one: it is simply a string literal of the target's name.
- Separating the target name from the loadTo command makes it possible to include symbols like - and . in target names. This is not possible with the standard loadToOrders syntax, because special characters make the name of a JS function invalid.
The Custom Partition Value
The custom partition value is a string value that can be set in the OlapEtlConfiguration object. It can be referenced in the transform script as $customPartitionValue. This setting gives you another way to distinguish data produced by different ETL tasks that use the same transform script.
Suppose you want to create multiple OLAP ETL tasks that all use the same transform script and connection string. All the tasks will output to the same destination folders, but you want to be able to tell which data came from which task. The custom partition value gives you a simple way to achieve this: all the tasks run the same script, and each task outputs its data to a destination folder whose name is determined by that task's custom partition value setting.
partitionBy(['source_ETL_task', $customPartitionValue])
In the case of multiple partitions, the custom partition value can be used more than once, and it can appear anywhere in the folder structure.
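
As a hedged sketch, two tasks could share the same script and connection string and differ only in this setting (the task names and values below are placeholders):

// A minimal sketch: two ETL tasks that run the same script, distinguished only by
// CustomPartitionValue, which the script reads as $customPartitionValue.
var taskA = new OlapEtlConfiguration
{
    Name = "Orders ETL - A",
    ConnectionStringName = "olap-connection-string-name",
    CustomPartitionValue = "task-a"
    // Transforms omitted for brevity - same script as task B
};

var taskB = new OlapEtlConfiguration
{
    Name = "Orders ETL - B",
    ConnectionStringName = "olap-connection-string-name",
    CustomPartitionValue = "task-b"
};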
Script Example
// Define the object that will be added to the table
var orderData = {
    Company: this.Company,
    RequireAt: new Date(this.RequireAt),
    ItemCount: this.Lines.length
};

// Create the partition names
var orderDate = new Date(this.OrderedAt);
var year = orderDate.getFullYear();
var month = orderDate.getMonth();

// Load to the folder: /OrderData/Year=<year>/Month=<month>/
loadToOrderData(partitionBy(['Year', year], ['Month', month]), orderData);
Athena Examples
Athena is a SQL query engine in the AWS environment that can both read directly from S3 buckets and output to S3 buckets.
Here are a few examples of queries you can run in Athena. But first, you need to configure the destination for your query results: go to Settings and, under "Query result location", enter the path to your preferred bucket.
Create a monthly_sales table from Parquet data stored in S3:
CREATE EXTERNAL TABLE mydatabase.monthly_sales (
`_id` string,
`Qty` int,
`Product` string,
`Cost` int,
`_lastModifiedTime` int
)
PARTITIONED BY (`dt` string)
STORED AS parquet
LOCATION 's3://ravendb-test/olap/tryouts/data/Sales'
Load all partitions:
MSCK REPAIR TABLE monthly_sales
Select everything in the table:
select *
from monthly_sales
Select specific fields:
select _id orderId, qty quantity, product, cost
from monthly_sales
Filter based on product name:
select *
from monthly_sales
where product = 'Products/2'
Filter based on date (this is where partitioning adds efficiency - only the relevant folders are scanned):
select *
from monthly_sales
where dt >= '2020-01-01' and dt <= '2020-02-01'
From all items sold, select the maximum cost (price) per order:
select _id orderId, max(cost) cost
from monthly_sales
group by _id
Query for the most recent version in an append-only table: select everything in the table and, where there are duplicates (multiple rows with the same id), take only the most recent version (the one with the highest _lastModifiedTime):
SELECT DISTINCT o.*
FROM monthly_orders o
INNER JOIN
(SELECT _id,
MAX(_lastModifiedTime) AS latest
FROM monthly_orders
GROUP BY _id) oo
ON o._id = oo._id
AND o._lastModifiedTime = oo.latest
Apache Parquet
Parquet is an open-source, column-oriented file format. Like ORC, it stores columns together instead of rows (the same field from multiple documents, rather than whole documents), which makes analytical queries more efficient.