Sharing Data and Making Friends with ETL
While talking about distributed work with RavenDB so far, we've focused primarily on the work that RavenDB is doing to replicate data between different nodes in the same cluster, either as part of the same database group or between different databases in potentially different clusters. This mode of operation is simple because you just set up the replication and RavenDB takes care of everything else. But there are other modes for distributing data in your systems.
External Replication assumes that you have another RavenDB instance and that you want to replicate everything to it. When replicating information, that's what you want, but we also need to be able to share only part of the data. ETL (extract, transform, load) is the process by which we take data that resides in RavenDB and push it to an external source, potentially transforming it along the way. That external source can be another RavenDB instance or a relational database.
Consider a microservice architecture and a "customer benefits service." This service decides what kind of benefits the customer has. It can be anything from free shipping to a discount on every third jug of milk. And the logic can be as simple as "this customer is in our membership club" and as complex as trying to compute airline miles. Regardless of how the customer benefits service works, it needs to let other parts of the system know about the benefits that this customer has. The shipping service, the help desk service and many others need to have that information.
At the same time, we really don't want them to poke around in the customer benefits database (or worse, have a shared database for everything). We could design an API between the systems, but then the shipping service would be dependent on the customer benefits service always being up. A better solution is to define an ETL process between the two services and have the customer benefits service publish updates for the shipping service to consume. Those updates are part of the public contract of those services, mind you. You shouldn't just copy the data between the databases.
Another example is the reporting database. RavenDB is a wonderful database for OLTP scenarios, but for reporting, your organization likely already has a solution. There's little need to replace that. But you can't just dump the data from RavenDB directly into a relational database and expect things to work. We need to transform the data as we send it to match the relational model.
For all of those needs, RavenDB has the notion of ETL processes. RavenDB has built-in ETL processes to another RavenDB instance and to a relational database (such as MS SQL Server, Postgres, Oracle, MySQL, etc.). Because RavenDB has native relational ETL, brownfield systems will typically start using RavenDB by replacing a single component at a time. RavenDB is used to speed up the behavior of high-value targets, but instead of replacing the whole system, we use ETL processes to write the data to the existing relational database. We'll cover that later in this chapter, discussing the deployment of RavenDB as a write-behind cache.
In most cases, the rest of the system doesn't even need to know that some parts are using RavenDB. This is using RavenDB as the write-behind cache. Some part of your application reads and writes to RavenDB directly, and RavenDB will make sure to update the relational system. We'll start by talking about ETL processes between RavenDB instances, because that lets us explore the whole ETL process without introducing another database technology.
ETL processes to another RavenDB instance
Any nontrivial system is going to have at least a few ETL processes, and RavenDB has a good way to handle those. The simplest ETL process between two RavenDB nodes is when we tell RavenDB we want to send just a single collection to the other side. We first need to configure the connection string we'll use. A RavenDB connection string defines the destination database and the URLs of that database group's nodes. Since there can be multiple URLs, it's more convenient to keep them all in a single location.
Go to Settings and then to Connection Strings and create a new RavenDB connection string. As you can see in Figure 8.1, I've defined a connection string to the helpdesk database in the live-test instance.
With the connection string defined, we can now go ahead and build the actual ETL process to the remote instance. Go to Settings and then to Manage Ongoing Tasks. Click the Add Task button and then select RavenDB ETL. You can see how this looks in Figure 8.2. Give the task a name and select the previously defined connection string.
What about security?
We'll cover security in depth in Chapter 13, but given that we've shown a connection string, I wanted to address a not-so-minor issue that you probably noticed: there's no place here to define credentials. This is because we're talking to another RavenDB server, and for that we use X.509 certificates.
In practice, this means that inside the same cluster, ETL processes don't need any special configuration. (Nodes within the same cluster already trust one another.) Outside the cluster, you'll need to register the cluster's certificate with the remote cluster to allow the ETL process to work. You can read the full details of that in the security chapter.
Now we need to let RavenDB know what will be ETL'ed to the other side. We do that by defining scripts that control the ETL process. Click Add New Script and give it a name, such as "Employees to Helpdesk." Then select the Employees collection below, click Add Collection and then Save. The result should look like Figure 8.3.
Because we didn't specify anything in the ETL script, RavenDB will just copy the entire collection as is to the other side. That's useful on its own because it allows us to share a single collection (or a few collections) with other parties with very little work.
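To make that concrete, here's a hedged sketch of what the empty configuration is roughly equivalent to for the Employees collection; this isn't the literal internal implementation, just the observable behavior:

// Roughly what "no script" means: the whole document is sent as-is
// to the matching collection on the destination.
loadToEmployees(this);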
ETL is a database task, with bidirectional failover
In the previous chapter, we learned about database tasks and how the cluster will distribute such work among the different database instances. If a node fails, then the responsibility for the ETL task will be assigned to another node.
Take note that in cases where we ETL to another RavenDB instance, we also have failover on the receiving end. Instead of specifying a single URL, we can specify all the nodes in the remote cluster. If one of the destination nodes is down, RavenDB will just run the ETL process against another node in the database group topology.
It's very important to remember that ETL is quite different from replication. When a RavenDB node performs ETL to another node, it's not replicating the data — it's writing it. In other words, there are no conflicts and thus no attempts to handle conflicts. Instead, we'll always overwrite whatever exists on the other side. As far as the destination node is concerned, the ETL process is just another client writing data to the database. That's done because the data is not the same. This is important because of a concept we haven't touched on so far: data ownership.
Data ownership in distributed systems
One of the key differences between a centralized system and a distributed system is that, in a distributed environment, different parts of the system can act on their local information without coordination from other parts of the system. In a centralized system, you can take a lock or use transactions to ensure consistency. But in a distributed system, that's not possible, or it's prohibitively expensive to do so. Because of this limitation, the concept of data ownership is very important.
Ideally, you want every piece of data to have a single well-defined owner and to only mutate that data through that owner. This allows you to put all the validation and business rules in a single place and ensure overall consistency. Everything else in the system will update that data through its owner.
A database group, for example, needs to handle the issue of data ownership. Conceptually, it's the database group as a whole that owns the data stored in it. However, different database instances can mutate the data. RavenDB handles that using change vectors and conflict detection and resolution. That solution works for replication inside the database group because all the nodes in the group share ownership of the data. But it doesn't work for ETL.
The ETL source is the owner of the data, and it distributes updates to interested parties. Given that it's the owner, the ETL source is expected to simply overwrite the destination with the latest version it has. So if you made any modifications to the ETL'ed data, they would be lost. Instead of modifying the ETL'ed data directly, you should create a companion document in the destination database, one that you own. In other words, for ETL'ed data, the rule is that you can look but not touch.
An example of such a companion document is when you have ETL for help desk system users. The users/123-B document is owned by the users database, and the help desk system will store all the information it needs about the user in the users/123-B/helpdesk document, ensuring there's no contention over the ownership of documents.
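To illustrate (the collection name and fields here are hypothetical), such a companion document might look like this:

{
    "PreferredAgent": "agents/7-A",
    "OpenTickets": 2,
    "Notes": "Prefers to be contacted by email",
    "@metadata": {
        "@collection": "HelpdeskUserInfos",
        "@id": "users/123-B/helpdesk"
    }
}

The users/123-B document can be overwritten by the ETL process at any time, but users/123-B/helpdesk is touched only by the help desk service itself.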
So far, we've only done ETL at the collection level, but we can also modify the data as it's going out. Let's see how we can do that — and why we'd want to.
ETL Scripts
Sometimes you don't want to send full documents through the ETL process. Sometimes you want to filter them or modify their shape. This matters, since ETL processes form part of your public interface. Instead of just sending your documents to a remote destination willy-nilly, you'll typically only send data you're interested in sharing, and in a well-defined format.
Consider the employees we sent over the wire. We sent them as is, potentially exposing our own internal document structure and making it harder to modify in the future. Let's create a new ETL process that will send just the relevant details and add a script for sending redacted employee information over the wire. You can see what this looks like in Figure 8.4.
When we're using an ETL script to modify what's sent, it's essential to take into account that RavenDB will send only what we told it to. This seems obvious, but it can catch people unaware. If you don't have a script, the data sent to the other side will include attachments and go to the same collection as the source data.
However, when you provide your own script, you need to take responsibility for this yourself. Listing 8.1 shows a slightly more complex example.
Listing 8.1 ETL script for sending redacted employee information
let managerName = null;
if (this.ReportsTo !== null) {
    let manager = load(this.ReportsTo);
    managerName = manager.FirstName + " " + manager.LastName;
}

loadToEmployees({
    Name: this.FirstName + " " + this.LastName,
    Title: this.Title,
    BornOn: new Date(this.Birthday).getFullYear(),
    Manager: managerName
});
The script in Listing 8.1 will send the employees, their title, their birth year and their manager over to the other side. You can see that the script is actually full-blown JavaScript that allows you complete freedom in how you extract the data to load into the remote server. A word of caution is required about using functions such as load in this context, though. While this will work just fine, the referencing document will not be updated if the referenced document has been updated. In other words, if the manager's name is updated, that will not trigger an update to the employees who report to this manager.
A good solution is to limit yourself to just the data from that particular document. That makes it easy to ensure that whenever the document is changed, the ETL process will reflect that change completely on the other side.
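As a hedged sketch, here's Listing 8.1 reworked along those lines: instead of calling load to resolve the manager's name, we send only the reference and leave the lookup to the consuming side.

// A sketch: no load() calls, so the output depends only on this document.
loadToEmployees({
    Name: this.FirstName + " " + this.LastName,
    Title: this.Title,
    BornOn: new Date(this.Birthday).getFullYear(),
    Manager: this.ReportsTo // the reference, not the resolved name
});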
Resetting the ETL process after update
It's common to test out the ETL process as you develop it, but by default, updates to the ETL script will not be applied to documents that were already sent. This is done to avoid an expensive reset that would force RavenDB to send all the data all over again for a minor change. You can use the "Apply script to documents from beginning of time" option during the script update, as shown in Figure 8.5, to let RavenDB know that it needs to start the ETL process for this script from scratch, rather than apply the update only to new or updated documents.
Looking on the other side, you'll be able to see the ETL'ed document, as shown in Listing 8.2.
Listing 8.2 ETL'ed document on the destination
{
    "BornOn": 1966,
    "Manager": "Steven Buchanan",
    "Name": "Anne Dodsworth",
    "Title": "Sales Representative",
    "@metadata": {
        "@collection": "Employees",
        "@change-vector": "A:84-4Xmt8lVCrkiCDii/CfyaWQ",
        "@id": "employees/9-A",
        "@last-modified": "2017-12-04T12:02:53.8561852Z"
    }
}
There are a few interesting things in the document in Listing 8.2. First, we can see that it has only a single change vector entry (for the destination database). Also, the last modified date is when it was written to the destination, not when it was updated on the source.
Multiple documents from a single document
Another interesting ETL example is when we want to push multiple values out of a single document, as shown in Listing 8.3.
Listing 8.3 Sending multiple documents from a single source document in ETL
loadToEmployees({
    Name: this.FirstName + " " + this.LastName,
    Title: this.Title,
    BornOn: new Date(this.Birthday).getFullYear()
});
loadToAddresses({
    City: this.Address.City,
    Country: this.Address.Country,
    Address: this.Address.Line1
});
The results of the script in Listing 8.3 can be seen in Figure 8.6. You can see that the Employees documents were sent, but so were the Addresses documents. For those, we use the prefix of the source document ID to identify them.
An important consideration for sending multiple documents from a single source document is that on every update to the source document, all the documents created from this one are refreshed. In addition, you don't have control over the IDs being generated and shouldn't assume that they're fixed.
Attachments and Revisions with RavenDB ETL
Attachments are sent automatically over the wire when you send a full collection to the destination. If you do use a script, there's currently no way to indicate that attachments should also be sent. This feature is planned, but it wasn't completed in time for the 4.0 RTM release. Revisions are also not sent automatically, and another upcoming feature is support for ETL processes on top of the revision data, similar to how it's possible to use subscriptions with the current and previous versions of the document.
Use cases for ETL between RavenDB instances
When you have a complex system, composed of more than a single application, it's considered bad form to just go peek inside another application's database. Such behavior leads to sharing way too much between the applications and will require constant coordination between them as you develop and deploy them. A boundary between applications is required to avoid such issues.
Shared Database Integration Anti-Pattern
This kind of behavior is called shared database integration and is considered an anti-pattern. For more information on why you should avoid such a system, I refer you to Martin Fowler's post on the matter and in particular to the summary: "most software architects that I respect take the view that integration databases should be avoided."
One way to create such a boundary is to mandate that any time an application needs some data from another application, it will ask that application. In concrete terms, whenever the help desk system needs to look up a user, it will go to the user's application and ask it to get that user's data. This is often referred to as a service boundary.
The problem with such a system is that many interactions inside a particular service require information owned by another service. Any support ticket opened by a user will require a call from the help desk service to the user management service for details and updates. As you can imagine, such a system still requires a lot of work. In particular, even though we have a clear boundary between the services and the division of responsibility between them, there's still a strong temporal coupling between them.
Taking down the user management service for maintenance will require taking down everything else that needs to call to it. A better alternative in this case is to not rely on making remote calls to a separate service but to pull the data directly from our own database. This way, if the user management service is down, it doesn't impact operations for the help desk service.
Note a key difference here between this type of architecture and the shared database model. You don't have a single shared database. Instead, the help desk database contains a section that's updated by the user management service. In this manner, the ownership of the data is retained by the user management service, but the responsibility for maintaining it and keeping it up is that of the help desk service.
If the user management service is taken down for maintenance, it has no impact on the help desk service, which can continue with all operations normally. The design of the ETL processes in RavenDB is meant to allow such a system to be deployed and operated with a minimum of hassle. That's also partly why the ownership rules and responsibility for changes is built the way it is.
ETL is explicitly about sending data you own to a third party that's interested in it but doesn't own it. As such, any change you make will, by necessity, overwrite any local changes in the destination. If you're interested in a shared ownership model, ETL is not the method you should use. Rather, you should choose external replication, discussed in the previous chapter.
Modeling concerns for ETL processes
An important aspect of using ETL processes in a multi-service environment is the fact that the ETL process itself is part of the service contract that needs to be deployed, versioned and managed as such. In particular, the format of the documents that are sent via the ETL process compose part of the public interface of the service. Because of this, you should think carefully about the shape of the data you expose and the versioning considerations around that.
Listing 8.1 is a good example of exposing just enough information to be useful to the other side without leaking implementation details or other aspects that may change over time. The ETL process design accommodates for the fact that you may have different processes and different outputs for different destinations. In this way, you may collaborate with another service to update your contract while maintaining the same behavior for all others.
Another option is to only allow additive changes; adding a property would be fine but removing one wouldn't. That usually works, but unfortunately Hyrum's Law applies. Even such innocuous changes can break a third-party application.
Controlled exposure of data via ETL
Beyond using ETL for disseminating documents between services, there are a few other scenarios in which it can be useful. A typical usage scenario for ETL is sending data from production (after redacting any sensitive information) to UAT or CI instances, allowing you to test against realistic data sizes and real-world data.
Sending redacted data from production to UAT is just one application of a larger concept: controlling the degree to which you expose data to outside parties. We'll discuss authorization in detail in Chapter 13, but for now, know that RavenDB allows you to define permissions at the database level.
Sometimes, you need to apply such permissions on a per-collection or a per-document level. In some cases, the rules apply per field, or even depend on the exact data in the document, to control who sees what. The ETL process in RavenDB offers a nice way to manage that using the customization scripts. You can select exactly which collections you want to send, and you're free to decide what data to send and what data to hold back.
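For instance, here's a hedged sketch of such a script; the Internal flag is hypothetical, but it shows the filtering and redaction pattern:

// A sketch, assuming a hypothetical Internal flag on the document.
// Documents are filtered out simply by never calling a loadTo* function.
if (this.Internal !== true) {
    loadToEmployees({
        Name: this.FirstName + " " + this.LastName,
        Title: this.Title
        // sensitive fields (home phone, salary, etc.) are simply not sent
    });
}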
The target database in this case will usually be another database in the same cluster, which you'll allow access to based on your own internal policies. This gives the operations team a lot of freedom in designing and implementing data exposure processes and policies.
Case study: ETL usage in point-of-sales systems
Another ETL usage example is data aggregation. One use case we've seen many times is the embedding of RavenDB inside a larger application. In this manner, each instance of the application also has its own instance of RavenDB. Consider a point-of-sales system in a supermarket, running its own copy of the store management application and talking directly to its own local database.
In such a case, we want to send some of the information we have (the new sales generated on that particular POS system) to the central server. At the same time, we have the central server send updates to each of the POS systems with new prices, products, etc. However, there's no need to send all the data back to the server or to have each POS system contain all the sales across the system. Figure 8.7 shows the data flow between the various components of the system in such a scenario.
The central server in the store will send updates using external replication to all the POS machines. In turn, whenever there's a new sale, the POS will use an ETL process to update the central server in the store, telling it about the new sales that were rung up on the machine. You can also imagine the same architecture writ large when you zoom out, with the central server of the supermarket chain updating each store in turn and aggregating all the sales across its stores.
Failover and recovery with ETL in distributed systems
The ETL feature in RavenDB, like any other feature, was designed with the knowledge that networks fail, have outages and can slow you down. In general, networks are not something that you can rely on. The ETL process was designed to be like the replication process in that it's resilient in the face of such issues.
In the worst case scenario, when the ETL process has no way to communicate with the other side, it will wait until the destination is reachable again and proceed from where it left off. In other words, it's a fully offline, async process. If the other side is not responding for any reason, we'll catch up when we can.
That is the worst case, but we can usually do better. An ETL process is an ongoing task in the cluster, and that means that while the task is assigned to a single node, the cluster as a whole is still responsible for it. If the node that was assigned to the ETL process is down for whatever reason, the cluster will move the task assignment to another node, and the ETL process will proceed from there.
Being robust on just the sender side is all well and good, but we're also robust on the receiving end. Part of defining the RavenDB connection string in Figure 8.1 was to add the URL of the destination. In this case, we have just one server. But if we have a cluster on the other side, we can list all the nodes in the cluster, and RavenDB will ensure that even if a node goes down on the other end, the ETL process will proceed smoothly.
Sending data to a relational database
In addition to supporting ETL processes to another RavenDB instance, there's also support for the ETL process to relational databases (Microsoft SQL, Oracle, PostgreSQL, MySQL, etc.). The idea here is that RavenDB holds the master copy of the data and wants to send the data to a relational database. This is desirable for many reasons.
You probably already have a reporting infrastructure in your organization. Instead of having to build one from scratch for use with your data inside of RavenDB, you can just let RavenDB schlep all that data to a relational database and use your pre-existing infrastructure. It's also common to use this feature during migrations from a relational database to RavenDB. Instead of trying to do it in a big bang sort of way, you'll slowly move features away from the relational database; but you'll use the ETL feature in RavenDB to make sure that the rest of the system is not actually aware that anything has changed.
Finally, a really interesting deployment model for RavenDB is as a write-behind cache. Because of RavenDB's speed and the different model, it's often much faster than a relational database for many types of queries. That makes it ideal for the kind of user-facing pages that need speedy reactions. Instead of moving the entire application to RavenDB, you can move just the few pages that have the most impact for users.
In many cases, you can use RavenDB in this manner as a read-only cache, albeit one that's much smarter than the usual cache and has excellent querying capabilities. But it often makes sense to use RavenDB for writes as well, if the scenario demands it. And in this case, you'll write directly to RavenDB and let RavenDB write to the relational database behind the scenes. That can be very helpful if your relational database is struggling under the load since a large portion of it will now be handled by RavenDB instead.
The impedance mismatch strikes back
Translating between the document model and the relational model isn't trivial. Let's use the sample data as an example. We have the notion of an Order document, which contains an array of Lines. However, there's no good way to represent such an entity in a single table in the relational model. In order to bridge the gap between the models, we can use the ETL script to transform the data. Instead of sending it to a single table, we'll send the data from a single document to multiple tables.

The concept of data ownership can also complicate things. As with RavenDB ETL, we need to be the owners of the data. In particular, RavenDB implements updates to the data via DELETE and INSERT calls, not via UPDATE. This is because we don't know the current state of the relational data, and we need to make sure we have a clean slate for each write.

That means you may need to relax the foreign keys on the destination tables, because they may be violated temporarily during the ETL operation, until RavenDB makes it all whole again. If you're using Oracle or PostgreSQL, you can mark such constraints as deferrable so they'll be checked only on commit. If you're using a database that doesn't support this feature (such as Microsoft SQL Server), you'll likely need to forgo foreign key constraints for the tables that RavenDB is writing to.
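As a sketch of what that looks like in PostgreSQL (the constraint name here is illustrative), marking the foreign key as deferrable pushes the check to commit time, after RavenDB has finished its DELETE/INSERT sequence:

-- Illustrative PostgreSQL DDL: the constraint is only checked at commit,
-- so the ETL process can delete and re-insert rows within a transaction.
ALTER TABLE OrderLines
    ADD CONSTRAINT fk_orderlines_orders
    FOREIGN KEY (OrderId) REFERENCES Orders (OrderId)
    DEFERRABLE INITIALLY DEFERRED;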
We'll start with a simple ETL process for Employees as a way to explore the SQL ETL feature. Before we can start, we need to create the appropriate table in Microsoft SQL Server. You can see the table creation script in Listing 8.4.
Listing 8.4 Table creation script for the sample ETL process
CREATE TABLE Employees
(
    DocumentId NVARCHAR(50) NOT NULL PRIMARY KEY,
    Name NVARCHAR(50) NOT NULL,
    Birthday DATETIME NOT NULL
)
Once you have created this table in the relational database, go into the RavenDB Studio, then to Settings and then to Connection Strings. Create a new SQL connection string, as shown in Figure 8.8. You'll need to provide your own connection string. Note that you can test that the connection string is valid directly from the Studio.
Note the identity issue
In Figure 8.8, you can see a SQL connection string that uses Integrated Security=true. This is possible because the user that RavenDB is running under has permissions for the database in question. In this case, it's running as if it were in debug mode under my own user. In production, RavenDB will often run as a service account, and you'll either need the operations team to allow the service account access to the relational database or use a username and password to authenticate.
You can now go to Settings and then Manage Ongoing Tasks. Click Add Task and select SQL ETL. You can name the ETL process and select the appropriate connection string to use here. Before we can really start, we need to let RavenDB know which tables are going to be used in the ETL process and which column in each destination table is going to hold the document ID. Note that this doesn't have to be the primary key of the table in question. It just needs to be a column RavenDB can use to place its document ID in. You can see such a configuration in Figure 8.9.
Select the Employees collection in the transformation script section on the right and use the script from Listing 8.5.
Listing 8.5 Simple ETL script to send employees to SQL
loadToEmployees({
    Name: this.FirstName + " " + this.LastName,
    Birthday: this.Birthday
});
Click Add on the transformation script and then Save for the ETL process, then head into your SQL database to inspect the results. You can see the output of select * from Employees in Figure 8.10.
You can now play with the data, modifying and creating employees as you wish. RavenDB will make updates to the relational database whenever a document is changed. This is done in an async and resilient manner. If the relational database is slow or not responding for any reason, it won't impact operations for RavenDB and we'll retry the ETL operation until we're successful.
Multi-table SQL ETL processes
The example in Listing 8.5 is pretty simple. We have a one-to-one match between the documents and the rows that represent them. We did have a bit of fun with the concatenation of the first and last name into a single field, but nothing really interesting was going on there. Let's tackle a more complex problem: sending the Orders documents to the relational database.

Here, we can't just rely on a one-to-one mapping between a document and a row. An Order document can contain any number of lines, and we need to faithfully represent that in the SQL destination. We'll start by first defining the tables that we need on the receiving end, as shown in Listing 8.6.
Listing 8.6 Table creation script for multiple tables ETL process
CREATE TABLE Orders
(
    OrderId NVARCHAR(50) NOT NULL PRIMARY KEY,
    Company NVARCHAR(50) NOT NULL,
    OrderedAt DATETIME NOT NULL,
    Total FLOAT NOT NULL,
    ShipToCountry NVARCHAR(50) NOT NULL
)

CREATE TABLE OrderLines
(
    OrderLineId INT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
    OrderId NVARCHAR(50) NOT NULL INDEX IX_OrderLines_OrderId CLUSTERED,
    Price FLOAT NOT NULL,
    Product NVARCHAR(50) NOT NULL,
    Quantity INT NOT NULL
)
Once we've created the tables in Listing 8.6, we can proceed to actually define the ETL process for orders. First, we need to add the new tables, as you can see in Figure 8.11.
Once we've configured the additional tables for the ETL process, we can get started on sending the data over. We need to let RavenDB know both that we're going to be using these two tables and that it needs to send multiple values to the OrderLines table. This turns out to be easy to do. The reason we need to define the tables upfront for SQL ETL is that we're generating matching functions to be called from the script. In the case of the script that sends the Orders to the relational database, these are loadToOrderLines and loadToOrders, as you can see in Listing 8.7.
Listing 8.7 ETL Script for sending the Orders documents to the Orders and OrderLines tables
let total = 0;
for (let i = 0; i < this.Lines.length; i++) {
    let line = this.Lines[i];
    total += line.PricePerUnit * line.Quantity * (1 - line.Discount);
    loadToOrderLines({
        Quantity: line.Quantity,
        Product: line.Product,
        Price: line.PricePerUnit
    });
}
loadToOrders({
    Company: this.Company,
    OrderedAt: this.OrderedAt,
    ShipToCountry: this.ShipTo.Country,
    Total: total
});
The script in Listing 8.7 iterates over the Lines in the order, telling RavenDB to send each line object to the OrderLines table while also computing a running total for the order. Lastly, it calls loadToOrders with the final tally for the Orders table. RavenDB will actually call into the relational database and update it only when the script completes.
You might have noticed that in Listing 8.7 and Listing 8.5 we didn't specify the OrderId or DocumentId columns. That's because we implicitly defined those when we set up the tables that the SQL ETL process will use.
The care and feeding of production-worthy SQL ETL processes
With RavenDB ETL, there isn't a lot that you need to verify to get things right. But with SQL ETL, there are a few common pitfalls that you need to be aware of. Probably the most important one is the notion of indexes. Unlike RavenDB, relational databases usually don't learn from experience; they require you to define indexes explicitly. You should define indexes on the relevant tables, at the very least on the column used to hold the document ID.
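In Listing 8.4 the DocumentId column is already the primary key, so it's covered there. For a table where the document ID column isn't the primary key, an explicit index would look something like this sketch (the table name is illustrative):

-- Illustrative: index the column holding the document ID so that the
-- DELETE statements issued on every update don't turn into table scans.
CREATE INDEX IX_Shipments_DocumentId ON Shipments (DocumentId);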
Whenever RavenDB updates a document, it issues a set of DELETE statements for each of the relevant tables. For example, such DELETE statements might look like the code in Listing 8.8.
Listing 8.8 The first step when updating a document is to delete its existing data
DELETE FROM OrderLines WHERE OrderId = 'orders/830-A';
DELETE FROM Orders WHERE OrderId = 'orders/830-A';
As you can imagine, such statements are sent frequently during normal operations. That means that the relational database is going to be doing a lot of queries on the OrderId field (which is the one RavenDB is using to store the document ID). If the field in question isn't indexed, that can cause a lot of table scans, hurting the performance of the relational server and slowing down the ETL process.
If you look at the table creation script in Listing 8.6, you'll note that we explicitly handled that. In the Orders table, the OrderId field is marked as the primary key (which is clustered by default on SQL Server). This means that searches on that field are going to be very fast. With the OrderLines table, the situation is a bit more complex.
Here, we can't use the OrderId as the primary key because a single order has multiple lines. Instead, we define a throwaway primary key using identity, but we mark it as NONCLUSTERED and define a clustered index on the OrderId. This kind of setup is ideal for the RavenDB ETL process. All deletes and writes from RavenDB will always use the clustered index, which minimizes the amount of operations the relational database needs to perform.
Append-only systems
Sometimes you have a system that's built to never delete data; in fact, a regulator might require you to keep all data. In such cases, you can configure the SQL ETL process to use inserts only. This is configurable on a per-table basis. And it can also be a performance boost in some cases because we can directly insert the data without first running a set of DELETE statements to clean up the previous incarnation of the document we're sending over.
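To contrast the two modes, here's a hedged illustration (the values are made up) of what a single document update might issue in the default mode versus insert-only mode:

-- Default mode: clean slate, then re-insert the current state.
DELETE FROM OrderLines WHERE OrderId = 'orders/830-A';
INSERT INTO OrderLines (OrderId, Product, Price, Quantity)
VALUES ('orders/830-A', 'products/1-A', 10.0, 2);

-- Insert-only mode: append the new state; nothing is deleted first.
INSERT INTO OrderLines (OrderId, Product, Price, Quantity)
VALUES ('orders/830-A', 'products/1-A', 10.0, 2);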
Errors and troubleshooting
As usual, we can't close a topic without discussing the error-handling strategy. Just like the RavenDB ETL process, the SQL ETL process is a task at the cluster level that's always assigned to a node. If the node in question fails, the cluster will assign the task to another node, which will proceed from the same place the previous node stopped at.
There are two more general classes of issues that need addressing, the first of which is when the relational database cannot be accessed, for whatever reason. If the relational database is down, too slow to respond or inaccessible, that would halt the entire ETL process and cause RavenDB to keep retrying the process until we've successfully connected to the relational database.
Another type of error is when we fail to complete the ETL process for a single statement. For example, in Listing 8.5, we're sending a Name field that's composed of the FirstName and LastName of an employee. If the length of an employee's name exceeds 50 characters (as defined in the table creation script in Listing 8.4), we'll not be able to process that document. In such a case, when there's an error that pertains only to a particular document, RavenDB will proceed with the ETL process rather than halting it entirely.
An alert will be raised and the admin will need to resolve the issue, either by limiting the size of the name we send or by extending the size of the column on the relational table. The example of the column length is a simple one and illustrates the issue quite nicely. There is a non-recoverable error in the document that prevents us from completing the ETL process for that particular document but doesn't halt it for others. Other examples include a violation of not-null constraints and violations of foreign keys.
Because they don't halt the ETL process, it can be easy to miss the fact that some documents haven't been sent over. RavenDB generates an alert when this type of error occurs. You can see such an error by going to one of the Employees documents, increasing the size of the FirstName field until it's over 50 characters and saving. RavenDB will attempt to send the update over but will fail, and you'll get the error shown in Figure 8.12.
The error-handling strategy RavenDB uses in such a scenario is based on the assumption that as long as we're making progress, it's better to alert than to fail completely. But if the number of errors exceeds a certain threshold, we consider the entire ETL process to be failing and stop processing it until an admin tells us otherwise. The reasoning behind this behavior is that we want to ensure proper progress, but we don't want to limp along, failing on each and every document that we try to send.
Handling deletes in ETL
In both RavenDB ETL and SQL ETL, dealing with document changes is actually relatively easy. What's more complex is handling deletes. This is because we need to propagate the delete to the other side, but the delete is an absence of something. So how do we know what isn't there?
RavenDB needs to deal with this problem in many different forms. It's relevant for replication between databases (both inside the same cluster and external replication between clusters), for ETL processes, for indexes and even for backups. In all cases, this is handled via the tombstone concept. A tombstone is an entry that marks the absence of a document.
Usually you don't care about tombstones; they're an internal implementation detail. For ETL, what you need to know is that a deletion of a document will trigger the ETL process as usual, and it will remove the matching records on the destination side.
Summary
In this chapter, we went over the ETL processes supported by RavenDB. An ETL process allows you to take the documents residing in RavenDB and send them over to another location. This can be another RavenDB node or even a relational database. Along the way, you have the chance to mutate and filter the data, which opens up a lot of avenues for interesting use cases.
We looked at a few such examples. One was relying on RavenDB ETL to distribute a set of records to related services so they'll have their own copy of the data and can access it without making a cross-service call. In this case, the ETL process is part and parcel of the service public interface and needs to be treated as such. Another option is to provide partial views of the data. Access to production-grade data can be given to developers without allowing them to be exposed to details that should remain private, for example. We also looked at the reversed example, seeing how we can send data from multiple locations (the point-of-sales systems) to a central server that will aggregate all that data.
RavenDB also supports automatic ETL processes to a relational database. You can use scripts to decide how RavenDB will transform the document model into the table model, including non-trivial logic. This allows you to use RavenDB in a number of interesting ways:
- As an OLTP database, while your reports are handled by the ETL target.
- As a cache for both reads and writes, with RavenDB sending updates to the backend relational database.
- As part of a migration strategy, where you move different segments of your applications to RavenDB while others continue to operate with no change.
We looked at some of the details of the ETL processes: how they handle failure and recovery and what kind of response you should expect from them. With ETL processes between RavenDB instances, you get failover on both the source and destination, with each cluster being able to route the ETL process to the right location even if some of the nodes have failed. With SQL ETL, a RavenDB node going down will cause the cluster to move the task to another node, while a problem with the relational database will halt the ETL process until the remote database is up and running.
RavenDB employs a sophisticated error-handling strategy with regard to errors writing to relational databases, trying to figure out whether a particular error is transient and relevant to a single record or should impact the entire ETL process. In the first case, we'll alert you about the error and continue forward; in the latter, after alerting, we'll stop the ETL process until an admin has resolved the issue.
We looked into some of the more common options for utilizing ETL processes in your system, and I tried giving you a taste of the kinds of deployments and topologies that are involved. It's important to note that this is just to give you some initial ideas. The ETL features are quite powerful, and they can be used to great effect in your environment. I encourage you to think about them not just as a feature to be used tactically but in terms of how they play into the grand architecture of your system.
In the next part of the book, we're going to start on an exciting topic: queries and indexes in RavenDB. This has been a long time coming, and I'm almost more excited than you are at this point.