Cluster: Cluster-Wide Transactions



Why Cluster-Wide Transactions

By default, RavenDB uses a multi-master model: a transaction is applied on a single node first and then asynchronously replicated to the other members of the cluster. This ensures that even in the presence of network partitions or hard failures, RavenDB can accept writes and safely keep them.

The downside of the multi-master model is that certain error modes can cause two clients to modify the same set of documents on two different database nodes. That can cause conflicts and makes it hard to provide certain guarantees to the application, such as ensuring the uniqueness of a user's email in a distributed cluster. Just checking for the existence of the email is not sufficient: two clients may be talking to separate database nodes, and both of them may check that the user does not exist. Each will then create what ends up as a duplicate user.

To handle this (and similar) scenarios, RavenDB offers the cluster-wide transaction feature. It allows you to explicitly state that you want a particular interaction with the database to favor consistency over availability to ensure that changes are going to be applied in an identical manner across the cluster even in the presence of failures and network partitions.

In order to ensure that, RavenDB requires that a cluster-wide transaction be accepted by at least a majority of the voting nodes in the cluster. If it cannot gather such a majority, the cluster-wide transaction will fail.

For the rest of this document we distinguish between single-node transactions, which are applied on a single node and then disseminated using async replication, and cluster-wide transactions, which are accepted by a majority of the nodes in the cluster and then applied on each of them.
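
For example, the following minimal sketch (using the .NET client) enforces email uniqueness through a cluster-wide transaction. The User class, server URL, database name, and key layout are illustrative assumptions, not part of the API:

    using Raven.Client.Documents;
    using Raven.Client.Documents.Session;

    using (var store = new DocumentStore
    {
        Urls = new[] { "http://localhost:8080" }, // assumed server URL
        Database = "Users"                        // assumed database name
    }.Initialize())
    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        session.Store(new User { Email = "jane@example.com" }, "users/jane");

        // The compare-exchange value acts as a cluster-wide reservation of the
        // email: creating it fails if the key already exists on any node.
        session.Advanced.ClusterTransaction.CreateCompareExchangeValue(
            "emails/jane@example.com", "users/jane");

        // SaveChanges() succeeds only if a majority of the voting nodes accept
        // the transaction; otherwise it throws and nothing is applied.
        session.SaveChanges();
    }

    public class User
    {
        public string Email { get; set; }
    }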

How Cluster Transactions Work

  1. A request sent from the client via SaveChanges() using TransactionMode.ClusterWide generates a Raft command, and the server waits for consensus on it.
  2. When consensus is achieved, each node validates the compare-exchange values first.
    If this validation fails, the entire session transaction is rolled back. By the nature of the Raft consensus algorithm, the cluster-wide transaction will either eventually be accepted on all nodes or fail on all of them.
  3. Once the validation has passed, the request is stored on the local cluster state machine of every node and is processed asynchronously by the relevant database.
  4. The relevant database notices that it has pending cluster transactions and starts to execute them.
    Since order matters, a failure at this stage will halt the cluster transaction execution until it is fixed.
    The possible failure modes for this scenario are listed below.
  5. Every document that has been added by the cluster transaction gets a change vector of the form RAFT:int64-sequential-number and will have priority if a conflict arises between it and a document from a regular transaction.
  6. After the database has executed the requested transaction, a response is returned to the client.
    • Upon success, the client receives the transaction's Raft index, which is added to any future requests. Performing an operation against any other node will wait for that Raft index to be applied first, ensuring order of operations (the sketch after this list shows the client-side view).
  7. In the background, the Cluster Observer tracks the completed cluster transactions and removes each one from the local cluster state machine only when it has been successfully committed on all of the database nodes.
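
The client-side view of this flow, reusing the store and User from the sketch above; the Raft-index bookkeeping of step 6 is handled by the client automatically, so no extra code is needed for read-your-writes across nodes:

    // Steps 1-2: SaveChanges() sends a single Raft command and returns only
    // after a majority of the voting nodes have accepted it.
    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        session.Store(new User { Email = "jane@example.com" }, "users/jane");
        session.SaveChanges();
    }

    // Step 6: a later request may be routed to a different node. The client
    // attaches the Raft index it received, so that node serves the request
    // only after it has applied the transaction locally.
    using (var session = store.OpenSession())
    {
        var user = session.Load<User>("users/jane"); // observes the write above
    }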

Cluster Transactions Properties

The cluster transaction feature enables RavenDB to perform consistent cluster-wide ACID transactions.
A cluster transaction can be composed of two optional parts:

  1. Compare-exchange values, which are validated and executed by the cluster.

    Compare-exchange key/value pairs can be created and managed explicitly in your code (a sketch follows this list).
    Starting from RavenDB 5.2, they can also be created and managed automatically by RavenDB.
    Compare-exchange entries that are automatically administered by RavenDB are called Atomic Guards;
    read more about them here.

  2. Store/Delete operations on documents, which are executed by the database nodes after the transaction has been accepted.
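
As a sketch of part 1, here is explicit compare-exchange management from a cluster-wide session; the key name is an illustrative assumption, and store is reused from the first sketch:

    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        // Create a key/value pair; SaveChanges() fails if the key already exists.
        session.Advanced.ClusterTransaction.CreateCompareExchangeValue(
            "licenses/seat-42", "users/jane");
        session.SaveChanges();
    }

    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        // Read the value together with the index used for the concurrency check.
        var seat = session.Advanced.ClusterTransaction
            .GetCompareExchangeValue<string>("licenses/seat-42");

        // The delete is validated against that index when the cluster
        // executes the transaction.
        session.Advanced.ClusterTransaction.DeleteCompareExchangeValue(seat);
        session.SaveChanges();
    }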

Atomicity
Once the cluster transaction request has achieved a Raft quorum and the compare-exchange values have passed the concurrency check, the transaction is guaranteed to be executed.
A failure during the quorum or the concurrency check rolls back the entire session transaction,
while a failure during the commit of the documents halts any further cluster transaction execution on the database until that failure is remedied (the failure modes for document commits are described below).

Consistency
Consistency is guaranteed on the requested node. The node completes the request only when the transaction has finished and the documents are persisted on that node. The response to the client contains the cluster transaction's Raft index, which is added to any future requests in order to ensure that the serving node has committed that transaction before answering the client.

Durability
Once the transaction has been accepted, it is guaranteed to run on all the database's nodes, even in the case of system (or even cluster-wide) restarts or failures.

Concurrent Cluster-Wide and Single-Node Transactions

Case 1: Multiple concurrent cluster transactions

Optimistic concurrency for cluster-wide transactions is handled using the compare-exchange feature. The transaction's compare-exchange operations are validated, and if they cannot be executed because the compare-exchange values have changed since the transaction was initiated, the entire session transaction is aborted and an error is returned to the client.

Optimistic concurrency at the document level is not supported for cluster-wide transactions. Compare-exchange operations should be used to ensure consistency in that regard. Concurrent cluster-wide transactions are guaranteed to appear as if they are run one at a time.

Cluster-wide transactions may only contain PUT / DELETE commands. This is required to ensure that we can apply the transaction to each of the database nodes without regard to the current node state (note: an update of a document is effectively executed as a PUT command).

Information

If the concurrency check of the compare-exchange values has passed, the transaction will proceed and will be committed on all the database nodes.
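
A sketch of handling that abort, reusing the email reservation from the first sketch; in the .NET client the failed validation surfaces as a ConcurrencyException (newer versions may throw a more specific subclass):

    using Raven.Client.Exceptions;

    try
    {
        using (var session = store.OpenSession(new SessionOptions
        {
            TransactionMode = TransactionMode.ClusterWide
        }))
        {
            // Reading tracks the value together with its index; modifying it
            // becomes a compare-exchange update at SaveChanges() time.
            var reservation = session.Advanced.ClusterTransaction
                .GetCompareExchangeValue<string>("emails/jane@example.com");
            reservation.Value = "users/jane-2";
            session.SaveChanges();
        }
    }
    catch (ConcurrencyException)
    {
        // The compare-exchange value changed since it was read; the entire
        // session transaction was rolled back. Retry or surface the error.
    }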


Case 2: Concurrent cluster and non-cluster transaction

When mixing cluster-wide transactions and single-node transactions, you need to be aware of the rules RavenDB uses to resolve conflicts between them.

Documents changed by a cluster-wide transaction will always have precedence in such a conflict and will overwrite changes made in a single-node transaction. It is common to use cluster-wide transactions for certain high-value operations, such as the creation of a new user or the sale of a product with limited inventory, and use single-node transactions for the common case.

A single-node transaction that operates on data that has been modified by a cluster-wide transaction will operate as usual; since the cluster-wide transaction has already been applied (either directly on the node or via replication), it will not be executed again.

Replication will try to synchronize the data, so in order to avoid conflicts, every document that was modified under the cluster transaction receives the special RAFT:int64-sequential-number change vector and the special flag FromClusterTx, which ensures precedence over a regular change vector.
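
A sketch of the high-value/common-case split described above, reusing store and User from the first sketch; the stock key is an illustrative assumption and is presumed to have been created earlier:

    // High-value operation: decrement limited inventory under a cluster-wide
    // transaction, so the decision is consistent across the whole cluster.
    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        var stock = session.Advanced.ClusterTransaction
            .GetCompareExchangeValue<int>("stock/limited-edition");
        if (stock.Value == 0)
            throw new InvalidOperationException("Sold out");
        stock.Value -= 1;
        session.SaveChanges();
    }

    // Common case: a regular single-node session; the write is accepted by one
    // node and then replicated asynchronously to the rest of the cluster.
    using (var session = store.OpenSession())
    {
        session.Store(new User { Email = "visitor@example.com" });
        session.SaveChanges();
    }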


Case 3: Cluster transaction with an External incoming replication

While internal replication within the cluster is discussed in the previous case, the case where two clusters are connected via external replication is a bit different.

The logic for documents that were changed by a cluster transaction versus documents that were changed by a regular transaction stays the same. However, when a conflict arises on a document that was changed by both a local cluster transaction and a remote cluster transaction, the local one takes precedence. Furthermore, the FromClusterTx flag is removed, which means that on the next conflict the document is no longer treated as modified by a cluster-wide transaction.

Failure Modes in Cluster-Wide Transactions

No majority

A cluster-wide transaction can operate only within a functional cluster. Thus, if the cluster transaction does not gain consensus from a majority of the nodes, or there is currently no leader, the transaction will be rolled back.


Concurrency issues for compare-exchange operations

Acquiring consensus does not by itself mean the transaction is accepted. Once consensus is acquired, each node performs a concurrency check on the compare-exchange values. If the concurrency check fails, the transaction is rolled back.


Failure to apply transaction on database nodes

Once the transaction has passed the compare-exchange concurrency check, it is guaranteed to be committed. Any failure at this stage must be remedied before execution can continue.

  • Out of disk space: Freeing disk space will fix the problem and allow cluster transactions to be committed.
  • Creation/deletion of a document with a different collection: Delete the document from the other collection.

Warning

The execution of cluster transactions on the database will be stopped until these types of failures are fixed.

Debug Cluster-Wide Transactions

The current state of the cluster transactions that are waiting to be completed by all of the database nodes can be found at:

  • URL: /databases/*/admin/debug/cluster/txinfo
  • Type: GET
  • Permission: DatabaseAdmin

Parameters

  • from (optional): long, default: 0. Get cluster transactions starting from this Raft index.
  • take (optional): int, default: int.MaxValue. The number of cluster transactions to show.
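
For example, on a secured cluster the endpoint can be queried with curl; the server URL, database name, and client certificate (which must grant DatabaseAdmin) are placeholders:

    curl --cert admin-client.pem \
      "https://<server-url>/databases/<database-name>/admin/debug/cluster/txinfo?from=0&take=128"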