Sharding: Overview
-
Sharding, supported by RavenDB from version 6.0 and on, is the distribution of a database's content between autonomous Shards.
-
In most cases, sharding is implemented to allow efficient usage and management of exceptionally large databases (i.e. a 10-terabyte DB).
-
Sharding is managed by the RavenDB server, no special adaptation is required from clients when accessing a sharding-capable server or a sharded database.
- The client API is unchanged under a sharded database.
Clients of RavenDB versions older than 6.0 (that provided no sharding support) can seamlessly connect a sharded database, without making any adaptations or even knowing that the database they connect is sharded. - Particular modifications in RavenDB features under a sharded database are documented in detail in feature-specific articles.
- The client API is unchanged under a sharded database.
-
Each Shard hosts and manages a unique subset of the database content.
Documents are sorted between shards by their document ID. -
Each RavenDB shard is hosted by at least one cluster node.
Shards can be replicated over multiple nodes to increase data accessibility. -
In this page:
Sharding
As a database grows very large,
storing and managing it may become too demanding for any single node.
System performance may suffer as resources like RAM, CPU, and storage are
exhausted, routine chores like indexing and backup become massive tasks,
responsiveness to client requests and queries slows down, and the system's
throughput spreads thin serving an ever-growing number of clients.
As the volume of stored data grows, the database can be scaled out by
splitting it into shards, allowing it to be
handled by multiple nodes and presenting practically no limit to its growth.
The size of the overall database, comprised of all shards, can reach in
this fashion dozens of terabytes and more while keeping the resources
of each shard in check and maintaining its high performance and throughput.
Licensing
Sharding is Fully Available on an Enterprise license.
- On a Developer license, the replication factor is restricted to 1.
- On Community and Professional licenses, all shards need to be on the same node.
Learn more about licensing here.
Client-Server Communication
As a client connects a sharded database, it is appointed a RavenDB server
that functions as an orchestrator and mediates all the communication
between the client and the database shards.
The client remains unaware of this process and uses the same API used by
non-sharded databases to load documents, query, and so on.
The additional communication between the client and the orchestrator and
between the orchestrator and the shards does, however, present an overhead
over the usage of a non-sharded database.
When Should Sharding Be Used?
While sharding solves many issues related to the storage and management of high-volume databases, the overhead it presents outweighs its benefits when the database size still poses no problem. We can postpone the transit to a sharded database when, for example, the database size is 100 GB, the server is well equipped and would comfortably handle a much larger volume, and no dramatic increase is expected in the number of potential users any time soon.
We recommend that you plan ahead for a transition to a sharded database when
your database size is in the vicinity of 250 GB, so the transition is already well
established when it reaches 500 GB.
RavenDB 6.0 and above can migrate its database to a sharded database via external replication or export & import operations.
You cannot, however, upgrade a non-sharded database into a sharded one.
To upgrade RavenDB to 6.0 and migrate the database data you will need
to upgrade the server, create a new, sharded database, and replicate or
export the data into it.
Shards
While each cluster node of a non-sharded database handles a full replica
of the entire database, each shard is assigned a subset of the
entire database content.
Take, for example, a 3-shards database, in which shard 1 is populated with
documents Users/1
..Users/2000
, shard 2 with documents Users/2001
..Users/4000
,
and shard 3 with documents Users/4001
..Users/6000
.
A client that connects this database to retrieve Users/3000
and Users/5000
would be served by an automatically-appointed orchestrator node
that would seamlessly retrieve Users/3000
from shard 2 and Users/5000
from
shard 3 and hand them to the client.
As much as clients are concerned a sharded database is still a single entity: the clients are not required to detect whether the database is sharded or not, and clients of RavenDB versions prior to 6.0, which had no sharding support, can access a sharded database unaltered.
Shard-specific operations are, however, available: a client can, for example, track the shard that a document is stored at and query this shard, and Studio can be used to relocate (reshard) documents from one shard to another.
Studio Document View
Shard Replication
Similarly to non-sharded databases, shards can be replicated by cluster nodes to ensure the continuous availability of all shards in case of a node failure, provide multiple access points, and load-balance the traffic between shard replicas.
The number of nodes a shard is replicated to is determined by the Shard Replication Factor.
Shard Replication
- In the image above, a 3-shards database is hosted by a 5-nodes cluster (where
two of the nodes, D and E, are unused by this database).
The Shard Replication Factor is set to 2, maintaining two replicas of each shard.
Buckets
Documents are stored in a sharded database within virtual containers named Buckets.
The number of documents and the amount of data stored in each bucket may vary.
Buckets Allocation
The number of buckets allocated for the whole database is fixed, always remaining
1,048,576 (1024 times 1024).
Each shard is assigned a range of buckets from this overall portion, in which
documents can be stored.
Buckets Allocation
Buckets Population
Buckets are populated with documents automatically by the cluster.
A hash algorithm is executed over each document ID. The resulting
hash code, a number between 0 and 1,048,576, is the number of the
bucket in which the document is stored.
Buckets Population
As buckets are spread among different shards, the bucket number allocated for a document also determines which shard the document will reside on.
Document Extensions Storage
Document extensions (i.e. Attachments, Time series, Counters, and
Revisions) are stored in the same bucket as the document they belong to.
To make this happen, the bucket number (hash code) they are given
is calculated by the ID of the document that owns them.
Anchoring Documents to a Bucket
You can make documents share a bucket (and therefore a shard) by
adding their ID a suffix by which RavenDB will calculate their bucket number.
Learn here why and how to do this.
Resharding
Resharding is the relocation of data placed on one shard, on another shard, to maintain a balanced database in which all shards handle about the same volume of data.
The resharding process moves all the data related to a certain
bucket, including documents, document extensions, tombstones, etc.,
to a different shard, and then associates the bucket with the new shard.
E.g.,
-
Bucket
100,000
was initially associated with shard 1.
Therefore, all data added to this bucket has been stored in shard 1. -
Resharding bucket
100,000
to shard 2 will:- Move all the data that belongs to this bucket to shard 2.
-
Associate bucket
100,000
with shard 2.
From now on, any data added to this bucket will be stored in shard 2.
Paging
From a user's perspective paging
is conducted similarly in sharded and non-sharded databases, using
the same API.
Paging is more costly in a sharded database, however, since the
orchestrator must load data from each shard and sort the retrieved
results before handing the selected page to the user.
Read more about this subject here.
Using Local IP Addresses
The local IP address of a cluster node can be exposed, so other cluster nodes would prioritize it when they access the node. Using a node's local IP address rather than a public one for inter cluster communications can speed up the service and offer substantial savings over time.
Using this method can be particularly helpful in a sharded cluster, since each client request is handled by an orchestrator, that may communicate the request and its results with all other shards.
Use this configuration option to expose a node's local IP address to other nodes.
Creating a Sharded Database
-
A RavenDB cluster can run sharded and non-sharded databases in parallel.
-
When a database is created, The user can choose whether it would be sharded or not. The ability to make this choice is provided by RavenDB (6.0 and on) by default, no further steps are required to enable the feature.