RavenDB vs Elasticsearch: The Best Open Source NoSQL Database
by Oren Eini
When choosing your next NoSQL database to download, which is the better fit for you: RavenDB or ElasticSearch?
A modern, top-notch database needs a remarkable set of capabilities. Its throughput must be suitable for Big Data management. It must be flexible enough to scale up easily when the amount of data grows exponentially, and it must stay available, remaining reachable for clients and customers even in the presence of failure.
How do the best open source NoSQL databases stand up to their challenges?
Two industry leaders, ElasticSearch and RavenDB, are compared here on their approach to and implementation of 10 parameters: data integrity, security, data model, version control, querying, data delivery, sharding, communication, memory management, and scaling out.
ElasticSearch uses Apache’s Lucene open source search engine. Lucene is not ACID, which exposes your database to various failure modes that may end in data corruption or loss that you may not even be aware of.
ElasticSearch is transactional on the document level, which gives you a standard of data consistency. The exposure, however, is that it can lead to partial batch transactions in which some documents are committed and others are not.
RavenDB is fully transactional across your database and throughout your cluster. You can use it to store documents and document batches with full confidence that, whatever the transaction’s scope is, it will be completed as a whole or reverted in its entirety.
Each RavenDB instance is a standalone server that can function on its own whenever the need arises. Each instance also maintains cluster state, and can join your cluster as one of its nodes in a few clicks. Tasks like expanding your network, distributing your database while maintaining its integrity, and sharing data and workload are all trivial for RavenDB and its users.
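The difference between document-level and batch-level transactions can be illustrated with a toy model. This is a minimal sketch that uses plain dictionaries rather than either database's client API; the `validate` rule and the function names are assumptions made up for the illustration.

```python
class ValidationError(Exception):
    pass

def validate(doc):
    # Hypothetical rule for this sketch: every document needs an "id".
    if "id" not in doc:
        raise ValidationError(f"document without id: {doc!r}")

def write_per_document(store, batch):
    """Each document commits on its own; a bad one does not undo the rest."""
    failed = []
    for doc in batch:
        try:
            validate(doc)
            store[doc["id"]] = doc
        except ValidationError:
            failed.append(doc)
    return failed

def write_atomic(store, batch):
    """All-or-nothing: stage writes on a copy, publish only if all succeed."""
    staged = dict(store)
    for doc in batch:
        validate(doc)  # raises before anything is published
        staged[doc["id"]] = doc
    store.clear()
    store.update(staged)

batch = [{"id": "orders/1"}, {"total": 10}, {"id": "orders/3"}]

partial = {}
failed = write_per_document(partial, batch)  # two documents land, one fails

atomic = {}
try:
    write_atomic(atomic, batch)
except ValidationError:
    pass  # the store is left exactly as it was before the batch
```

With the same faulty batch, the per-document store ends up holding two of the three documents, while the atomic store stays empty.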
A multi-master distributed strategy allows each node to function on its own if the network ever partitions. Clients can read and write using any node, and the changes made to an isolated server’s database are merged back to the cluster as soon as connectivity heals.
Both ElasticSearch and RavenDB encrypt data in transit and at rest.
ElasticSearch encrypts transferred data and authenticates clients using TLS X.509 certification.
Using a manual installation, your indexes and shards can be backed up fully and incrementally. Backup files are not encrypted.
Like ElasticSearch, RavenDB uses TLS X.509 certification during transit to encrypt data and authenticate its clients.
All data at rest, including documents, indexes and other components, is encrypted using 256-bit encryption and the XChaCha20 algorithm.
RavenDB provides a comprehensive automated backup system whose output files may be encrypted to keep them secure while stashed for safekeeping.
You can set backup routines by time interval and type to run a full backup every 24 hours and an incremental one every 30 minutes.
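The scheduling rule in that example can be sketched as a small decision function. This is only an illustration of the 24-hour and 30-minute intervals mentioned above, not RavenDB's backup configuration; the function name and the assumption that a full backup also resets the incremental clock are hypothetical.

```python
from datetime import datetime, timedelta

FULL_EVERY = timedelta(hours=24)
INCREMENTAL_EVERY = timedelta(minutes=30)

def due_backup(now, last_full, last_incremental):
    """Return "full", "incremental", or None for the given moment."""
    if now - last_full >= FULL_EVERY:
        return "full"
    # Assumption: a full backup also resets the incremental clock.
    if now - max(last_full, last_incremental) >= INCREMENTAL_EVERY:
        return "incremental"
    return None

start = datetime(2024, 1, 1, 0, 0)
too_early = due_backup(start + timedelta(minutes=10), start, start)
half_hour = due_backup(start + timedelta(minutes=30), start, start)
next_day = due_backup(start + timedelta(hours=24), start, start)
```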
ElasticSearch is a distributed document store database.
RavenDB is a multi-model database that includes a Document Store, Key-Value Store, Graph API, Distributed Counters and Binary Attachments.
Version Control and Auditing
ElasticSearch offers no version control.
RavenDB supports version control at the database and collection level. If you enable it, any change to a document generates a "revision": an immutable copy of the document as it was before the change.
Keeping track of a document's development may be useful in many situations. In regulated industries, maintaining a trail of data modifications may even be required by law.
You can keep all the revisions ever created in your database, or limit the revision history to a chosen number of revisions or to a certain period of time.
You can revert any document to any of its revisions, traveling back and forth in time.
An especially powerful feature lets you use revisions on a bigger scale and revert the whole database to its state at a certain time. This may turn from handy to crucial if, for example, a flow of mistaken data has rendered an important database unusable and you want to follow the changes to their origin and salvage your data.
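The per-document side of this behavior can be sketched with a toy in-memory store. This is an illustration of the idea only, not RavenDB's implementation or API: the class, its methods and the `max_revisions` parameter are hypothetical, and the sketch assumes that reverting a document is itself recorded as a change.

```python
import copy

class RevisionedStore:
    def __init__(self, max_revisions=None):
        self.docs = {}
        self.revisions = {}  # doc id -> earlier copies, oldest first
        self.max_revisions = max_revisions

    def put(self, doc_id, doc):
        if doc_id in self.docs:
            # Keep a copy of the document as it was before this change.
            history = self.revisions.setdefault(doc_id, [])
            history.append(copy.deepcopy(self.docs[doc_id]))
            if self.max_revisions is not None:
                excess = len(history) - self.max_revisions
                if excess > 0:
                    del history[:excess]  # drop the oldest revisions
        self.docs[doc_id] = copy.deepcopy(doc)

    def revert(self, doc_id, index):
        """Restore an earlier revision; the revert is recorded as a change."""
        self.put(doc_id, self.revisions[doc_id][index])

store = RevisionedStore(max_revisions=2)
for price in (10, 12, 15, 20):
    store.put("products/1", {"price": price})

kept = [rev["price"] for rev in store.revisions["products/1"]]
store.revert("products/1", 0)  # back to the oldest kept revision
```

After four writes with a limit of two, only the two most recent earlier versions remain available to revert to.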
NoSQL Database Query Language
For anyone accustomed to SQL, getting used to ElasticSearch's Query DSL (its domain-specific query language) may take some time.
The DSL isn't the easiest thing to work with, especially when more advanced features are involved, such as nesting aggregations and filters in a single query.
Complex searches are performed using a JSON-based format that tends to become overly verbose, yielding large and ridiculously complex queries.
The Raven Query Language (RQL) is an intuitive dialect of SQL. Some 80% of it is SQL, allowing beginners to start easily and experienced users to use their current knowledge as a bridge and move on to NoSQL as pleasantly as possible.
Carrying your experience along to your next level doesn't stop here, as advanced features like Graph querying are supported by RQL as well.
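To give a feel for the two styles, here is the same question ("how many orders per city?") written as an ElasticSearch request body and as an RQL statement. Treat it as an illustrative sketch: the `Orders` collection and the `ShipTo.City` field are assumed sample names, and the ElasticSearch side assumes the field is mapped as a keyword.

```python
import json

# Hypothetical data: orders with a ShipTo.City field, assumed to be mapped
# as a keyword so that a terms aggregation can bucket on it.
es_request_body = {
    "size": 0,  # return only the aggregation, no individual hits
    "aggs": {
        "orders_per_city": {
            "terms": {"field": "ShipTo.City"}
        }
    },
}

# The same question as a single RQL statement (collection name assumed).
rql = "from Orders group by ShipTo.City select ShipTo.City, count()"

print(json.dumps(es_request_body, indent=2))
print(rql)
```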
Data Delivery Performance
Querying and Aggregating the Data
ElasticSearch breaks its indexes into shards that can each be stored on a different node. An index that has grown beyond a single machine's capacity to handle it can thus be handled by several. This considerably improves server performance and lowers client latency.
While this is suitable for simple queries, more complex ones are likely to incur overhead, such as iterating through result sets and tallying the final result each time the query is executed. It is therefore common to find an add-on like Apache Hadoop helping ElasticSearch with query optimization.
To boost performance, RavenDB supports dynamic queries as well as predefined indexes. When you create and execute queries, a query optimizer automatically finds and improves their indexes. As your queries evolve, index optimization continues as well. The more queries you run, the better indexes RavenDB sets for you and the faster your queries are answered.
RavenDB implements a native MapReduce feature, eliminating the need for third-party add-ons. For aggregation queries, RavenDB tallies totals “the old-fashioned way” only the first time an aggregation query is made. From then on, the total is updated each time data is written to the database. This reduces the time it takes to produce aggregates by up to 99%.
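The idea of tallying once and then updating on every write can be sketched in a few lines. This is a toy model of incremental aggregation under assumed field names (`customer`, `amount`), not RavenDB's MapReduce engine.

```python
from collections import defaultdict

class OrderTotals:
    """Totals per customer, kept up to date on every write."""

    def __init__(self, orders):
        self.orders = {}
        self.totals = defaultdict(int)
        self.full_scans = 0
        self._rebuild(orders)

    def _rebuild(self, orders):
        # The one-time full pass over the existing data.
        self.full_scans += 1
        for order_id, order in orders.items():
            self.orders[order_id] = order
            self.totals[order["customer"]] += order["amount"]

    def write(self, order_id, order):
        # Adjust only the affected total instead of rescanning everything.
        old = self.orders.get(order_id)
        if old is not None:
            self.totals[old["customer"]] -= old["amount"]
        self.orders[order_id] = order
        self.totals[order["customer"]] += order["amount"]

    def total_for(self, customer):
        return self.totals[customer]

totals = OrderTotals({
    "orders/1": {"customer": "alice", "amount": 30},
    "orders/2": {"customer": "bob", "amount": 20},
})
totals.write("orders/3", {"customer": "alice", "amount": 15})  # new order
totals.write("orders/2", {"customer": "bob", "amount": 25})    # updated order
```

Reading a total afterwards is a dictionary lookup; the data is scanned in full only once, when the aggregate is first built.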
ElasticSearch aggressively caches data for future queries.
Frequently used aggregates, like ones loaded from popular pages, can be cached for reuse. Queries are kept in editable form, but are also translated to Bitsets to provide clients with quicker response. Bitsets are cached to make them immediately available when reused data is requested, and are wisely updated as queries are added.
RavenDB supports caching at multiple levels. Repeated client requests are detected and often served directly from the client-side cache, only talking to the server to verify they are still current. This can dramatically reduce the amount of data transferred over the network and improve overall system performance.
RavenDB also supports aggressive caching, taking it one step further. Clients do not even need to approach the server to verify that the cached data is current. Instead, the server notifies them whenever data changes on a node, and outdated cached data is invalidated.
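The two caching levels can be contrasted with a toy client and server. This is a sketch of the general pattern (version checks versus change notifications), not the RavenDB client; every class and method name here is hypothetical.

```python
class Server:
    def __init__(self):
        self.docs = {}         # doc id -> (version, body)
        self.requests = 0      # how many times clients contacted the server
        self.subscribers = []  # callbacks for change notifications

    def put(self, doc_id, body):
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, body)
        for notify in self.subscribers:
            notify(doc_id)

    def get(self, doc_id, known_version=None):
        self.requests += 1
        version, body = self.docs[doc_id]
        if version == known_version:
            return version, None  # "not modified": no body is sent
        return version, body

class Client:
    def __init__(self, server, aggressive=False):
        self.server = server
        self.aggressive = aggressive
        self.cache = {}  # doc id -> (version, body)
        if aggressive:
            server.subscribers.append(self.invalidate)

    def invalidate(self, doc_id):
        self.cache.pop(doc_id, None)

    def load(self, doc_id):
        cached = self.cache.get(doc_id)
        if cached and self.aggressive:
            return cached[1]  # trust the cache until the server says otherwise
        known = cached[0] if cached else None
        version, body = self.server.get(doc_id, known)
        if body is None:
            body = cached[1]  # the server confirmed the cached copy is current
        self.cache[doc_id] = (version, body)
        return body

server = Server()
server.put("users/1", {"name": "Ada"})

regular = Client(server)
regular.load("users/1")
regular.load("users/1")  # second call still contacts the server to verify
regular_requests = server.requests

eager = Client(server, aggressive=True)
for _ in range(3):
    eager.load("users/1")  # only the first call reaches the server
eager_requests = server.requests - regular_requests

server.put("users/1", {"name": "Grace"})  # notification evicts the stale copy
fresh = eager.load("users/1")
```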
Sharding the Data
To handle large data sets, ElasticSearch divides its database into indices. Each index can be further divided into shards, which can be then replicated if necessary.
Shards are normally placed on nodes near their potential users. Short distances to clients and the right number of shards reduce the load of each server request.
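How documents end up in a particular shard can be sketched as a hash of a routing key taken modulo the number of shards. This is a simplified illustration: ElasticSearch uses its own hash function and routing details, and `zlib.crc32` stands in here only to keep the example self-contained. The function names are hypothetical.

```python
import zlib

def shard_for(routing_key, number_of_shards):
    # Stand-in hash for the sketch; a stable hash sends a key to the same shard.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards

def distribute(doc_ids, number_of_shards):
    shards = {n: [] for n in range(number_of_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, number_of_shards)].append(doc_id)
    return shards

doc_ids = [f"orders/{n}" for n in range(1, 101)]
shards = distribute(doc_ids, 5)
```

Because the shard is derived from the key and the shard count, changing the number of shards changes where existing documents belong, which is one reason the shard count is worth choosing carefully up front.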
RavenDB doesn’t support automatic sharding in its current version; this feature is planned for version 5.0.
Communication with Machines outside Your Cluster
In order to perform Extract, Transform and Load operations on ElasticSearch, you need to add on a third-party tool.
RavenDB supports automated data transfers using ETL between itself and relational databases, non-relational databases, and the cloud. No outside applications or third-party add-ons are needed.
You can replicate documents from your database to a relational database, enabling a wide variety of analyses and reports using your familiar setting and existing reporting toolset. It also makes working with polyglot architectures, especially microservices, a snap.
ElasticSearch runs on the JVM, whose standard garbage collection (GC) may pause program execution at an arbitrary point. This is called "stop-the-world" garbage collection.
When 75% of a computer’s memory is clogged with unused objects, garbage collection is executed automatically. Such GC can cause higher CPU usage, increased latency, and frequent shard relocation as ElasticSearch tries to keep the cluster available and balanced.
ElasticSearch’s solution is to raise memory usage percentage manually, triggering the JVM to minimize garbage collection periods. This requires higher resource allocation to the system.
RavenDB runs on a managed runtime as well (the CoreCLR) and has to deal with similar garbage collection issues. Its solution is very different though: it has taken direct ownership over many memory operations, and now manages them itself outside the scope of the GC.
This means that RavenDB is able to further optimize its memory utilization, reduce memory usage and greatly diminish the cost of garbage collection.
This is part of the reason RavenDB is so fast, with each node capable of handling tens of thousands of requests per second with consistent latency and throughput.
Schema vs Schemaless
ElasticSearch requires you to define data by its type, which requires a schema. Once you set the data type for any field, you cannot change it. If you need to scale up queries, you may need a data migration.
RavenDB is schemaless. You do not need to set data types, and can modify documents as you please. Queries are not based on your schema or data structure, but on the information you are looking for.
You can scale up without having to make fundamental changes to your database setup, and need no migration. This may save much valuable time in your release cycle.
Easy scalability reduces latency, lightens the load on each node, and provides you with an extra layer of security, as full database copies and chosen parts of a database are easily replicated.