To Structure and Summarize Data from Various Sources HitHorizons Leverages RavenDB’s Unique Indexing Prowess January 13, 2023

To Structure and Summarize Data from Various Sources HitHorizons Leverages RavenDB’s Unique Indexing Prowess

by  Shahar Erez
published: January 13, 2023 | updated: January 19, 2023

RavenDB simplifies HitHorizon’s ability to organize and aggregate massive, heterogeneous data imports and reduce the DevOps costs of producing accurate reports while improving time to market of new features.

About HitHorizons

FinStat, through the project HitHorizons.com, aggregates data on European companies to improve the transparency of the business environment by providing firmographic data and statistical reports. They are a company of about 40 employees, 8 of whom are developers that are split between 2 projects. Their solutions are used by small and medium-sized enterprises, journalists, government agencies, non-profit organizations, business developers, and marketing professionals. Users can easily do background checks on potential business partners. Also, the general public uses their tools to get information about various service providers, such as home improvements or elderly care.

Their solution requires a data management system that can inexpensively and quickly process vast amounts of data and give fast answers. Their dataset grows quickly, so they need a solution that can grow with them. So far, it consists of 100s of gigabytes of data in about 100 million documents. They are used by hundreds of thousands of people and have millions of monthly page views.

Why HitHorizons and FinStat chose RavenDB

They tested performance and costs using out-of-the-box settings with RavenDB, Azure CosmosDB, Elasticsearch, PostgreSQL, and MSSQL.

“RavenDB was anywhere from twice as efficient to several times more efficient than the others.”

Andrej Krivulcik, Software Engineer at HitHorizons

They assume that if they had invested a lot of work to optimize the other databases, the others would likely have come closer, but this much optimization work would slow down their DevOps process. 

Benefits of RavenDB and the features used to unlock them

Maximizes TTM/release cycle velocity by removing DevOps obstacles

As a NoSQL database, RavenDB doesn’t require a strict schema which has given HitHorizons a lot of flexibility. DevOps teams enjoy the freedom because flexible structures make development much more intuitive. Flexible structure translates to more agility in development. It also enables you to process and make sense of heterogeneous data from many sources in various formats.

HitHorizons developers don’t need to map every possible field for every company, even if many fields contain information that isn’t relevant to a specific scenario. Instead, RavenDB’s map-reduce index organizes the data behind the scenes. If a company adds a data property and sends it to HitHorizons, it won’t cause any schema-related issues because the index is defined to map only specified fields.

They use RavenDB’s unique indexing functionality, including native map-reduce, to organize and analyze the messy data that they receive from millions of businesses so that they can make educated decisions on which features and business insights to focus on.

Testing in RavenDB is simplified with in-memory embedded servers designed to save time on integrated and unit tests.

Finally, in each release cycle, to quickly initialize a development database with the latest data, they use RavenDB’s simple backup-restore operation.

Easy to provide clean data for reporting

Today’s data scientists spend an average of 40% of their time cleaning data due to data being lost/corrupted or if it comes from various sources.

ACID Transactions: Unless they use ACID transactions, NoSQL databases can easily corrupt data. ACID transactions are the default in RavenDB, and because this was by design from the beginning, there is no performance penalty for preventing data corruption.

Map/Reduce Index structures and summarizes heterogeous data: HitHorizons regularly imports enormous amounts of semi-structured data in multiple formats and schemas. They need to turn this data into accurate, usable reports without breaking the bank.

“RavenDB’s native Map-reduce indexes bring structure to the unstructured data.”

While developing, they used RavenDB’s native Map-Reduce indexes to aggregate the data and be able to decide which fields to choose for their solution. For example, some fields were only filled by ~10% of the companies. With map-reduce, they can easily identify and not query irrelevant and incomplete data. Thus, they can feed their clients with valuable data in production while leaving the raw data as is. 

Another tool that RavenDB provides to filter and transform data is RavenDB’s ETL which allows you to define transform scripts before the data is transferred. ETL also doesn’t affect the source data.

Tremendous cost savings and better performance

RavenDB allows you to get the same amount of work done while using much less server energy, thus leaving a smaller ecological footprint and increasing profits. There are many features that substantially reduce trips to the server. RavenDB’s batching of concurrent ACID transactions, lazy requests, including multiple related documents in one session, and client-side caching with instant invalidation all add up to significant cost savings.

Also, your queries will never have to scan the whole dataset because in RavenDB they always use indexes. Whenever your data changes, the relevant indexes update themselves and can run computations so that queries don’t have to. RavenDB indexes do this work asynchronously by default. This means that even your complex queries are always lightning fast and inexpensive because the indexes have already done the work for them incrementally. In traditional databases, queries have to scan all of the data that they’re instructed to return every time a request is made. RavenDB indexes do that work once and are updated incrementally for any changes in data. This approach translates to substantial IO savings and faster performance.

User-friendly for the Operations team

Top-notch security certificates are out of the box in the RavenDB cloud platform and only take a few minutes to set up on-premise. Once you have a certificate, encryption in transit (https) is ready to go, and encryption at rest is easy to set up whenever creating a new database. You can give each of your clients different levels of security authorization per database to limit what they can access and which operations they can do. You can also specify which IPs have sole access to your cloud servers for built-in firewall-like capabilities.

So, if you’re using RavenDB’s managed cloud service, you will save time and money having to maintain a firewall, persistence layer, cluster certificates, external map-reduce, and no need to patch operating systems. Scaling vertically and horizontally is trivial, failover is seamless and nodes in a cluster can easily be spread over multiple data centers in the same region to prevent failure whenever a datacenter goes offline. They run on AWS, GCP, and Azure.

Ongoing backups are simple to set up and configure per database. They include full snapshots or logical backups plus incremental backups of any changes. As Andrej wrote, you just “set it and forget it.” If a database is lost due to hardware problems or user error, RavenDB makes it easy to restore a database from the backup that you choose.

Finally, the RavenDB Studio GUI is part of the package. It makes it easy to perform many tasks and provides modular dashboards to keep your fingers on the pulse of server resources utilization, indexes, performance, errors, notifications, cluster health, backups, and more.

Full-text search, spatial queries, aggregations, facets, highlights, and suggestions are all part of the package. Results are automatically ranked according to relevancy (and specific fields can be boosted.) You can specify search operators, use prefixes and wildcards, get suggestions and similar terms, show text snippets with terms highlighted, and more. You can also specify or create custom analyzers using the Lucene library.

High Availability – Servers crashed but the data cluster still ran smoothly

RavenDB clusters are exceptionally stable, providing automatic and seamless failover, then automatic rehabilitation of nodes when they come back online. To ensure high performance without breaking the bank, HitHorizons uses host-local NVMe disks available in cloud virtual machines. The downside to these disks is that they can lose all data in situations such as virtual machine reallocation in the data center.

In one remarkable instance, all three nodes in a RavenDB cluster suffered such catastrophic data loss staggered over the course of a single month. In each case, the cluster came back online automatically, thanks to RavenDB automatic cluster recovery and a bit of automation to prepare the disks after booting up on a fresh virtual machine with a blank disk.

One of the three instances even went unnoticed until several weeks later. The HitHorizons team was investigating an unrelated issue with RavenDB support. The support team pointed out that there was a cluster degradation at a time when there were no issues with the service. The degraded node restored the data from the other nodes without any service disruption.

In response to this, HitHorizons improved its monitoring to include the database and the underlying infrastructure. This enhances the insights into the database system as a whole.

Difficulties they had with RavenDB and how they solved them

Whenever they have a problem, they write RavenDB support, which Andrej thinks is fantastic. “It’s maybe the best support experience I’ve had.” They also use the active RavenDB GitHub community discussion forum, RavenDB’s rich documentation, and code walkthroughs. RavenDB founder Oren Eini gives monthly webinars where he demonstrates how to make the most of RavenDB, and interactive workshops are available. RavenDB founder Oren Eini teaches how to get started in this basics webinar.

Upgrading to the latest versions can cause features to stop working. Once, they had to roll back to the older version, test their solution with the new version, and then deploy it. Andrej advises people to test their project with any new version before releasing it in production.

Woah, already finished? 🤯

If you found the article interesting, don’t miss a chance to try our database solution – totally for free!

Try now try now arrow icon