Kafka ETL

Release 5.4 is all about integrations. RavenDB can seamlessly integrate with your existing ecosystem, pushing events into Kafka topics.

What is Kafka?

LinkedIn engineers developed Kafka to address the challenges they faced building complex distributed systems. Built for scalability and speed, this distributed messaging platform became a cornerstone of LinkedIn's infrastructure; today, over 1.4 trillion messages pass through it daily. Following its initial success, LinkedIn released Kafka as an open source project in January 2011. Since then, thousands of companies and development teams have embraced this approach to building highly scalable, event-driven systems.

In essence, Kafka is a distributed publish/subscribe messaging platform with a twist - it combines queuing with message retention on disk. As part of your infrastructure, Kafka enables cooperation between the various components of your distributed system. This universal pipeline provides the means for exchanging messages between Publishers (a.k.a. Producers) and Subscribers (a.k.a. Consumers).

Kafka logo

Stream vs Queue

Multiple producers can publish messages to the Kafka stream, while, on the other end, multiple consumers can subscribe to the stream, read messages, and react to their content. Kafka itself is not concerned with message content. Data of many different types can coexist, divided into “topics” - one per kind of data. Producers assign a topic to each message they push into Kafka’s event stream, while Consumers only need to concern themselves with the topics they are interested in. Hence, each topic can support multiple Producers and multiple Consumers.
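For example, here is a minimal producer sketch using the kafkajs Node.js client; the broker address, client id, topic name, and message shape are all illustrative:

```javascript
const { Kafka } = require("kafkajs");

// Illustrative broker address and client id.
const kafka = new Kafka({ clientId: "checkout-service", brokers: ["localhost:9092"] });

async function publishOrderCompleted(orderId) {
    const producer = kafka.producer();
    await producer.connect();
    // The producer assigns a topic ("orders") to the message it pushes into the stream.
    await producer.send({
        topic: "orders",
        messages: [{ key: orderId, value: JSON.stringify({ orderId, status: "completed" }) }],
    });
    await producer.disconnect();
}
```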

Unlike a queue, the Kafka cluster retains messages according to a per-topic retention policy. Reading a message from the stream does not remove it - it remains in the stream, available for any other consumer to read and process. Consequently, every consumer maintains and updates its own “cursor” (offset), which represents its position within a specific topic. This mechanism enables several subscribers to operate independently over the same topic without affecting each other. Additionally, every consumer can move back and forth over the messages in a topic and re-process them if needed.
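A matching consumer sketch, again with kafkajs and illustrative names - note that the offset (“cursor”) is tracked per consumer group, so each subscribing service reads the topic independently:

```javascript
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "shipping-service", brokers: ["localhost:9092"] });

async function consumeOrders() {
    // Each consumer group maintains its own offset per topic partition,
    // so several services can process the same topic without affecting each other.
    const consumer = kafka.consumer({ groupId: "shipping" });
    await consumer.connect();
    // fromBeginning: true replays all retained messages from the start of the topic.
    await consumer.subscribe({ topic: "orders", fromBeginning: true });
    await consumer.run({
        eachMessage: async ({ topic, partition, message }) => {
            console.log(`${topic}[${partition}] @${message.offset}: ${message.value.toString()}`);
        },
    });
}
```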

Kafka stream vs queue

Transactional guarantees across the whole system

Built around the concept of a commit log, Kafka is optimized for writes and offers transactional guarantees for accepting and storing incoming messages. This ensures that no messages are lost, even during extreme spikes in producer traffic.

With RavenDB 5.4, instead of implementing application code that pushes messages and events to Kafka, you can offload this task to your database. Kafka ETL support allows RavenDB to take on the role of message producer in a Kafka architecture. This ongoing task extracts data from the database, transforms it with your custom script, and loads the resulting JSON object to a Kafka destination. Not only does this shorten development time, but you can also rely on this persistent mechanism, which implements a retry process to resiliently push your messages to Kafka.
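To give a feel for the transform step, here is a sketch of a transform script for a Kafka ETL task defined on an Orders collection. The loadTo&lt;TopicName&gt; call sends the object to the corresponding topic; the topic name, document shape, and attribute values here are illustrative:

```javascript
// Runs for every document in the Orders collection.
var orderData = {
    Id: id(this),
    LinesCount: this.Lines.length,
    TotalCost: 0
};

for (var i = 0; i < this.Lines.length; i++) {
    var line = this.Lines[i];
    orderData.TotalCost += line.PricePerUnit * line.Quantity;
}

// Load the transformed object to the "Orders" topic; the second argument
// carries optional message attributes.
loadToOrders(orderData, {
    Id: id(this),
    Type: 'com.example.orders',      // illustrative event type
    Source: '/ravendb/etl/orders'    // illustrative event source
});
```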

Kafka transactional guarantees

Why should you use it?

Imagine a complex e-commerce system where you must perform multiple actions after a customer completes the checkout. These actions may include stock level adjustments, accounting operations, packing, shipping, capacity planning, etc. Good software design practices will guide you through a planning and design process that results in multiple services, each representing one coherent set of operations performed by a department or office in your organization. And since those departments communicate to perform business operations, you need to implement the same communication patterns in your enterprise software.

Hence, after checkout completes, the Checkout Service will publish a notification to all interested services that want to react to this event. This publishing process needs to be reliable and technology agnostic - you do not want to lose important system events like the completion of an Order, and you also want to give each team in your organization a free choice of technologies to implement its service.

Kafka why should you use it

Benefits of using built-in Kafka ETL

You can write those events from your code, but then you have to deal with race conditions and consistency issues in distributed systems. It is possible that you’ll commit a transaction to RavenDB but fail to write the events to Kafka. Or a processor will be quick enough to read the event from Kafka before the RavenDB transaction commits. RavenDB’s Kafka ETL support allows you to seamlessly implement the Outbox pattern, cleanly resolving this issue.

For example, along with the Order being created, you will generate an OrderShouldBeProcessedCommand document, and both will be saved transactionally to your RavenDB database. After that, Kafka ETL will pick up the Command document, hand it over to Kafka, and delete it from the RavenDB database. Hence, you created a new Order document, and RavenDB automatically pushed the Command to your Kafka infrastructure.
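A minimal sketch of the application side, using the RavenDB Node.js client; the class names and database details are illustrative, and the Kafka ETL task (defined separately) is assumed to be configured to pick up and delete the command documents:

```javascript
const { DocumentStore } = require("ravendb");

class Order {
    constructor(lines) { this.Lines = lines; }
}

class OrderShouldBeProcessedCommand {
    constructor(orderId) { this.OrderId = orderId; }
}

const store = new DocumentStore(["http://localhost:8080"], "Shop");
store.initialize();

async function checkout(lines) {
    const session = store.openSession();
    const order = new Order(lines);
    await session.store(order);
    // The command document is the "outbox" entry; it is saved in the
    // same ACID transaction as the order itself.
    await session.store(new OrderShouldBeProcessedCommand(order.id));
    await session.saveChanges();
}
```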

You don’t need complex processes or fragile integration setups to ensure that your modifications and the Kafka messages stay in sync.

Besides commands, you can also create and push events to Kafka. Whenever a document is modified in the database, RavenDB can publish modification events to Kafka, which an Audit service can pick up. Imagine a new employee signing a contract with your company - you will create a NewEmployee event and publish it to the Employees Kafka topic. Multiple services will listen to this topic and react. Unlike queues, where reading a message removes it, Kafka retains messages so multiple consumers can read them. Hence, the services representing Human Resources, Orientation, and Onboarding will note the new employee's arrival and kick off their internal processes.
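Such an event could come from another small transform script on the Employees collection - again a sketch, with an illustrative topic and payload:

```javascript
// Publish a NewEmployee event to the Employees topic for every new document.
loadToEmployees({
    EventType: 'NewEmployee',
    EmployeeId: id(this),
    Name: this.FirstName + ' ' + this.LastName,
    HiredAt: this.HiredAt
});
```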

Finally, you can offload additional logic to Kafka ETL inside RavenDB, moving even more automation into the database. Instead of generating events for all reviews, you can apply logic that reacts to a product rating below a certain level and only then publishes a BadProductReviewNotification. This way, RavenDB can pre-process documents, isolate the ones that require special attention, and deliver them to interested parties via the Kafka event stream.
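Since the transform script is ordinary logic, filtering is just a conditional: documents for which loadTo is never called simply produce no message. A sketch, with an assumed Reviews document shape and rating threshold:

```javascript
// Only reviews below the threshold result in a Kafka message.
if (this.Rating < 3) {
    loadToBadProductReviewNotifications({
        ProductId: this.Product,
        Rating: this.Rating,
        Review: this.Text
    });
}
```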

The Kafka ETL feature allows you to add these sorts of integrations after the fact, without modifying your existing applications. This gives you a lot of freedom to add behavior as you discover the need for it, without code changes or a complete understanding of the entire system. You can just take the documents in the database and have them show up in Kafka, where additional consumers will process them.

Kafka benefits

Conclusion

The architecture we just described is a great example of a system built from loosely coupled components. Each component can emit and receive messages but is unaware of who generated them, or which service or services will react to the messages it produces. Consequently, expanding your system with new functionality is easy, and the design obeys the Open-Closed principle: you can add new functionality in the form of additional services subscribing to specific topics without touching existing services.

Overall, this feature will simplify your developers' lives by reducing integration efforts, and it will also reduce overall implementation costs. None of this comes at the expense of your system's architecture. On the contrary - you will be producing flexible applications with loosely coupled components that can be safely extended with new functionality.

Conclusion
See all RavenDB 5.4 features