re: How Discord supercharges network disks for extreme low latency
I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.
Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!
To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.
At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.
My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?
There are two problems with ephemeral disks. First, they are ephemeral , wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.
Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.
Discord’s solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.
Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.
The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?
Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?
Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.
Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.
Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.
There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.