The tyranny of I/O and the rise of distributed systems

ayende Blog

A system that runs on a single machine is an order of magnitude simpler than one that reside on multiple machines. The complexity involved in maintaining consistency across multiple machines is huge. I have been dealing with this for the past 15 years and I can confidently tell you that no sane person would go for multi machine setup in favor of a single machine if they can get away with it. So what was the root cause for the push toward multiple machines and distributed architecture across the board for the past 20 years? And why are we see a backlash against that today?

You'll hear people talking about the need for high availability and the desire to avoid a single point of failure. And that is true, to a degree. But there are other ways to handle that (primary / secondary model) rather than the full blown multi node setup.

Some users simply have too much data to go around and have to make use of a distributed architecture. If you are gathering a TB / day of data, no single system is going to help you, for example. However, most users aren't there. A growth rate of GB / day (fully retained) is quite respectable and will take over a decade to start becoming a problem on a single machine.

What I think people don't remember so well is that the landscape has changed quite significantly in the past 10 – 15 years. I'm not talking about Moore's law, I'm talking about something that is actually quite a bit more significant. The dramatic boost in speed that we saw in storage.

Here are some numbers from the start of the last decade a top of the line 32GB SCSI drive with 15K RPM could hit 161 IOPS. Looking at something more modern disk with 14 TB will have 938 IOPS. That is a speed increase of over 500%, which is amazing, but not enough to matter. These two numbers are from hard disks. But we have had a major disruption in storage at the start of the millennium. The advent of SSD drives.

It turns out that SSDs aren't nearly as new as one would expect them. They were just horribly expensive. Here are the specs for such a drive around 2003. The cost would be tens of thousands (USD) per drive. To be fair, this was meant to be used in rugged environment (think military tech, missiles and such), but there wasn't much else in the market. In 2003 the first new commodity SSD started to appear, with sizes that topped at 512MB.

All of this is to say, in the early 2000s, if you wanted to store non trivial amount of data, you had to face the fact that you had to deal with hard disks. And you could expect some pretty harsh limitations on the number of IOPS available. And that, in turn, meant that the deciding factor for scale out wasn't really the processing speed. Remember that the C10K problem was still a challenge, but reasonable one, in 1999. That is, handling 10K concurrent connections on a single server (to compare, millions of connections per server isn't out of the ordinary).

Given 10K connections per server, with each one of them needing a single IO per 5 seconds, what would happen? That means that we need to handle 2,000 IOPS. That is over ten times what you can get from a top of the line disk at that time. So even if you had a RAID0 with ten drives and was able to get perfect distribution of IO to drive, you would still be about 20% short. And I don't think you'll want to get 10 drives at RAID0 in production. Given the I/O limits, you could reasonably expect to serve 100 – 300 operations per second per server. And that is assuming that you were able to cache some portion of the data in memory and avoid disk hits. The only way to handle this kind of limitation was to scale out, to have more servers (and more disks) to handle the load.

The rise of commodity SSDs changed the picture significantly and NVMe drives are the icing on the cake. SSD can do tens of thousands of IOPS and NVMe can do hundreds of thousands IOPS (and some exceed the million IOPS with comfortable margin).

Going back to the C10K issue? A 49.99$ drive with 256GB has specs that exceed 90,000 IOPS. Those 2000 IOPS we had to get 10 machines for? That isn't really noticeable at all today. In fact, let's go higher. Let's say we have 50,000 concurrent connections each one issuing an operation once a second. This is a hundred times more work than the previous example. But the environment in which we are running is very different.

Given an operating budget of 150$, I will use the hardware from this article, which is basically a Raspberry PI with SSD drive (and fully 50$ of the budget go for the adapter to plug the SSD to the PI). That gives me the ability to handle 4,412 requests per second using Apache, which isn't the best server in the world. Given that the disk used in the article can handle more than 250,000 IOPS, we can run a lot on a "server" that fits into a lunch box and cost significantly less than the monthly cost of a similar machine on the cloud. This factor totally change the way you would architect your systems.

The much better hardware means that you can select a simpler architecture and avoid a lot of complexities along the way. Although… we need to talk about the cloud, where the old costs are still very much a factor.

Using AWS as the baseline, I can get a 250GB gp2 SSD drive for 25$ / month. That would give me 750 IOPS for 25$. That is nice, I guess, but it puts me at less than what I can get from a modern HDD today. There is the burst capability on the cloud, which can smooth out some spikes, but I'll ignore that for now. Let's say that I wanted to get higher speed, I can increase the disk size (and hence the IOPS) at linear rate. The max I can get from gp2 is 16,000 IOPS at a cost of 533$.  Moving to io1 SSD, we can get 500GB drive with 3,000 IOPS for 257$ per month, and exceeding 20,000 IOPS on a 1TB drive would cost 1,425$.

In contrast, 242$ / month will get us a r5ad.2xlarge machine with 8 cores, 64 GB and 300 GB NVMe drive. A 1,453$ will get us a r5ad.12xlarge with 48 cores, 384 GB and 1.8TB NVMe drive. You are better off upgrading the machine entirely and running on top of the local NVMe drive and handling the persistency yourself than paying the storage costs associated with having it out as a block storage.

This tyranny of I/O costs and performance has had a huge impact on the overall system architecture of many systems. Scale out was not, as usually discussed, a reaction to the limits of handling the number of users. It was a limit on how fast the I/O systems could handle concurrent load. With SSD and NVMe drives, we are in a totally different field and need to consider how that affect our systems.

In many cases, you'll want to have just enough data distribution to ensure high availability. Otherwise, it make more sense to get larger but fewer machines. The reduction in management overhead alone is usually worth it, but the key aspect is reducing the number of moving pieces in your architecture.

NoSQL Database Demo

Live Demo

A customized
presentation of RavenDB