Building emergency pathways in your software (never to be used)
Preconditions, postconditions, and invariants, oh my!
The old adage about Garbage In, Garbage Out is a really important aspect of our profession. If you try to do things that don’t make sense, the output will be nonsensical.
On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
~Charles Babbage – Inventor of the first computer
As you can see, the issue isn’t a new one. And there are many ways to deal with that. You should check your inputs, assume they are hostile, double check on every layer, etc.
Those are the principles of sound programming design, after all.
This post is about a different topic. When everything is running smoothly, you want to reject invalid operations and dangerous actions. The problem is when everything is hosed.
The concept of emergency operations is something that should be a core part of the design, because emergencies happen, and you don’t want to try to carve new paths in emergencies.
Let’s consider a scenario such as when the root certificate has expired, which means that there is no authentication. You cannot authenticate to the servers, because the auth certificate you use has also expired. You need to have physical access, but the data center won’t let you in, since you cannot authenticate.
Surely that is fiction, right? Something very much like it happened to Facebook last year (the cause was a bad network (BGP) configuration, not expired certificates, but the behavior was the same).
An important aspect of good design is to consider what you’ll do in the really bad scenarios. How do you recover from such a scenario?
For complex systems, it’s very easy to get to the point where you have cross dependencies. For example, your auth service relies on the database cluster, which uses the auth service for authentication. If both services are down at the same time, you cannot bring them up.
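Cross dependencies like that are easy to spot once you write the dependencies down. Here is a minimal sketch of doing so, assuming you can describe your services as a simple map from each service to the services it needs at startup. The function name and the service names are made up for illustration, and a real system would need to pull this map from its actual configuration.

```python
from typing import Dict, List, Optional


def find_dependency_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none.

    `deps` maps a service name to the services it needs in order to start.
    This is a plain depth-first search; the map itself is a hypothetical
    description of your system, not something read from a real tool.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNVISITED for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we looped back to it.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in list(deps):
        if state[name] == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    services = {"auth": ["database"], "database": ["auth"]}
    print(find_dependency_cycle(services))
```

Finding the cycle doesn't fix it, of course. It only tells you where you need an emergency path that lets one of the services start without the other.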
Part of the design of good software is building the emergency paths. When the system breaks, do you have a well-defined operation that you can take to recover?
A great example of that is fire doors in buildings. They are usually alarmed and open to the outside world only, preventing their regular use. But in an emergency, they allow the quick evacuation of a building safely, instead of creating a chokepoint.
We recently got into a discussion internally about a particular feature in RavenDB (modifying the database topology). There are various operations that you shouldn’t be able to perform, because they are dangerous. They are also the sort of things that allow you to recover from disaster. We ended up creating two endpoints for this feature. The first includes checks and verification. The second is an admin-only endpoint that is explicitly meant for the “I know what I mean” scenario.
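To make the two-path idea concrete, here is a minimal sketch. This is not RavenDB's actual API: the `Cluster` class, the `min_nodes` rule, and both method names are hypothetical, chosen only to show a checked operation next to an admin-only override that skips the safety check but still requires a reason and leaves an audit trail.

```python
from dataclasses import dataclass, field
from typing import List, Set


class UnsafeTopologyChange(Exception):
    """Raised by the regular path when a change looks dangerous."""


@dataclass
class Cluster:
    # Hypothetical model: a set of node names and a minimum healthy size.
    nodes: Set[str]
    min_nodes: int = 2
    audit_log: List[str] = field(default_factory=list)

    def remove_node(self, node: str) -> None:
        """Regular path: validates the request and refuses dangerous changes."""
        if node not in self.nodes:
            raise KeyError(f"unknown node: {node}")
        if len(self.nodes) - 1 < self.min_nodes:
            raise UnsafeTopologyChange(
                f"removing {node} would leave fewer than {self.min_nodes} nodes"
            )
        self.nodes.remove(node)

    def force_remove_node(self, node: str, *, is_admin: bool, reason: str) -> None:
        """Emergency path: admin-only, skips the safety check, always audited."""
        if not is_admin:
            raise PermissionError("the emergency path is admin-only")
        if not reason.strip():
            raise ValueError("a reason is required for a forced change")
        if node not in self.nodes:
            raise KeyError(f"unknown node: {node}")
        self.audit_log.append(f"FORCED removal of {node}: {reason}")
        self.nodes.remove(node)


if __name__ == "__main__":
    cluster = Cluster(nodes={"A", "B"})
    try:
        cluster.remove_node("A")
    except UnsafeTopologyChange as e:
        print("regular path refused:", e)
    cluster.force_remove_node("A", is_admin=True, reason="node A hardware is dead")
    print(cluster.nodes, cluster.audit_log)
```

The point of keeping the two paths separate is that the regular one can stay strict, while the emergency one is explicit, restricted, and hard to invoke by accident.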
RavenDB actually has quite a few of those scenarios. For example, you can authenticate to RavenDB using a certificate, or if you have root access on the machine, you can use the OS authentication mechanism instead. We had scenarios where users lost their certificates and were able to use the alternative mechanism instead to recover.
Making sure to design those emergency pathways ahead of time means that you get to do that with a calm mind and consider more options. It also means that you get to verify that your emergency mechanism doesn’t hinder normal operations. The alarmed fire door is one example. In the case of RavenDB, another is relying on the operating system permissions as a backup, which applies only if you are already running as the root user on the machine.
Having those procedures ahead of time, documented and verified, ends up being really important at crisis time. You don’t need to stumble in the dark or come up with new ways to do things on the fly. This is especially important since you cannot assume that the usual invariants are in place.
Note that this is something that is very easy to miss. After all, you spend a lot of time designing and building those features, never to use them (hopefully). The answer to that is that you also install sprinklers and fire alarms with the express hope and intent to never use them in practice.
The amusing part of this is that we call this: Making sure things are up to code.
You need to ensure that your product and code are up to code.