Friday, February 13, 2009

Designed for Deployment Reliability

I was reading an interesting article, Designed for Reliability, on The Daily WTF [1]. And then I started reading the comments. I think a lot of people don't get it.

The real story here (at least to me) is that because the system is so reliable, deployment procedures are never written down and enforced on the assumption that it might not be. It's a classic problem, because good deployment procedures are hard. If you can keep things up and running, it seems like you're better off treating the problem as just maintaining the system, but eventually that just plain old fails. It always will. [2]

Ultimately, any system that hasn't had constant redeployment baked in is going to face this exact problem: the machine exploded, OMFG, how do I get it up and running? For one of your larger distributed systems, could you do that particularly easily? More importantly, have you actually tried?

My contention is this: if you don't design your processes such that you always have a reliable deployment story, your super-expensive hardware/system is worthless. Things always, eventually fail. You have to be ready for it.

Annual Power Down

One practice that I think is useful in a variety of ways is the annual power-down. In some areas of London (I don't know whether it's a blanket requirement, but it's happened to me at two different locations, one in the City and one in Westminster Council, so it appears to be widespread), you have to switch off all electrical power to your office block once a year so that the installation can be tested and inspected. And that's absolute: no UPS, no diesel generators, you really do have to switch the machines off.

Luckily for all, this is scheduled. Beforehand, everybody who maintains a system that runs in the affected data centre scrambles to make sure that the stuff they support is likely to get up and running again without them doing any work. Mostly this is because the power's gonna come back on early Sunday morning, and if they want to party the previous night, they'd better make damn sure they're not getting woken up at 9:00 am Sunday morning. Startup scripts are checked, internal consistency of the system is checked, checklists are updated, everybody does their part.
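If you want "startup scripts are checked" to be more than a verbal promise, that check is easy to automate. The sketch below is purely illustrative: the service names, the SysV/Debian-style runlevel layout, and the paths are all assumptions on my part, not anything from a real site.

#!/usr/bin/env python
# Pre-power-down sanity check (illustrative sketch only). Service names,
# runlevel layout, and paths are assumptions, not a real configuration.
import glob
import os
import sys

# Services we expect to come back on their own after the power is restored.
SERVICES = ["marketdata-gateway", "risk-engine", "report-scheduler"]

failures = []
for svc in SERVICES:
    init_script = os.path.join("/etc/init.d", svc)

    # The startup script has to exist and be executable.
    if not os.access(init_script, os.X_OK):
        failures.append("%s: missing or non-executable init script" % svc)

    # It has to be linked into the default runlevel so it starts with no
    # human involved (Debian-style /etc/rc2.d layout assumed here).
    if not glob.glob("/etc/rc2.d/S??%s" % svc):
        failures.append("%s: not enabled in runlevel 2" % svc)

if failures:
    print("NOT ready for the power-down:")
    for f in failures:
        print("  - " + f)
    sys.exit(1)

print("All %d services look ready to start unattended." % len(SERVICES))

Run something like that nightly in the month before the power-down and the Saturday-night scramble gets a lot shorter.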

It still won't work, of course, which is why there are so many people on call and ready to help nudge their particular part of the world back into operation. (If nothing else, hardware that runs full-bore 364 days of the year often picks power-up as the moment to fail, so make sure you've got plenty of spare disks and power supplies at hand.) But the discipline forces you, at least once a year, to make sure that you can get up and running before Monday morning.

Thus if something catastrophic happens, your scripts are on average 6 months behind (and in practice that's the worst case: everybody pays attention to the lessons learned for 3 months after the power down, and everybody starts getting ready 3 months beforehand, so the worst you can do is get caught right in the middle of that well of apathy), but you should still be able to get up and running with a minimum of fuss.

In this case, the annual power down is part of your methodology to ensure that you have a reproducible deployment system.

Can The System Help You?

So let's say you have a system that is designed to make it extremely easy to make runtime changes, the kind of changes that would affect your ability to bring the system back up after a catastrophic failure. What might you do to help fight against this tendency? I think the only solution here is one of two approaches:
  • Run every change through an external system that keeps track of what was done and potentially allows replay (a rough sketch follows below). Note that for this to work, every change, no matter how minor, must go through The System; this essentially prevents developers or support staff from being able to log into the box at all, because everything has to be tracked (though these days you could maybe get around that with some type of logging SSH shell).
  • Have the system self-maintain. In this case, the system maintains in some stable form (a backed-up disk, for example) the exact configuration that it currently has, as well as perhaps the last N changes, so that it can be brought back to a consistent state quickly in the case of catastrophe.
The former has the advantage that the system doesn't have to know what's going on; the latter has the advantage that the system can be written to handle just this situation and not have to log unnecessary trivia.
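Here's roughly what the first approach looks like in practice. This is a minimal sketch, not a real tool: the journal path, the JSON-lines format, and the idea of replaying raw shell commands are all simplifying assumptions.

#!/usr/bin/env python
# Minimal sketch of an external change journal: every change to the box goes
# through this wrapper, which writes the command to an append-only journal
# before executing it, and which can replay the whole journal onto a rebuilt
# machine. The journal path and the raw-shell-command replay are assumptions.
import json
import subprocess
import sys
import time

JOURNAL = "/var/log/change-journal.jsonl"  # append-only; back it up off-box

def record_and_run(cmd):
    entry = {"ts": time.time(), "cmd": cmd}
    with open(JOURNAL, "a") as j:            # write the intent down first...
        j.write(json.dumps(entry) + "\n")
    return subprocess.call(cmd, shell=True)  # ...then actually make the change

def replay():
    with open(JOURNAL) as j:
        for line in j:
            cmd = json.loads(line)["cmd"]
            print("replaying: " + cmd)
            subprocess.check_call(cmd, shell=True)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "replay":
        replay()
    else:
        sys.exit(record_and_run(" ".join(sys.argv[1:])))

Make every change with something like "change apt-get install foo", and rebuilding the box after a catastrophe becomes "change replay".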

The latter is actually easier than you might think:
  • Your VM (in the case of Erlang/OTP) might write out the state of all directly launched processes (as opposed to software launched ones) so that it knows the location and state of all code used to kick off the system;
  • Your container (like OSGi) might silently copy all bundles to a WORM location when loading so that there's an accurate record of what was running at any given point of time;
  • Your OS (ZFS FTW) might snapshot the state of the system any time something interesting happens so that you can go back and recover gracefully.
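To make that last bullet concrete, a deployment script could take a ZFS snapshot every time it changes something interesting. This is only a sketch under my own assumptions: the dataset name is a placeholder, and all it relies on is the standard zfs snapshot command.

#!/usr/bin/env python
# Sketch of the "let the OS remember for you" option: take a ZFS snapshot
# whenever the deployment machinery changes something interesting, so the
# exact on-disk state can be rolled back to later. The dataset name is a
# placeholder, not anything real.
import subprocess
import time

DATASET = "tank/appserver"  # hypothetical dataset holding the deployed system

def snapshot(reason):
    # Produces names like tank/appserver@deploy-20090213-153000-new-pricing
    name = "%s@deploy-%s-%s" % (DATASET,
                                time.strftime("%Y%m%d-%H%M%S"),
                                reason.replace(" ", "-"))
    subprocess.check_call(["zfs", "snapshot", name])
    return name

if __name__ == "__main__":
    print("took " + snapshot("new pricing config"))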

The Cloud/Utility Computing Take

So given that I think most systems are ultimately moving to Utility Computing (in-house or out in teh Cloudz), does that help you or hurt you in this case? Help, majorly.

Here's a good test: If You're Ready To Run On EC2, You're Done. Whether or not you're going to deploy on EC2, surviving the semantics of EC2 means you're ready.

The rationale here is simple: the semantics of EC2 require that your system be able to deploy at any time. Sure, you can log into a machine and poke around with it, but you're designing for N-way scalability, remember? If you poke around like that and don't fork a new AMI with your changes, you're screwed.

Now that means you have to spend a lot of time working on your AMIs to make sure that for any given production iteration you've got a reproducibly deployable set that represents the state of the universe for that iteration. This is a problem, and I know that it's caused some people trying to integrate with EC2 no end of woe, because it adds a lot of troublesome, time-consuming grunt-work that they're not used to and that can't easily be made to go faster. [3]
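The bookkeeping itself doesn't have to be fancy, and it doesn't even need to touch the EC2 APIs. The sketch below is a hypothetical example of pinning down "the state of the universe" for one iteration: record the image id you baked plus a checksum of every artifact and config file, so the iteration can be diffed and reproduced later. The artifact list, iteration name, and image id are all placeholders.

#!/usr/bin/env python
# Sketch of per-iteration deployment bookkeeping: record the image id plus a
# checksum of every artifact and config file, so each iteration has a single
# manifest you can diff and redeploy from. All names here are hypothetical.
import hashlib
import json
import time

ARTIFACTS = [
    "build/app.war",
    "conf/app.properties",
    "conf/logging.xml",
]

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(iteration, image_id):
    manifest = {
        "iteration": iteration,
        "image_id": image_id,  # the AMI (or other image) you just baked
        "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "artifacts": dict((p, sha256(p)) for p in ARTIFACTS),
    }
    out = "deploy-manifest-%s.json" % iteration
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return out

if __name__ == "__main__":
    print(write_manifest("2009-02-iteration-4", "ami-xxxxxxxx"))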

But by going through that effort, you know that your system is always ready to handle any given node getting cycled at any time. You know that you're consistently ready to deploy.

Think of it as the deployment equivalent of a test-infected culture:
  • Testing and Cloud Deployment both impose up-front costs (development costs and configuration costs, respectively)
  • Testing lets you develop faster by making sure you're confident that your system is always working
  • Cloud Deployment lets you support faster by making sure you can always bring the system up consistently in a crisis
See the parallels here? Up-front costs (development and configuration) lead to long-term reduction in those same costs by making the process consistent and reproducible.

So even if you're not going to deploy on anything other than your local hardware in a non-virtualized environment, pretending that you are can still reap some benefits. [4]

Conclusion

Design your systems and processes for reproducible deployment. If you don't, I guarantee you that the moment you discover your system can't come up without 3 days of hacking will be less than 24 hours before a major regulatory-driven deadline.

Footnotes and Tangents

[1] Yes, this is an old article, but I started writing this a long time ago and am only now finishing my writeup. Yes, I have a whole lot of stuff that I think of writing a take on and never get around to finishing.
[2] In the Erlang/OTP community, this is definitely seen as an advantage: OTP makes it really easy to replace existing code in a running VM without taking anything down, so you can conceivably have a particular VM that's running for years. Read this post on Vinoski's blog for some classic Erlang insight.
[3] Systems like ElasticServer (from CohesiveFT) can help a lot here, because you can do your round-trip testing on VMWare on your workstation, and then just dump out the resulting images in AMI format rather than VMWare format when you're done.
[4] This isn't something that I've actually run a proper evaluation on, aside from designing distributed systems that come up and go down hundreds of times at a go for continuous-integration-driven testing. I'd want to run a formal evaluation of the benefits of virtualization configuration as opposed to test-driven rapid configuration to make this more concrete than just a rant.