
Defence in depth

A few things I've written recently, including what the hell, against identity, motte and bailey goals, do and do not, and going meta, have been about failure and how systems break down. I think there's an interesting unifying idea there that's worth going into separately.

I remember reading about NASA's software engineering in Feynman's notes on the Challenger disaster. Unlike the other departments involved, the software team had an incredible resilience to failure. In addition to fairly stringent engineering standards, they would do a series of full external QA tests and dry runs of all their systems, simulating an actual launch. Plenty of teams do dry runs and QA testing, of course, but the difference is that QA failures were considered almost as serious as real failures. That is, if your software failed these tests, it didn't kill anyone, but it probably could have.

At the heart of this is a paradox that speaks to that general problem of failure: you want to catch failures early, before they cause real problems, yet you still want to treat those failures as real despite their lack of consequences. Let's say you have to finish writing an article by a week from now, but to add a bit of failure-resistance you give yourself a deadline two days earlier. That way, if anything goes wrong, you'll still have time to fix it by the real deadline. Sure, you could say you're going to treat your self-imposed deadline as seriously as the actual deadline, but that's kind of definitionally untrue; the whole point of your deadline is that it's not as serious as the real one!

The general principle here is defence in depth: design your system so that individual failures don't take down the whole thing. An individual event would need to hit a miraculous number of failure points at once to cause a complete failure. But that assumes each event is discrete and disconnected from the others, like someone trying to guess all the digits of a combination lock at once. In reality, if smaller failures are ignored or tolerated, you really have one long continuous event, like a combination lock where you can guess one digit at a time. The total difficulty becomes linear in the number of digits when you really wanted it to be exponential.
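To make that arithmetic concrete, here's a toy sketch of my own (a hypothetical Python illustration, not anything from the NASA material): a lock that only confirms the whole combination takes on the order of 10^n guesses, while one that confirms each digit independently takes at most 10×n.

```python
import itertools
import random

DIGITS = "0123456789"

def crack_all_at_once(combo):
    """Every guess must match the whole combination: up to 10**n attempts."""
    for attempts, guess in enumerate(itertools.product(DIGITS, repeat=len(combo)), 1):
        if "".join(guess) == combo:
            return attempts

def crack_digit_by_digit(combo):
    """Each digit can be confirmed on its own: at most 10*n attempts."""
    attempts = 0
    for digit in combo:
        for guess in DIGITS:
            attempts += 1
            if guess == digit:
                break
    return attempts

combo = "".join(random.choice(DIGITS) for _ in range(6))
print(crack_all_at_once(combo))     # on the order of 1,000,000
print(crack_digit_by_digit(combo))  # no more than 60
```

The same arithmetic applies to safety layers: three independent layers that each fail one time in a hundred give combined odds of roughly one in a million, but only if a failed layer gets fixed rather than quietly left broken.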

To get that exponential difficulty, you have to make sure those individual failures aren't allowed to persist. That is much easier said than done. Both Feynman's notes on Challenger and the NASA satellite post-mortem I referenced in going meta revealed this very problem in the organisation: the culture let individual errors accumulate until the defence in depth was broken. But in neither case was the general problem of defence in depth being slowly compromised really addressed.

I see the main issue as proportionality. If you tell someone "the failure of this one individual widget should be treated as seriously as the failure of the entire space shuttle", all that's going to do is destroy the credibility of your safety system. Similarly, setting up your goals in such a way that one minor failure can sink the whole thing is just silly. I think what the NASA software team got right wasn't just that they took their QA testing so seriously, but that they also didn't take it too seriously. Failing QA might mean a serious re-evaluation, but failing the real thing probably means you lose your job.

A significant secondary issue is that the consequences increase enormously as you go meta. A single screw not being in the right place is a very minor failure, and deserves a very minor corrective action. Failing to correct that screw, however, is a much more serious failure. It may look like just another aspect of the minor failure in the screw system, and thus fairly unimportant, but it's really a failure in the defence in depth system, which is the kind of thing that can actually take down a space shuttle. Perhaps that counterintuitive leap from insignificant failure to catastrophic meta-failure is at the heart of a lot of defence in depth failures.

In the absence of any guidance from NASA, I'd suggest the following: set up your system in layers to exploit defence in depth. Make sure the consequence of a failure at each layer is serious, but not so serious that it loses credibility. And make sure that failures in the layering itself are considered extremely serious regardless of their origin, as they have the most potential to take down the system as a whole.