Turing's nightmare

Yesterday I wrote about the effect of transitioning from a two-party (you and your computer) to a three-party (you, your computer, and the agents running on your computer) trust model. I'd like to cover another angle, though, which is that the definition of data has also changed. I've said before that computing is special because it invented an abstract unit of operation. In maths you add numbers together, in economics you add dollars together, in computing you add actions together. We sometimes say "code is data", because the only difference between a computer program and a really long number is how you interpret it.
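
To make that concrete, here's a toy sketch in Python (my choice of language here, nothing canonical about it): the same handful of bytes can be read as a very large number or run as a program, and nothing in the bytes themselves tells you which reading is right.

```python
# The same bytes, read two ways.
program = b'print("hello, world!")'

# Interpreted as data: just a really long number.
print(int.from_bytes(program, "big"))

# Interpreted as code: a program we can run.
exec(program.decode())  # hello, world!
```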

But there is one important difference between code and data: you need a certain amount of complexity before something can be a computer. That bar, called Turing completeness, is surprisingly low, and can be met by weaving machines and water tubes, among other strange examples. However low that bar is, though, it's not zero. An mp3 file is not a computer program, and neither is a jpeg. All code is data, but not all data is code.
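
For a sense of just how low that bar sits, here's a sketch of Rule 110, a one-dimensional cellular automaton that has been proven Turing complete. Each cell looks at itself and its two neighbours, and that's the entire machine:

```python
# Rule 110: each cell's next state depends only on the cell and its
# two neighbours, yet the system as a whole is Turing complete.
RULE = {(1,1,1): 0, (1,1,0): 1, (1,0,1): 1, (1,0,0): 0,
        (0,1,1): 1, (0,1,0): 1, (0,0,1): 1, (0,0,0): 0}

def step(cells):
    padded = [0] + cells + [0]  # treat everything past the edges as 0
    return [RULE[tuple(padded[i - 1:i + 2])]
            for i in range(1, len(padded) - 1)]

cells = [0] * 31 + [1]  # start from a single live cell
for _ in range(16):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```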

This property is very important because of the problem of undecidability. Basically, when you run a program, you don't necessarily know what it's going to do. I don't just mean you personally; it's a fundamental mathematical reality that, in general, you can't prove what a program is going to do without just running it and seeing what happens. Is this program going to stop eventually or keep running forever? Undecidable. Will this program calculate my tax return, or will it secretly compile a dossier of all my activity and send it to the government? Nobody knows!
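
The classic proof is short enough to sketch in code. Suppose someone handed us a perfect halts() oracle (they can't, which is the point); we could then write a program that contradicts whatever the oracle predicts about it:

```python
def halts(program, argument):
    """Hypothetical oracle: True iff program(argument) would halt."""
    ...  # no such function can actually exist

def contrary(program):
    if halts(program, program):
        while True:  # predicted to halt? loop forever instead
            pass
    return           # predicted to loop forever? halt immediately

# Does contrary(contrary) halt? If halts() says yes, it loops forever;
# if halts() says no, it halts immediately. Either answer is wrong, so
# no general-purpose halts() can exist.
```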

I should back down from that last statement a little, because it's important to clarify that this is only true in the general sense. I could hand you a program whose code consists entirely of printf("hello, world!"); and you'd have a pretty good idea of whether it prints "hello, world!" or deletes your hard drive. But the point is that you need to be able to know what every program does, not just the trivial ones. A computer program can be arbitrarily complicated, especially if it's designed to be, and expert human examiners are reliably fooled by even comparatively simple malicious programs.

Since we can't necessarily tell whether code will try to hurt us, we retreat to the next available defense: limiting the consequences. By analogy, you don't necessarily know if a person wants to hurt you, but if they're handcuffed to a chair it's kind of a moot point. A program can intend to spy on you, but if it's running in the computer equivalent of a jail cell, only allowed to access your microphone when it asks the warden nicely, any spying it does is going to be pretty ineffectual. Of course, even the best wardens can be fooled, and it's very hard to make an inescapable prison. Still, it's the best we've got.
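
As a toy illustration of the warden idea (a sketch, nowhere near a production sandbox), here's a Python snippet that will run untrusted arithmetic and refuse everything else, by walking the parsed syntax tree and rejecting any node that isn't on a whitelist:

```python
import ast

# The whitelist: arithmetic only. Everything else gets refused.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def jailed_eval(source):
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise PermissionError(f"denied: {type(node).__name__}")
    # No builtins: even approved code runs in an empty environment.
    return eval(compile(tree, "<jail>", "eval"), {"__builtins__": {}})

print(jailed_eval("2 * (3 + 4)"))  # 14
try:
    jailed_eval("__import__('os').remove('important.txt')")
except PermissionError as err:
    print(err)  # denied: Call
```

Even this tiny jail hints at the problem: the whitelist has to anticipate every trick in advance, and plenty of real sandboxes with far more careful wardens have been escaped anyway.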

But even today in most (non-mobile, non-web) computing environments, programs run relatively unrestricted. Any program on an average Mac or PC can delete all your files, send spam, or steal your passwords and credit card information. The fact that most of them don't is really because of a lack of desire, not a lack of ability. Antivirus programs can catch or mitigate some of these, but in general the problem is, literally, unsolvable; undecidability is law. So we fall back on the next available defense: trust. Only run code written by people you trust, downloaded from somewhere you trust.

Unfortunately, I lied before when I said "not all data is code". What I should have said is "not all data is code... yet". An mp3 is not a program, but what about a Word document? A web page? A DVD? The answer to all three is "yes". Documents, websites and DVDs all contain embedded programming environments designed to allow interactivity of various kinds. And there's no such thing as a little bit of computing; once you have Turing completeness, that's code, no different from something written in Python or C. Each of these code-in-data environments has its own trust model and its own jail, and these are regularly subverted.

But the trend is for more code-in-data, more interactivity and more embedded programming languages. Why have a video when you can have a choose-your-own-adventure video? Why have a table when you can have a rich interactive spreadsheet with formulas and live graphs? Why have a dumb document when you can have something that updates as you read it, shows comments in real time, or, uh, expands? People are coming up with new kinds of expression that are fundamentally active rather than passive, and almost all of them pass that agonisingly low bar for computation.

So what we're left with is Turing's nightmare: hundreds of different mini-computers running thousands of different mini-programs, and all the while we're stuck trying to decide the undecidable: is this interactive video spying on me? Is this game going to delete my data? Is this document secretly trying to convince people it's a Nigerian prince? Sure, we can make up for our fundamental inability to tackle that problem by relying on a combination of trust and restrictions. But every one of those mini-computers starts from scratch, making its own mistakes and increasing the already elephantine vulnerability surface area of the average computer.

This to me seems like the obvious reason that web and mobile software have trounced desktop software. Unlike the programs of old, web and mobile apps are explicitly designed to be untrusted, built around one centrally-administered jail rather than lots of little ones with their own rules. They are really the pioneers of data-as-code, the idea that when someone sends you a funny image, or a thing to look at, they're not sending you some inert piece of paper, they're sending a tiny person into your house who might trash it on their way out.

I would argue that the inevitable end to all of this, and the solution to Turing's nightmare, is to just go all-in. Data-as-code is the future, whether we want it or not. Inert and trustworthy data is not only obsolete and boring, it's also demonstrably a lie. Every passive data format we've invented has eventually mutated into an active one, and in the process ruined everyone's Tuesday. Why not stop fighting it? Let's have executable mp3s, build all our movies out of code, rename not_a_virus.jpg.exe to just not_a_virus.exe and be done with the whole thing.

Once we can fully accept that modern computers are less like a sterile factory and more like a teeming petri dish of skin flora, it'll be easier to focus on building the centralised containment and immunity systems that we need. With a combination of good models for delegating trust to gatekeepers, and more prison architects behind fewer prisons, I think the next generation of data-as-code personal computing looks not just more secure, but more interesting as well.