Merkle versions

I've become a big fan of semantic versioning since its introduction. The central idea is that versions should be well-defined and based on the public API of the project, rather than arbitrary feelings about whether a certain change is major or not. It also recognises the increasingly prominent role of automated systems (dependency management, build systems, CI/testing etc) in software, and that they rely much more than puny humans do on meaningful semantic distinctions between software versions.

But one thing semantic versioning doesn't give you is a way to depend on the exact contents of a package. Although it's considered bad form and forbidden by some package managers, an author could change the contents of a package without changing its version. Worse still, the source you fetch the package from may have been compromised in some way. What would be nice is some way of specifying the exact data you expect to go along with that version.

My proposed solution is to include a hash of that data in the version itself. So instead of 1.2.3 we can have 1.2.3+abcdef123456. That hash would need to be the root of a Merkle tree of some kind, so that it recursively covers the entire directory tree: changing any file, at any depth, changes the root hash. I couldn't find any particular standard for hashing directories, but I suggest git's tree objects as one in fairly widespread use. You can find the git tree hash of a given commit with git show -s --format=%T <commit>.
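To make the "hash of a directory" idea concrete, here is a rough Python sketch of how a git-style tree hash can be computed without git itself. It is a simplification, not a replacement for git: it assumes SHA-1 object ids, treats every file as a regular non-executable file (mode 100644), ignores symlinks and submodules, and works on an in-memory dict rather than a real directory. The function names are my own.

```python
import hashlib


def git_object_hash(kind: str, body: bytes) -> bytes:
    """Hash a git object: SHA-1 over '<kind> <length>\\0' followed by the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def blob_hash(content: bytes) -> bytes:
    """Hash file contents the way git hashes a blob."""
    return git_object_hash("blob", content)


def tree_hash(tree: dict) -> bytes:
    """Hash a nested dict {name: bytes | dict} the way git hashes a tree.

    Assumption: all files are mode 100644. Real git also records
    executables (100755) and symlinks (120000), and never stores
    empty directories.
    """
    entries = []
    for name, node in tree.items():
        if isinstance(node, dict):
            # Directories sort as if their name ended in '/'.
            entries.append((name.encode() + b"/", b"40000", name.encode(), tree_hash(node)))
        else:
            entries.append((name.encode(), b"100644", name.encode(), blob_hash(node)))
    entries.sort(key=lambda entry: entry[0])
    # Each entry is '<mode> <name>\0' followed by the raw 20-byte child hash.
    body = b"".join(mode + b" " + name + b"\0" + digest
                    for _, mode, name, digest in entries)
    return git_object_hash("tree", body)


package = {"README": b"hello world\n", "src": {"main.py": b"print('hi')\n"}}
root = tree_hash(package).hex()
print(f"1.2.3+{root[:12]}")
```

Because each directory's hash is built from the hashes of its children, editing src/main.py changes the hash of src, which in turn changes the root.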

Two interesting things about this idea: firstly, the semver spec already allows a.b.c+hash as a valid version string, so no spec changes are required. Secondly, because the hash can be deterministically calculated from the data itself, you don't actually need package authors to use it for it to be useful! You could simply update your package installer or build system to check your specified Merkle version against the file contents directly, whether or not it appears in the package's actual version number.
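As a sketch of what that installer-side check might look like, the snippet below splits the build metadata off a version string and compares it against a tree hash computed from the fetched files. The function names, the prefix-matching rule and the 12-character minimum are all assumptions of mine; semver itself only says that build metadata follows a + sign.

```python
import string


def split_merkle_version(version: str):
    """Split 'a.b.c+hash' into ('a.b.c', 'hash'); hash is None if absent."""
    core, sep, build = version.partition("+")
    return core, (build if sep else None)


def matches_merkle_version(version: str, actual_tree_hex: str, min_prefix: int = 12) -> bool:
    """Check that the hash in the version is a prefix of the actual tree hash.

    Assumption: abbreviated hashes are accepted as long as they have at
    least `min_prefix` hex digits. Shorter prefixes are rejected because
    they are easier to collide with.
    """
    _, expected = split_merkle_version(version)
    if expected is None:
        return False
    expected = expected.lower()
    if len(expected) < min_prefix:
        return False
    if any(ch not in string.hexdigits for ch in expected):
        return False
    return actual_tree_hex.lower().startswith(expected)


# The empty git tree has a well-known hash, used here as a stand-in
# for whatever the installer computed from the downloaded files.
computed = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
print(matches_merkle_version("1.2.3+4b825dc642cb", computed))
print(matches_merkle_version("1.2.3+abcdef123456", computed))
```

Note that the package's own declared version never enters into it: the installer compares what you asked for against what it actually downloaded.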

It's funny, I never thought of versioning as something that would see much innovation, but I guess on reflection it's just another kind of names vs ids situation. I wonder if there will be a new place for human-centered numbering once it's been evicted from our cold, emotionless version strings.