Etaoin Shrdlu

I wrote about this pretty recently, but today I actually went and did it. To summarise the original idea, I wanted to take the conditional probabilities of adjacent letters that you type and match them with the conditional probabilities of adjacent words. This should make something like a Markov chain gibberish generator, but tuneable based on how random the letters you type are.

The biggest challenge for this was massaging the data. I needed not just the word and letter 2-grams, which I ended up grabbing from Peter Norvig's site, but specifically I needed 2-grams which also included terminal characters for the beginnings of words. Norvig's word data had this, but his letter data didn't, so I built my own from the 1-gram data by just summing the counts against all the first letters. Those sums probably don't compare exactly with the 1-gram data, but since I only needed relative counts I figured it was fine.

For my next trick, I wanted to trim down the word 2-grams substantially because, with only 26 letters, I wasn't going to be able to generate the 27th+ most likely word pairs anyway. Not only that, but I wasn't going to be able to generate any of the words that were only reachable from those words, and so on. Basically, I needed a simple exhaustive search that would go through and eliminate those unnecessary combinations. While I was at it, I added some filtering because the internet is mostly porn. The end result dropped the bigram size from 5MB to under 1MB, while also making it look less like bad erotic fan fiction.

There are a few things that I think could be improved. The main one is a better corpus. The internet is many things, but a repository for meaningful text is not one of them. The Google Books Ngram corpus is probably what I'd use if I had more time and wasn't in a first-world country with third-world internet. Another thing is that I could probably do something a bit more sophisticated with the rankings: the 26th most common word pair is still probably a lot less common than the 26th most common letter pair. What if I mapped the least common to the least common and stretched the distribution instead?

Regardless, I'm pretty happy with the result I got. If you want to play with it, a live version is on my demoserver and the code is on Github. I'll also write a bit more about generating the data (using Racket!) for my next prototype wrapup on Monday.