George Mandis

Better Letter Distributions for Word Games

December 26th, 2018 • ~1,000 words • 6 minute read

Part of the joy of setting up a new computer is reassessing the way I've organized all my projects. There's usually some pruning, reorganization and revisiting of older projects.

One of the projects I ended up revisiting was a simple word game I started several years ago but never did much with. It's basically a shitty version of Boggle, something I'd thrown together with the intention of showing it to the bootcamp students I was leading at the time.

It was fun to play with for a moment, but I quickly realized the way letters were being introduced into the game could probably be improved. This is where my holiday-time procrastination under the guise of productivity began.

How the game worked

The game involved a 5x5 grid of letters. Using these letters you have to spell a word—they don't even have to be connected. When the letters are selected and the word is submitted they disappear from the board. If no word can be spelled you can pass. At the end of every turn, whether you've spelled a word or chosen to pass, a new letter is dropped into an available space on the board. When the board is full and no words can be spelled the game is over.

The letters were introduced completely at random using JavaScript's Math.random() function. The result was not as terrible as you might expect, but it was not consistently playable and was often a little frustrating. To borrow a turn of phrase that mixes my metaphors a bit, the occasional string of 8+ consonants in a row made it feel like the cards were stacked against you.
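For reference, a uniform draw like the one described above might look something like this. It's a minimal sketch rather than the game's actual code; the `randomLetter` name and the injectable `rng` parameter are hypothetical and only there for illustration.

```javascript
// Hypothetical sketch: pick one of the 26 letters with equal probability.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// `rng` defaults to Math.random; swapping in another function makes testing easier.
function randomLetter(rng = Math.random) {
  // Math.random() returns a float in [0, 1), so the index is always 0..25.
  const index = Math.floor(rng() * ALPHABET.length);
  return ALPHABET[index];
}

console.log(randomLetter());
```

With every letter equally likely, each draw has only about a one-in-five chance of being a vowel, which is why long consonant runs show up so often.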

This shortcoming made sense to me. There are only 6 vowels (counting Y) out of the 26 letters in the alphabet, and every word in English (with a few goofy exceptions) requires a vowel. To make the game more fun and playable I would need to introduce letters in a way that assured a reasonable probability of being able to spell an English word using what was in play.

My first attempt at fixing letter distributions

I looked up letter distributions in English and found this Wikipedia page on letter frequency.

"Perfect!" I thought, foolishly.

My approach from here was to look at the frequency with which each letter occurred in English and use that to determine how often the letter should be introduced. Without thinking it through much further, it made some kind of sense to me.

To my initial surprise this approach was actually much worse! Suddenly I was swimming in Es and As all over the board, the only consonants consistently in sight being an occasional T or N.

You can only spell the word NEAT so many times before it stops being neat.

Looking at the distribution table it makes complete sense why this would be the case. The letter A constitutes 8.167% of all letters that appear in English words, according to the table I'd used. The letter Z in contrast appears only 0.074% of the time, meaning we'd expect to see a ratio of about 110:1 for As to Zs. The letter E at 12.702% created an even worse ratio of about 171:1!
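To make that concrete, here's a rough sketch of a frequency-weighted draw plus the ratio arithmetic. It only includes the three percentages quoted above (a real table would list all 26 letters), and `weightedLetter` and `ratioToZ` are hypothetical names, not code from the game.

```javascript
// Only the three percentages quoted above; a full table would cover all 26 letters.
const letterFrequency = { A: 8.167, E: 12.702, Z: 0.074 };

// Hypothetical helper: pick a letter with probability proportional to its weight.
function weightedLetter(weights, rng = Math.random) {
  const entries = Object.entries(weights);
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  let threshold = rng() * total;
  for (const [letter, weight] of entries) {
    threshold -= weight;
    if (threshold < 0) return letter;
  }
  // Floating-point rounding can leave the threshold at ~0; fall back to the last letter.
  return entries[entries.length - 1][0];
}

// How many times more often a letter is expected to show up than Z.
const ratioToZ = (letter) => letterFrequency[letter] / letterFrequency.Z;

console.log(ratioToZ("A").toFixed(1)); // 110.4
console.log(ratioToZ("E").toFixed(1)); // 171.6
console.log(weightedLetter(letterFrequency));
```

Weighting draws this way reproduces the makeup of running text, so the common letters crowd out everything else on a 25-tile board.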

Identifying the problem and fixing it

The biggest problem was how I'd framed the question. What matters isn't how often a given letter appears relative to other letters, which is what my previous approach was addressing. What matters is how many words contain that letter.

As an example, the letter A constitutes 8.167% of all letters that appear in English words. But a much higher percentage of words contain at least one A. Similarly, though the letter Z constitutes a predictably low 0.074% of letters that appear in English words, the percentage of English words that contain a Z is much higher.

Using a text file I'd downloaded years ago containing 173,529 valid Scrabble words, I ran tests to see what percentage of words from this list contained various letters.

(For all the pseudo-code examples below, assume words = 173529)


A = 94264 / words = 0.5432175602
E = 121433 / words = 0.6997850503
I = 102392 / words = 0.5900569934
O = 79663 / words = 0.4590760046
U = 46733 / words = 0.2693094526
N = 84485 / words = 0.4868638671
T = 83631 / words = 0.4819424995
Q = 2541 / words = 0.0146430856
R = 91066 / words = 0.5247883639
S = 104351 / words = 0.6013461727
X = 4607 / words = 0.0265488766
Z = 7079 / words = 0.0407943341
H = 33656 / words = 0.1939502907
Y = 24540 / words = 0.1414172847
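I didn't keep the exact script, but the counting could be sketched roughly like this. The `containmentFractions` helper and the tiny `sample` list are hypothetical stand-ins; the real input would be the full word list, and this assumes every entry contains only letters.

```javascript
// Hypothetical helper: for each letter, the fraction of words that contain it at least once.
function containmentFractions(wordList) {
  const counts = {};
  for (const word of wordList) {
    // A Set collapses repeats, so "PIZZA" counts once for Z, not twice.
    for (const letter of new Set(word.toUpperCase())) {
      counts[letter] = (counts[letter] || 0) + 1;
    }
  }
  const fractions = {};
  for (const [letter, count] of Object.entries(counts)) {
    fractions[letter] = count / wordList.length;
  }
  return fractions;
}

// Tiny stand-in list for illustration only.
const sample = ["NEAT", "PIZZA", "QUIZ", "TREE"];
const fractions = containmentFractions(sample);
console.log(fractions.Z); // 0.5
console.log(fractions.Q); // 0.25
```

The important detail is counting each word once per letter, no matter how many times the letter repeats inside it.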

Aha! Here we can see that the letter A appears in ~54% of the words on our list and the letter E in nearly 70%. The letter Z, on the other hand, appears in about 4% of the words on that list. Getting back to the ratios I mentioned previously, that's a much more usable 13:1 and 17:1 or so. That's still high and we'll fiddle with things shortly, but it's setting up a scenario where the rare letters might actually get seen.

So how do we use this information to better our word games? We use it to develop a ratio of letters relative to one another. To rephrase that in a way that's easier to visualize: if we're filling a bag full of letter tiles like we might in a game of Scrabble, we can use these numbers to determine how many of each letter tile should go in the bag.

My approach was this: I multiplied each fraction by 10 and used a ceiling function, both to round to an integer and to make sure we wouldn't end up with zero tiles for the least frequent letters like Z, X and Q. The result looked something like this:


A = ceil(94264 / words * 10) = 6
E = ceil(121433 / words * 10) = 7
I = ceil(102392 / words * 10) = 6
O = ceil(79663 / words * 10) = 5
U = ceil(46733 / words * 10) = 3
N = ceil(84485 / words * 10) = 5
T = ceil(83631 / words * 10) = 5
Q = ceil(2541 / words * 10) = 1
R = ceil(91066 / words * 10) = 6
S = ceil(104351 / words * 10) = 7
X = ceil(4607 / words * 10) = 1
Z = ceil(7079 / words * 10) = 1
H = ceil(33656 / words * 10) = 2
Y = ceil(24540 / words * 10) = 2
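Turned into JavaScript, the tile math above might look like the sketch below. The word counts are copied from the listing and only cover the letters shown there; `tileCounts`, `buildBag` and `drawLetter` are hypothetical names, and drawing with replacement from a fixed bag is my assumption about how the letters would get dropped onto the board.

```javascript
const TOTAL_WORDS = 173529;

// Words containing each letter, copied from the listing above.
const wordsContaining = {
  A: 94264, E: 121433, I: 102392, O: 79663, U: 46733,
  N: 84485, T: 83631, Q: 2541, R: 91066, S: 104351,
  X: 4607, Z: 7079, H: 33656, Y: 24540,
};

// ceil() rounds up, so even the rarest letter gets at least one tile.
function tileCounts(counts, totalWords, scale = 10) {
  const tiles = {};
  for (const [letter, count] of Object.entries(counts)) {
    tiles[letter] = Math.ceil((count / totalWords) * scale);
  }
  return tiles;
}

// Expand { A: 6, ... } into ["A", "A", "A", "A", "A", "A", ...].
function buildBag(tiles) {
  const bag = [];
  for (const [letter, n] of Object.entries(tiles)) {
    for (let i = 0; i < n; i++) bag.push(letter);
  }
  return bag;
}

// Assumption: draw with replacement, so the bag never runs out.
function drawLetter(bag, rng = Math.random) {
  return bag[Math.floor(rng() * bag.length)];
}

const tiles = tileCounts(wordsContaining, TOTAL_WORDS);
const bag = buildBag(tiles);
console.log(tiles.E, tiles.Z, bag.length); // 7 1 57
console.log(drawLetter(bag));
```

Changing `scale` is one easy knob to fiddle with: a bigger multiplier keeps more of the gap between common and rare letters, while a smaller one flattens it.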

Now we have ratios of 6:1 and 7:1 for A and E relative to Z. This was starting to feel familiar and more in line with word games I'd played. I looked up the letter distributions among popular board games to see how these numbers compared, which led me to this wonderful forum post over at boardgamegeek.com. I was pleased to see that my numbers didn't look too different from a lot of popular word games!

In practice, my word game felt much more playable with this distribution. The occasional consonant and/or vowel streak was no different than the one you might run into while playing a good game of Scrabble, lending just the right amount of challenge blended with luck. I might further tweak the individual letter distributions, but this approach gave me a great baseline to start from.

In conclusion

This process reminded me of a section in a book about competitive Scrabble players called Word Freak. I read it quite a while ago and can't recall the passage exactly, but there's a section where the author looks into the origins of the game. All I remember is that Alfred Mosher Butts, the inventor of Scrabble, tried many iterations of the game, including letter distributions, before landing on the game as we know it today.

His approach to making these choices involved studying the front page of the New York Times to calculate how often individual letters were used, in conjunction with some judgement calls to help gameplay such as limiting the number of "S" tiles to avoid easy pluralizations. I should reread the book to see if there's more information about how he calculated letter usage.

All in all, revisiting my word game was a fun exercise and a reminder that sometimes the challenge isn't solving the problem but rather correctly identifying the problem you're trying to solve!

--

Published on Wednesday, December 26th 2018.