What Letter-Pair Tileset Forms the Most Words?

Saturday, March 23, 2024.

While building a word game, Daniel Feldman ran into a problem that nerdsniped me instantly: what choice of twenty letter-pair tiles generates the most words?

A number of approaches were proposed in the ensuing thread, with some folks even wondering if the problem might be NP-complete. In this post I'll present a greedy algorithm that's linear in the dictionary size and quadratic in the number of possible tiles (the alphabet size squared). I believe this finds an optimal solution, but haven't proven that formally.

Let's first define the problem.

The Problem Definition

There are 26 alphabet letters, and each tile has two letters on it, so that works out to a total of 26 * 26 = 676 possible tiles. We only get to choose a meager 20 of these 676 to form our tileset. Like in Scrabble, you can then rearrange subsets of the tileset to form dictionary words. The problem: find the tileset of size 20 that lets you form the most dictionary words.
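
To get a feel for the size of the search space, here's a minimal sketch, separate from the scripts later in this post, that enumerates the possible tiles and counts the distinct 20-tile tilesets. The names ALL_TILES and TILESET_SIZE are made up for this illustration.

```python
from itertools import product
from math import comb
from string import ascii_lowercase

TILESET_SIZE = 20  # illustrative name; the scripts below call this NTILES

# every ordered pair of letters is a distinct tile: "ab" and "ba" differ.
ALL_TILES = [a + b for a, b in product(ascii_lowercase, repeat=2)]

# number of ways to choose an unordered set of 20 distinct tiles.
n_tilesets = comb(len(ALL_TILES), TILESET_SIZE)

print(len(ALL_TILES))        # 676
print(len(str(n_tilesets)))  # how many decimal digits the count has
```

The count is large enough that checking every tileset against the dictionary is not a realistic option, which is why the rest of the post looks for something cheaper.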

A Much Smaller Example Tileset

Here are all possible tiles, with a specific size-3 tileset ed,pi,ti highlighted:

The ed,pi,ti tileset generates these 6 words:

pi
pied
pitied
ti
tied
tipi
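
As a sanity check on the rules, here's a small sketch of what "formable" means under the problem definition: the word splits into consecutive two-letter tiles, no tile is used twice, and every tile is in the tileset. The helper names split_pairs and is_formable are hypothetical, not taken from the scripts below.

```python
import re

def split_pairs(word):
    """Split a word into consecutive two-letter tiles."""
    return re.findall(r'.{2}', word)

def is_formable(word, tileset):
    """True if the word can be spelled using each tile at most once."""
    pairs = split_pairs(word)
    return (
        len(word) >= 2
        and len(word) % 2 == 0             # odd-length words can't be tiled
        and len(pairs) == len(set(pairs))  # no tile may be used twice
        and set(pairs) <= set(tileset)     # every tile must be in the tileset
    )

tileset = {'ed', 'pi', 'ti'}
for word in ['pi', 'pied', 'pitied', 'ti', 'tied', 'tipi', 'pipi', 'edit']:
    print(word, is_formable(word, tileset))
```

The six words listed above all pass. The made-up string pipi fails because it needs the pi tile twice, and edit fails because it isn't in the tileset.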

Note that throughout this post I'll be using american-english under /usr/share/dict/ as the dictionary.

Initial Approach: Maximum Letter Pair Frequency

Before we get into the final greedy approach, let's try something more straightforward.

When Alfred Butts designed the Scrabble tileset, he looked at the front page of the New York Times and hand-tabulated letter frequencies. He then added more copies of frequently occurring letters.

While our problem is different in a couple of ways (unlike Scrabble, duplicate tiles aren't allowed, and also unlike Scrabble, we can only hope to generate a small fraction of all dictionary words), this approach feels intuitively promising.

We'll iterate through the dictionary, split each word into letter-pairs, and count pair occurrences. But note:

Words of odd length get thrown out, because they can't be formed by a sequence of pairs.
Words that use the same letter pair twice also get thrown out, since, per the problem definition, our tileset doesn't contain repeats.

Among the words that remain, we'll pick the most frequently occurring pairs as our tileset.

Here's the Python code:

from collections import defaultdict
import sys
import re

args = sys.argv[1:]
NTILES = int(args.pop(0))
RE_VALID_WORD = re.compile(r'^([a-z][a-z]){1,%s}$' % NTILES)
RE_REPEATED_PAIR = re.compile(r'''
  ^(..)*
  (?P<letter1>.)(?P<letter2>.)
  (?P=letter1)(?P=letter2)
''', re.X)

freqs = defaultdict(int)
words = []

for word in open(args.pop(0)) if args else sys.stdin:
  word = word.rstrip('\n')

  # discard words with odd length, capitals, apostrophes.
  if not RE_VALID_WORD.match(word):
    continue

  # split the word into letter pairs.
  pairs = re.findall(r'.{2}', word)

  # discard words with repeated pairs.
  if RE_REPEATED_PAIR.match(''.join(sorted(pairs))):
    continue

  # add to valid word list. update letter pair statistics.
  words.append(word)
  for pair in pairs:
    freqs[pair] += 1

# our tileset is the top N most frequently occurring pairs.
metric = lambda pair: freqs[pair]
tileset = set(sorted(freqs.keys(), key=metric)[-NTILES:])

# report the tileset.
print(','.join(sorted(tileset)))

# report all formable dictionary words.
for word in words:
  if not set(re.findall(r'.{2}', word)).difference(tileset):
    print(word)
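
The repeated-pair check in the script deserves a closer look. It sorts the pairs so that duplicates end up adjacent, then relies on the leading ^(..)* to keep the match aligned to pair boundaries. The sketch below reuses the same pattern and compares it with a Counter-based check. The function names are hypothetical, and the Counter version is an alternative I'm assuming is equivalent, not what the script does.

```python
import re
from collections import Counter

# the same pattern the script uses: an aligned pair followed by itself.
RE_REPEATED_PAIR = re.compile(r'''
  ^(..)*
  (?P<letter1>.)(?P<letter2>.)
  (?P=letter1)(?P=letter2)
''', re.X)

def has_repeat_regex(pairs):
    # sorting puts equal pairs next to each other; ^(..)* keeps the match
    # aligned to pair boundaries.
    return bool(RE_REPEATED_PAIR.match(''.join(sorted(pairs))))

def has_repeat_counter(pairs):
    # hypothetical alternative: count each pair, look for a count above 1.
    return any(count > 1 for count in Counter(pairs).values())

for pairs in (['pi', 'ti', 'ed'], ['ti', 'pi', 'ti'], ['aa', 'ba', 'bb']):
    print(pairs, has_repeat_regex(pairs), has_repeat_counter(pairs))
```

The last case shows why the alignment matters: joined together, aa,ba,bb reads aababb, which contains abab at an odd offset even though no tile repeats.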

You'll notice the program takes the tileset size, NTILES, as a command line argument. We can use this to run a sanity check on much smaller tilesets. When we do, we immediately see a problem:

./lettergen2 3 /usr/share/dict/american-english
ed,er,es
es

It's no surprise that ed,er,es occur most frequently, since they're common English word endings. However, word endings by themselves don't play nice together. They rely on word beginnings and middles to form complete words. And we already know from the small example above that ed,pi,ti generates 6 words. Generating 1 word, es, is suboptimal.

This maximum letter pair frequency approach was a greedy algorithm, and its failure makes one despair of finding any optimal greedy (read: simple) algorithm. To construct the optimal tileset from scratch, perhaps you need to perform a search on a graph of successively longer words, so you're passing from word beginnings to word middles to word endings? I pursued this approach myself before giving up. Ruminate long enough, and your thoughts may even turn to dark subjects like the set cover problem, which is NP-complete.

That level of despair didn't sit right with me though. Our problem isn't the same as the set cover problem, which seeks a set-of-sets that union together to form a bigger set. Let's accept a set-theoretic framing for a moment to see why.

Thinking in Terms of Sets

Suppose we're working with 676 sets, one for each letter pair. Each set contains all the words a particular letter pair occurs in. I'll denote these word sets W_aa, W_ab, … W_zz.

The tileset we seek is a subset of these 676 word sets. But it isn't a set-of-sets we get to union together like in the set cover problem. Consider: just because our tileset includes W_ed doesn't mean we can form the word need – our tileset must also contain W_ne for that. The "and" logic here feels more like set intersection – W_ne ∩ W_ed – than set union.

But set intersection isn't the right operation either. For example, just because our tileset contains W_ne and W_ed, and both contain the word needle, doesn't mean we can actually form the word needle. We'd also need W_le for that.

So it looks like constructing the list of formable words isn't a basic set operation over the tileset.

However, hidden in the negative space here, there is a basic set operation at work. We've seen that if our tileset doesn't include W_ne and doesn't contain W_ed, we have no hope of forming need, needle, nerd, edit or any of the words in either word set. When we omit W_ne and W_ed from the tileset, we lose all the words in the union W_ne ∪ W_ed.

Another name for "the set of all word sets omitted from our tileset" is the tileset's complement. The unformable words are the union of those omitted word sets. So what we're really seeking is a tileset whose complement has minimal union size. Now we're dealing with basic set operations!
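
Here's a minimal sketch of that complement view on a toy word list (six words chosen for illustration, not the real dictionary). The dict word_sets stands in for W_aa … W_zz, restricted to pairs that actually occur.

```python
import re

# toy word list for illustration only (not the real dictionary)
WORDS = ['need', 'needle', 'nerd', 'edit', 'pied', 'tied']

# build the word sets: word_sets['ne'] plays the role of W_ne, and so on.
word_sets = {}
for word in WORDS:
    for pair in re.findall(r'.{2}', word):
        word_sets.setdefault(pair, set()).add(word)

tileset = {'ed', 'pi', 'ti'}

# the complement: every pair that occurs in some word but isn't in the tileset.
omitted = set(word_sets) - tileset

# unformable words are the union of the omitted pairs' word sets.
unformable = set().union(*(word_sets[p] for p in omitted))
formable = set(WORDS) - unformable

print(sorted(omitted))     # ['it', 'le', 'ne', 'rd']
print(sorted(unformable))  # ['edit', 'need', 'needle', 'nerd']
print(sorted(formable))    # ['pied', 'tied']
```

Note that need is lost even though ed is in the tileset: one omitted pair is enough to knock a word out.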

At this point it occurred to me to try a new greedy approach, but working backwards this time. Instead of constructing the tileset from scratch by greedily picking the next best tile to add, what if we started with the full size 676 tileset, and greedily picked the least-bad tile to remove, until only 20 tiles remained?

Final Approach: Subtractive Minimum Damage

Without further ado, here's Python code for this new approach:

from collections import defaultdict
import sys
import re

args = sys.argv[1:]
NTILES = int(args.pop(0))
RE_VALID_WORD = re.compile(r'^([a-z][a-z]){1,%s}$' % NTILES)
RE_REPEATED_PAIR = re.compile(r'''
  ^(..)*
  (?P<letter1>.)(?P<letter2>.)
  (?P=letter1)(?P=letter2)
''', re.X)

words = defaultdict(set)

for word in open(args.pop(0)) if args else sys.stdin:
  word = word.rstrip('\n')

  # discard words with odd length, capitals, apostrophes.
  if not RE_VALID_WORD.match(word):
    continue

  # split the word into letter pairs.
  pairs = re.findall(r'.{2}', word)

  # discard words with repeated pairs.
  if RE_REPEATED_PAIR.match(''.join(sorted(pairs))):
    continue

  # add the word to the wordsets for its letter pairs.
  for pair in pairs:
    words[pair].add(word)

# work backwards from 676 to NTILES.
# remove the least-damaging letter pair at each step.
damage = lambda pair: len(words[pair])
while len(words) > NTILES:
  least_damaging_pair = min(words.keys(), key=damage)
  lost_words = words.pop(least_damaging_pair)
  for wordset in words.values():
    wordset.difference_update(lost_words)

# report the tileset.
print(','.join(sorted(words.keys())))

# report all formable dictionary words.
for word in sorted(set().union(*words.values())):
  print(word)

As you can see, the setup is substantially the same. This time, instead of counting appearances of each letter pair, we're populating the W_aa … W_zz word sets for the 676 letter pairs. If a word uses a letter pair, it appears in that pair's word set.

This gives us a direct and O(1) way of measuring the damage incurred by removing a letter pair: it is simply the size of its word set.

After the dictionary has been scanned, all letter pairs are in play, and we can start greedy removal. At each step we pick the letter pair that inflicts the least amount of damage in terms of formable words. Note that any word longer than 2 letters will appear in multiple word sets, and we have to remember to remove these extra copies of every "lost" word. Python's set data type is doing a fair bit of work here.

Results

You can use acg/lettergen to compare the results of the two approaches, which I'm calling maxfreq and subtractive. Simply type make and wait a few seconds:

make
running maxfreq/results1 ...
running maxfreq/results2 ...
running subtractive/results1 ...
running subtractive/results2 ...

maxfreq/results1: 12392 words
a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,y

maxfreq/results2: 172 words
al,at,co,de,ed,en,er,es,in,le,li,ly,ng,on,re,ri,rs,st,te,ti

subtractive/results1: 12392 words
a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,y

subtractive/results2: 292 words
ar,ca,co,de,di,ed,er,es,in,li,ng,nt,ra,re,ri,si,st,te,ti,ve

For completeness, I wrote Python scripts that handle the 1-letter tile case (lettergen1). There are only 26 tiles to pick the 20 from, and you can see that both approaches arrive at the same result of 12,392 formable words.

The 2-letter tile case (lettergen2) is another story. Maximum Letter Pair Frequency comes up with a tileset that generates 172 words, but Subtractive Minimum Damage does substantially better by finding a 292-word-generating tileset – a 1.7x improvement. You'll find the full lists of formable words at */results2.

So according to Subtractive Minimum Damage, the optimal tileset is the following:

ar,ca,co,de,di,ed,er,es,in,li,ng,nt,ra,re,ri,si,st,te,ti,ve

It's also interesting to experiment with different tileset sizes. For instance, try make -B NTILES=100. You'll notice as NTILES gets larger, the maxfreq approach converges on the subtractive approach. This makes sense: they should agree for NTILES=676 because there are no letter pair decisions to make. And in fact they should agree even earlier than that, since English doesn't use all possible letter pairs.

Open Questions & Further Research

Yes, but is Subtractive Minimum Damage optimal? The answer is I don't know! I vaguely remember proving greedy optimality once in undergrad computer science, but that was two decades ago. Pointers welcome.
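
One cheap way to probe the question is to compare the greedy result against exhaustive search on an instance small enough to enumerate. The sketch below does that for a nine-word toy list and 3 tiles. All function names are hypothetical, and agreement on a toy input says nothing about optimality in general.

```python
import re
from itertools import combinations

def build_word_sets(words):
    """Map each letter pair to the set of valid words that use it."""
    word_sets = {}
    for word in words:
        pairs = re.findall(r'.{2}', word)
        if len(word) % 2 or len(set(pairs)) != len(pairs):
            continue  # odd length or repeated pair: not formable at all
        for pair in pairs:
            word_sets.setdefault(pair, set()).add(word)
    return word_sets

def count_formable(word_sets, tileset):
    """Count words whose pairs all belong to the tileset."""
    tileset = set(tileset)
    all_words = set().union(*word_sets.values())
    return sum(1 for w in all_words
               if set(re.findall(r'.{2}', w)) <= tileset)

def greedy_subtractive(word_sets, ntiles):
    """Repeatedly drop the pair whose removal loses the fewest words."""
    sets = {pair: set(ws) for pair, ws in word_sets.items()}
    while len(sets) > ntiles:
        pair = min(sets, key=lambda p: len(sets[p]))
        lost = sets.pop(pair)
        for ws in sets.values():
            ws -= lost
    return set(sets)

def brute_force(word_sets, ntiles):
    """Try every tileset of the given size; feasible only for tiny inputs."""
    return set(max(combinations(sorted(word_sets), ntiles),
                   key=lambda ts: count_formable(word_sets, ts)))

# toy word list for illustration only
toy = ['pi', 'pied', 'pitied', 'ti', 'tied', 'tipi', 'need', 'nerd', 'edit']
ws = build_word_sets(toy)
greedy, best = greedy_subtractive(ws, 3), brute_force(ws, 3)
print(sorted(greedy), count_formable(ws, greedy))
print(sorted(best), count_formable(ws, best))
```

On this toy list both searches land on a 6-word tileset. A disagreement on some other small input would be a concrete counterexample to study.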

What if there's a tie for least damaging letter pair? If there's no path dependence here, you should be able to pick either one at random, and the greedy subtractive approach should still arrive at an optimal solution. To explore this idea, I decided to pick from the top 2 least damaging tiles at random. Then I ran the script thousands of times. To my surprise, it did find a couple of tilesets that produced 293 and 294 words – slightly better than the thought-to-be-optimal tileset! A revolting development. But this gap (1-2 words) is suspiciously small, and I'm just gonna go to press with what I've got.
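
The randomized variant isn't shown in this post, so the following is only a guess at its shape: at each step, choose uniformly among the top_k least damaging pairs instead of always taking the minimum. The function name, the top_k parameter, and the toy word list are all assumptions made for illustration.

```python
import random
import re

def randomized_subtractive(word_sets, ntiles, top_k=2, rng=random):
    """Greedy removal, but pick randomly among the top_k least damaging pairs."""
    sets = {pair: set(ws) for pair, ws in word_sets.items()}
    while len(sets) > ntiles:
        candidates = sorted(sets, key=lambda p: len(sets[p]))[:top_k]
        pair = rng.choice(candidates)
        lost = sets.pop(pair)
        for ws in sets.values():
            ws.difference_update(lost)
    return sets

# toy word list for illustration only; none of these repeat a pair
toy = ['pi', 'pied', 'pitied', 'ti', 'tied', 'tipi', 'need', 'nerd', 'edit']
word_sets = {}
for word in toy:
    for pair in re.findall(r'.{2}', word):
        word_sets.setdefault(pair, set()).add(word)

# run several seeded trials and keep the tileset that forms the most words
best = None
for seed in range(20):
    result = randomized_subtractive(word_sets, 3, rng=random.Random(seed))
    formable = set().union(*result.values())
    if best is None or len(formable) > len(best[1]):
        best = (sorted(result), sorted(formable))

print(best)
```

With top_k=1 this reduces to the deterministic removal loop, which makes it easy to compare the two on the same input.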

What happens if the tileset can have repeats? I haven't thought about this too deeply, but it seems like it would spell trouble for a greedy approach, which can no longer make stepwise progress towards optimal subproblems.

What about words of odd length? Yeah it's awkward we have to exclude those, and it makes even less sense when you look at what motivated this problem (Daniel's Ambigame). One approach would be to pad all odd-lettered dictionary words with a trailing period, and then add a., b., c., and so on to the possible letter tiles. A similar trick with leading padding might let you split words after the 1st, 3rd, 5th etc character instead of always splitting at even indexes.
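
A minimal sketch of the padding idea, assuming '.' as the pad character and a hypothetical helper named split_padded. It only pads odd-length words, and it leaves open how the extra a., b., … tiles should count toward the tileset size.

```python
def split_padded(word, lead=False):
    """Split a word into two-character tiles, padding odd lengths with '.'."""
    if len(word) % 2:
        word = '.' + word if lead else word + '.'
    return [word[i:i + 2] for i in range(0, len(word), 2)]

print(split_padded('nerds'))             # ['ne', 'rd', 's.']
print(split_padded('nerds', lead=True))  # ['.n', 'er', 'ds']
print(split_padded('need'))              # ['ne', 'ed']
```

With leading padding, the splits fall after the 1st and 3rd characters of nerds, which is the alternate alignment mentioned above.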

Is this a known problem? I mean surely Knuth solved this like 50 years ago? I found many related problems, but not this specific one. I worry this means it's considered too easy / too obvious, and I should feel embarrassed for writing a whole blog post about it. Anyway, please reach out if you know.

Is Scrabble's tileset optimal? I dabble in Scrabble myself, and I'd always heard it wasn't. In researching this, I learned that Peter Norvig has calculated more accurate English letter frequencies than Alfred Butts's. Norvig has a couple of proposals for a better Scrabble tileset at the link. TL;DR no.

The Formable Word List

I buried the lede to avoid a wall of text. Here's the complete list of 292 formable words found by Subtractive Minimum Damage:

arcade, ardent, ares, arranged, arranger, arranges, arrant, arrest, arrested, arrive, arteries, artier, artist, artistes, calico, calicoes, cant, canted, canter, cantered, care, career, careered, caries, caring, casing, cast, caster, castes, castling, castrate, castrating, catering, cave, code, coding, coed, coin, coined, congestive, contesting, contraries, contrast, contrasted, contrite, contrive, core, coring, cosier, cosies, cost, costar, costarring, costed, costlier, cote, coteries, cove, covering, coveting, dear, dearer, decant, decanted, decanter, decoding, decorate, decorating, decorative, dedicate, dedicating, deed, deer, deli, delicate, deliveries, delivering, dent, dented, dentin, deranged, deranges, deriding, derisive, derive, desire, desiring, desist, desisted, destined, destines, detentes, detest, detested, died, dies, ding, dinged, dint, dire, direst, disinter, disinterring, dive, divest, divested, eddies, errant, erring, es, in, indeed, indelicate, indent, indented, indicate, indicating, indicative, inside, insist, insisted, intent, interest, interested, invent, invented, invest, invested, liar, lied, lies, linger, lingered, lint, lira, lire, list, listed, lite, literati, live, liveried, liveries, livest, rain, rained, rang, ranged, ranger, ranges, rant, ranted, ranter, rare, rarest, raring, rarities, raster, rate, rating, rave, raveling, re, rear, reared, rearranged, rearranges, recant, recanted, recast, recoveries, recovering, redecorate, redecorating, rededicate, rededicating, reed, rein, reindeer, reined, reinvent, reinvented, reinvest, reinvested, relied, relies, relive, rent, rented, renter, reside, resident, residing, resist, resisted, resister, rest, restarting, rested, restrain, restrained, retiring, reveling, revenged, revenges, reveries, revering, ride, riding, riling, ring, ringed, ringer, ringside, rising, rite, riveting, side, siding, sierra, silica, silicate, sing, singed, singer, singes, sire, siring, sister, site, siting, star, 
stared, stares, starling, starrier, starring, starting, starve, sterling, stinting, strain, strained, strainer, stranger, stride, strident, striding, string, stringed, stringer, strive, tear, teared, teed, tees, tent, tented, test, tested, tester, testes, ti, tide, tidied, tidier, tidies, tiding, tied, tier, ties, tiling, ting, tinged, tinges, tint, tinted, tirade, tire, tiredest, tiring, veer, veered, vein, veined, vent, vented, verier, verities, vest, vested, vestries

Posted by Alan on Saturday, March 23, 2024.