Foldable Words | bit-player

Packing up the household for a recent move, I was delving into shoeboxes, photo albums, and file folders that had not been opened in decades. One of my discoveries, found in an envelope at the back of a file drawer, was the paper sleeve from a drinking straw, imprinted with a saccharine message:

Drinking-straw wrapper inscribed “It’s A Pleasure To Serve You”

This flimsy slip of paper seems like an odd scrap to preserve for the ages, but when I pulled it out of the envelope, I knew instantly where it came from and why I had saved it.

The year was 1967. I was 17 then; I’m 71 now. Transposing those two digits takes just a flick of the fingertips. I can blithely skip back and forth from one prime number to the other. But the span of lived time between 1967 and 2021 is a chasm I cannot so easily leap across. At 17 I was in a great hurry to grow up, but I couldn’t see as far as 71; I didn’t even try. Going the other way—revisiting the mental and emotional life of an adolescent boy—is also a journey deep into alien territory. But the straw wrapper helps—it’s a Proustian aide memoire.

In the spring of 1967 I had a girlfriend, Lynn. After school we would meet at the Maple Diner, where the booths had red leatherette upholstery and formica tabletops with a boomerang motif. We’d order two Cokes and a plate of french fries to share. The waitress liked us; she’d make sure we had a full bottle of ketchup. I mention the ketchup because it was a token of our progress toward intimacy. On our first dates Lynn had put only a dainty dab on her fries, but by April we were comfortable enough to reveal our true appetites.

One afternoon I noticed she was fiddling intently with the wrapper from her straw, folding and refolding. I had no idea what she was up to. A teeny paper airplane she would sail over my head? When she finished, she pushed her creation across the table:

“It’s a Pleasure to Serve You” folded to read “I love You”

What a wallop there was in that little wad of paper. At that point in our romance, the words had not yet been spoken aloud.

How did I respond to Lynn’s folded declaration? I can’t remember; the words are lost. But evidently I got through that awkward moment without doing any permanent damage. A year later Lynn and I were married.

Today, at 71, with the preserved artifact in front of me, my chief regret is that I failed to take up the challenge implicit in the word game Lynn had invented. Why didn’t I craft a reply by folding my own straw wrapper? There are quite a few messages I could have extracted by strategic deletions from “It’s a pleasure to serve you.”

          itsapleasuretoserveyou   ==>   I love you.

          itsapleasuretoserveyou   ==>   I please you.

          itsapleasuretoserveyou   ==>   I tease you.

          itsapleasuretoserveyou   ==>   I pleasure you.

          itsapleasuretoserveyou   ==>   I pester you.

          itsapleasuretoserveyou   ==>   I peeve you.

          itsapleasuretoserveyou   ==>   I salute you.

          itsapleasuretoserveyou   ==>   I leave you.

Not all of those statements would have been suited to the occasion of our rendezvous at the Maple Diner, but over the course of our years together—17 years, as it turned out—there came a moment for each of them.

How many words can we form by making folds in the straw-paper slogan? I could not have answered that question in 1967. I couldn’t have even asked it. But times change. Enumerating all the foldable messages now strikes me as an obvious thing to do when presented with the straw wrapper. Furthermore, I have the computational means to do it—although the project was not quite as easy as I expected.

A first step is to be explicit about the rules of the game. We are given a source text, in this case “It’s a pleasure to serve you.” Let us ignore the spaces between words as well as all punctuation and capitalization; in this way we arrive at the normalized text “itsapleasuretoserveyou”. A word is foldable if all of its letters appear in the normalized text in the correct order (though not necessarily consecutively). The folding operation amounts to an editing process in which our only permitted act is deletion of letters; we are not allowed to insert, substitute, or permute. If two or more foldable words are to be combined to make a phrase or sentence, they must follow one another in the correct order without overlaps.
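The procedures later in this article call a helper named normalize, whose definition is not shown. Here is a minimal sketch of what it might look like, assuming that "normalizing" means lowercasing the text and discarding everything except the 26 unaccented letters:

```python
def normalize(text):
    """Lowercase the text and keep only the letters a-z (assumed behavior)."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")

normtext = normalize("It's a pleasure to serve you.")
print(normtext)        # itsapleasuretoserveyou
print(len(normtext))   # 22
```

Under this assumption, accented letters would be dropped rather than converted, which makes no difference for the straw-wrapper slogan.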

So much for foldability. Next comes the fraught question: What is a word? Linguists and lexicographers offer many subtly divergent opinions on this point, but for present purposes a very simple definition will suffice: A finite sequence of characters drawn from the 26-letter English alphabet is a word if it can legally be played in a game of Scrabble. I have been working with a word list from the 2015 edition of Collins Scrabble Words, which has about 270,000 entries. (There are a number of alternative lists, which I discuss in an appendix at the end of this article.)

Scrabble words range in length from 2 to 15 letters. The upper limit—determined by the size of the game board—is not much of a concern. You’re unlikely to meet a straw-paper text that folds to yield words longer than sesquipedalian. The absence of 1-letter words is more troubling, but the remedy is easy: I simply added the words a, I, and O to my copy of the Scrabble list.

My first computational experiments with foldable words searched for examples at random. Writing a program for random sampling is often easier than taking an exact census of a population, and the sample offers a quick glimpse of typical results. The following Python procedure generates random foldable sequences of letters drawn from a given source text, then returns those sequences that are found in the Scrabble word list. (The parameter k is the length of the words to be generated, and reps specifies the number of random trials.)

import random

def randomFoldableWords(text, lexicon, k, reps):
    normtext = normalize(text)
    n = len(normtext)
    findings = []
    for i in range(reps):
        # choose k distinct positions, then restore left-to-right order
        indices = sorted(random.sample(range(n), k))
        # read off the letters at those positions
        letters = "".join(normtext[idx] for idx in indices)
        if letters in lexicon:
            findings.append(letters)
    return findings

Here are the six-letter foldable words found by invoking the program as follows, with text set to the straw-wrapper slogan: randomFoldableWords(text, scrabblewords, 6, 10000).

please, plater, searer, saeter, parter, sleety, sleeve, parser, purvey, laster, islets, taster, tester, slarts, paseos, tapers, saeter, eatery, salute, tsetse, setose, salues, sparer

Note that the word saeter (you could look it up—I had to) appears twice in this list. The frequency of such repetitions can yield an estimate of the total population size. A variant of the mark-and-recapture method, well-known in wildlife ecology, led me to an estimate of 92 six-letter foldable Scrabble words in the straw-wrapper slogan. The actual number turns out to be 106.

Samples and estimates are helpful, but they leave me wondering, What am I missing? What strange and beautiful word has failed to turn up in any of the samples, like the big fish that never takes the bait? I had to have an exhaustive list.

In many word games, the tool of choice for computer-aided playing (or cheating) is the regular expression, or regex. A regex is a pattern defining a set of strings, or character sequences; from a collection of strings, a regex search will pick out those that match the pattern. For example, the regular expression ^.*love.*$ selects from the Scrabble word list all words that have the letter sequence love somewhere within them. There are 137 such words, including some that I would not have thought of, such as rollover and slovenly. The regex ^.*l.*o.*v.*e.*$ finds all words in which l, o, v, and e appear in sequence, whether or not they are adjacent. The set has 267 members, including such secret-lover gems as bloviate, electropositive, and leftovers.
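The two patterns above can be tried out with Python's re module. The sketch below uses a tiny made-up word list in place of the Scrabble lexicon. It also shows one possible way (an assumption, not a method described in this article) to turn the idea around: build a pattern from a candidate word and search for it in the normalized text. The name foldablePattern is hypothetical.

```python
import re

# Tiny stand-in word list; the article uses the full Scrabble lexicon.
words = ["rollover", "slovenly", "bloviate", "leftovers", "pleasure", "vole"]

adjacent = re.compile(r"^.*love.*$")         # l, o, v, e side by side
inOrder = re.compile(r"^.*l.*o.*v.*e.*$")    # l, o, v, e in sequence, gaps allowed

adjacentHits = [w for w in words if adjacent.match(w)]
inOrderHits = [w for w in words if inOrder.match(w)]

def foldablePattern(word):
    """Build a regex that finds the letters of word, in order, inside a text."""
    return re.compile(".*".join(re.escape(ch) for ch in word))

normtext = "itsapleasuretoserveyou"
print(adjacentHits)    # ['rollover', 'slovenly']
print(inOrderHits)     # ['rollover', 'slovenly', 'bloviate', 'leftovers']
print(bool(foldablePattern("love").search(normtext)))     # True
print(bool(foldablePattern("tastes").search(normtext)))   # False
```

Compiling a fresh pattern for every candidate word is simple but probably not the fastest route for a lexicon of 270,000 entries.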

A solution to the foldable words problem could surely be crafted with regular expressions, but I am not a regex wizard. In search of a more muggles-friendly strategy, my first thought was to extend the idea behind the random-sampling procedure. Instead of selecting foldable sequences at random, I’d generate all of them, and check each one against the word list.

The procedure below generates all three-letter strings that can be folded from the given text, and returns the subset of those strings that appear in the Scrabble word list:

def foldableStrings3(lexicon, text):
    normtext = normalize(text)
    n = len(normtext)
    words = []
    for i in range(0, n-2):
        for j in range(i+1, n-1):
            for k in range(j+1, n):
                s = normtext[i] + normtext[j] + normtext[k]
                if s in lexicon:
                    words.append(s)
    return(words)

At the heart of the procedure are three nested loops that methodically step through all the foldable combinations: For any initial letter text[i] we can choose any following letter text[j] with j > i; likewise text[j] can be followed by any text[k] with k > j. This scheme works perfectly well, finding 348 instances of three-letter words. I speak of “instances” because some words appear in the list more than once; for example, pee can be formed in six ways (the p at position 4 followed by any two of the four e’s that come after it). If we count only unique words, there are 137.

Following this model, we could write a separate routine for each word length from 1 to 15 letters, but that looks like a dreary and repetitious task. Nobody wants to write a procedure with loops nested 15 deep. An alternative is to write a meta-procedure, which would generate the appropriate procedure for each word length. I made a start on that exercise in advanced loopology, but before I got very far I realized there’s an easier way. I was wondering: In a text of n letters, how many foldable substrings exist—whether or not they are recognizable words? There are several ways of answering this question, but to me the most illuminating argument comes from an inclusion/exclusion principle. Consider the first letter of the text, which in our case is the letter I. In the set of all foldable strings, half include this letter and half exclude it. The same is true of the second letter, and the third, and so on. Thus each letter added to the text doubles the number of foldable strings, which means the total number of strings is simply $2^n$. (Included in this count is the empty string, made up of no letters.)
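The doubling argument can be checked numerically for the 22-letter slogan. This short sketch counts the ways of choosing k letter positions for each k and confirms that the counts add up to $2^n$. Note that it counts selections of positions, not distinct strings; two different selections can spell the same string.

```python
from math import comb

n = 22  # letters in "itsapleasuretoserveyou"

# Number of ways to pick k letter positions, for each k from 0 to n.
selectionsByLength = [comb(n, k) for k in range(n + 1)]

print(selectionsByLength[3])     # 1540 triples visited by three nested loops
print(sum(selectionsByLength))   # 4194304, the empty selection included
print(2 ** n)                    # 4194304
```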

This observation suggests a simple algorithm for generating all the foldable strings in any n-letter text. Just count from $0$ to $2^{n} - 1$, and for each value along the way line up the binary representation of the number with the letters of the text. Then select those letters that correspond to a 1 bit, like so:

                    itsapleasuretoserveyou
                    0000100000110011111000

And so we see that the word preserve corresponds to the binary representation of the number 134392.

Counting is something that computers are good at, so a word-search procedure based on this principle is straightforward:

def foldablesByCounting(lexicon, text):
    normtext = normalize(text)
    n = len(normtext)
    words = []
    for i in range(2**n):          # i runs from 0 through 2**n - 1
        charSeq = ''
        positions = positionsOf1Bits(i, n)
        for p in positions:
            charSeq += normtext[p]
        if charSeq in lexicon:
            words.append(charSeq)
    return(words)

The outer loop (variable i) counts from $0$ to $2^{n} - 1$; for each of these numbers the inner loop (variable p) picks out the letters corresponding to 1 bits. The program produces the output expected. Unfortunately, it does so very slowly. For every character added to the text, running time roughly doubles. I haven’t the patience to plod through the $2^{22}$ patterns in “itsapleasuretoserveyou”; estimates based on shorter phrases suggest the running time would be more than three hours.
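The helper positionsOf1Bits is called in the procedure above but not defined there. The following is one possible sketch, assuming the leftmost of the n binary digits lines up with position 0 of the text, as in the preserve example:

```python
def positionsOf1Bits(i, n):
    """Positions (0 = leftmost of n binary digits) where i has a 1 bit."""
    bits = format(i, "0{}b".format(n))   # zero-padded to n binary digits
    return [p for p, bit in enumerate(bits) if bit == "1"]

normtext = "itsapleasuretoserveyou"
positions = positionsOf1Bits(134392, len(normtext))
print(positions)                                 # [4, 10, 11, 14, 15, 16, 17, 18]
print("".join(normtext[p] for p in positions))   # preserve
```

Building a string of binary digits for every number is slow; shifting and masking would be quicker, but neither choice changes the exponential growth in running time.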

In the middle of the night I realized my approach to this problem was totally backwards. Instead of blindly generating all possible character strings and filtering out the few genuine words, I could march through the list of Scrabble words and test each of them to see if it’s foldable. At worst I would have to try some 270,000 words. I could speed things up even more by making a preliminary pass through the Scrabble list, discarding all words that include characters not present in the normalized text. For the text “It’s a pleasure to serve you,” the character set has just 12 members: aeiloprstuvy. Allowing only words formed from these letters slashes the Scrabble list down to a length of 12,816.
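The preliminary pass might be sketched with Python sets, as below. The function name filterLexicon and the five-word sample are made up for illustration; the real input would be the full Scrabble list.

```python
def filterLexicon(lexicon, normtext):
    """Keep only the words whose letters all occur somewhere in normtext."""
    allowed = set(normtext)
    return [word for word in lexicon if set(word) <= allowed]

normtext = "itsapleasuretoserveyou"
sample = ["taste", "tested", "love", "zebra", "survey"]   # made-up mini lexicon

print("".join(sorted(set(normtext))))    # aeiloprstuvy
print(filterLexicon(sample, normtext))   # ['taste', 'love', 'survey']
```

The filter ignores letter order and letter counts, so a word that survives it still has to pass the foldability test.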

To make this algorithm work, we need a procedure to report whether or not a word can be formed by folding the given text. The simplest approach is to slide the candidate word along the text, looking for a match for each character in turn:

                    taste
                    itsapleasuretoserveyou

                     taste
                    itsapleasuretoserveyou

                     t aste
                    itsapleasuretoserveyou

                     t a    ste
                    itsapleasuretoserveyou

                     t a    s   te
                    itsapleasuretoserveyou

                     t a    s   t  e
                    itsapleasuretoserveyou

If every letter of the word finds a mate in the text, the word is foldable, as in the case of taste, shown above. But an attempt to match tastes would fall off the end of the text looking for the final s: after the e at position 15 there is no s left to pair it with.

The following code implements this idea:

def wordIsFoldable(word, text):
    normtext = normalize(text)
    t = 0                      # pointer to positions in normtext
    w = 0                      # pointer to positions in word
    while t < len(normtext):
        if word[w] == normtext[t]:  # matching chars in word and text
            w += 1                  # move to next char in word
        if w == len(word):          # matched all chars in word
            return(True)            # so: thumbs up
        t += 1                 # move to next char in text
    return(False)              # fell off the end: thumbs down

All we need to do now is embed this procedure in a loop that steps through all the candidate Scrabble words, collecting those for which wordIsFoldable returns True.
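Here is a compact sketch of that loop. To keep the example self-contained it swaps in a different but equivalent subsequence test based on a Python iterator, rather than reusing wordIsFoldable; the names isSubsequence and foldableWords, and the sample word list, are hypothetical.

```python
def isSubsequence(word, normtext):
    """True if the letters of word occur in normtext in order (gaps allowed)."""
    remaining = iter(normtext)
    # "ch in remaining" consumes the iterator up to and including the match,
    # so each letter is sought only after the previous letter's match.
    return all(ch in remaining for ch in word)

def foldableWords(lexicon, normtext):
    """Collect the words of the lexicon that can be folded from normtext."""
    return [word for word in lexicon if isSubsequence(word, normtext)]

normtext = "itsapleasuretoserveyou"
sample = ["taste", "tastes", "love", "preserve", "tapestry", "pleasures"]
print(foldableWords(sample, normtext))
# ['taste', 'love', 'preserve', 'tapestry', 'pleasures']
```

Normalizing the text once, outside the loop, avoids repeating that work for every candidate word.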

There’s still some waste motion here, since we are searching letter-by-letter through the same text, and repeating the same searches thousands of times. The source code (available on GitHub as a Jupyter notebook) explains some further speedups. But even the simple version shown here runs in less than two tenths of a second, so there’s not much point in optimizing.
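As an illustration of the kind of speedup that is possible (not necessarily one of those in the notebook), the text can be indexed once, recording where each letter occurs. Each letter of a candidate word is then located by binary search instead of a letter-by-letter scan. The names buildIndex and isFoldableIndexed are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def buildIndex(normtext):
    """Map each letter to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, ch in enumerate(normtext):
        index[ch].append(pos)
    return index

def isFoldableIndexed(word, index):
    """Greedy match: jump to each letter's first occurrence after the previous match."""
    prev = -1
    for ch in word:
        positions = index.get(ch, [])
        k = bisect_right(positions, prev)   # first occurrence strictly after prev
        if k == len(positions):
            return False                    # no occurrence left: not foldable
        prev = positions[k]
    return True

index = buildIndex("itsapleasuretoserveyou")
print(isFoldableIndexed("taste", index))    # True
print(isFoldableIndexed("tastes", index))   # False
```

For a 22-letter text the gain is likely to be negligible; an index of this kind matters more when the source text is long.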

I can now report that there are 778 unique foldable Scrabble words in “It’s a pleasure to serve you” (including the three one-letter words I added to the list). Words that can be formed in multiple ways bring the total count to 899.

And so we come to the tah-dah! moment—the unveiling of the complete list. I have organized the words into groups based on each word’s starting position within the text. (By Python convention, the positions are numbered from 0 through $n-1$.) Within each group, the words are sorted according to the position of their last character; that position is given in the subscript following the word. For example, tapestry is in Group 1 because it begins at position 1 in the text (the t in It’s), and it carries the subscript 19 because it ends at position 19 (the y in you).

This arrangement of the words is meant to aid in constructing multiword phrases. If a word ends at position $m$, the next word in the phrase must come from a group numbered $m+1$ or greater.
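The group and subscript for a word can be computed with a sketch like the one below. It assumes that each occurrence of the word's first letter is tried as a starting position and that the remaining letters are matched as early as possible; this appears consistent with the listing (tee, for example, shows up in Group 1 with subscript 11 and in Group 12 with subscript 18). The function name placements is hypothetical, and the word is assumed to be non-empty.

```python
def placements(word, normtext):
    """For each possible starting position, greedily match the rest of word.
    Returns a list of (start, end) index pairs."""
    found = []
    for start, ch in enumerate(normtext):
        if ch != word[0]:
            continue
        pos, w = start, 1
        while w < len(word):
            pos = normtext.find(word[w], pos + 1)   # next occurrence after pos
            if pos == -1:
                break                               # ran out of text
            w += 1
        else:
            found.append((start, pos))              # every letter was matched
    return found

normtext = "itsapleasuretoserveyou"
print(placements("tapestry", normtext))   # [(1, 19)]
print(placements("tee", normtext))        # [(1, 11), (12, 18)]
```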

Group 0: i₀ it₁ is₂ its₂ ita₃ isle₆ ilea₇ isles₈ itas₈ ire₁₁ issue₁₁ iure₁₁ islet₁₂ io₁₃ iso₁₃ ileus₁₄ ios₁₄ ires₁₄ islets₁₄ isos₁₄ issues₁₄ issuer₁₆ ivy₁₉

Group 1: ta₃ tap₄ tae₆ tale₆ tape₆ te₆ tala₇ talea₇ tapa₇ tea₇ taes₈ talas₈ tales₈ tapas₈ tapes₈ taps₈ tas₈ teas₈ tes₈ tapu₉ tau₉ talar₁₀ taler₁₀ taper₁₀ tar₁₀ tear₁₀ tsar₁₀ taleae₁₁ tare₁₁ tease₁₁ tee₁₁ tapet₁₂ tart₁₂ tat₁₂ taut₁₂ teat₁₂ test₁₂ tet₁₂ tret₁₂ tut₁₂ tao₁₃ taro₁₃ to₁₃ talars₁₄ talers₁₄ talus₁₄ taos₁₄ tapers₁₄ tapets₁₄ tapus₁₄ tares₁₄ taros₁₄ tars₁₄ tarts₁₄ tass₁₄ tats₁₄ taus₁₄ tauts₁₄ tears₁₄ teases₁₄ teats₁₄ tees₁₄ teres₁₄ terts₁₄ tests₁₄ tets₁₄ tres₁₄ trets₁₄ tsars₁₄ tuts₁₄ tasse₁₅ taste₁₅ tate₁₅ terete₁₅ terse₁₅ teste₁₅ tete₁₅ toe₁₅ tose₁₅ tree₁₅ tsetse₁₅ taperer₁₆ tapster₁₆ tarter₁₆ taser₁₆ taster₁₆ tater₁₆ tauter₁₆ tearer₁₆ teaser₁₆ teer₁₆ teeter₁₆ terser₁₆ tester₁₆ tor₁₆ tutor₁₆ tav₁₇ tarre₁₈ testee₁₈ tore₁₈ trove₁₈ tutee₁₈ tapestry₁₉ tapstry₁₉ tarry₁₉ tarty₁₉ tasty₁₉ tay₁₉ teary₁₉ terry₁₉ testy₁₉ toey₁₉ tory₁₉ toy₁₉ trey₁₉ troy₁₉ try₁₉ too₂₀ toro₂₀ toyo₂₀ tatou₂₁ tatu₂₁ tutu₂₁

Group 2: sap₄ sal₅ sae₆ sale₆ sea₇ spa₇ sales₈ sals₈ saps₈ seas₈ spas₈ sau₉ sar₁₀ sear₁₀ ser₁₀ slur₁₀ spar₁₀ spear₁₀ spur₁₀ sur₁₀ salse₁₁ salue₁₁ seare₁₁ sease₁₁ seasure₁₁ see₁₁ sere₁₁ sese₁₁ slae₁₁ slee₁₁ slue₁₁ spae₁₁ spare₁₁ spue₁₁ sue₁₁ sure₁₁ salet₁₂ salt₁₂ sat₁₂ saut₁₂ seat₁₂ set₁₂ slart₁₂ slat₁₂ sleet₁₂ slut₁₂ spart₁₂ spat₁₂ speat₁₂ spet₁₂ splat₁₂ spurt₁₂ st₁₂ suet₁₂ salto₁₃ so₁₃ salets₁₄ salses₁₄ saltos₁₄ salts₁₄ salues₁₄ sapless₁₄ saros₁₄ sars₁₄ sass₁₄ sauts₁₄ sears₁₄ seases₁₄ seasures₁₄ seats₁₄ sees₁₄ seres₁₄ sers₁₄ sess₁₄ sets₁₄ slaes₁₄ slarts₁₄ slats₁₄ sleets₁₄ slues₁₄ slurs₁₄ sluts₁₄ sos₁₄ spaes₁₄ spares₁₄ spars₁₄ sparts₁₄ spats₁₄ spears₁₄ speats₁₄ speos₁₄ spets₁₄ splats₁₄ spues₁₄ spurs₁₄ spurts₁₄ sues₁₄ suets₁₄ sures₁₄ sus₁₄ salute₁₅ saree₁₅ sasse₁₅ sate₁₅ saute₁₅ setose₁₅ slate₁₅ sloe₁₅ sluse₁₅ sparse₁₅ spate₁₅ sperse₁₅ spree₁₅ saeter₁₆ salter₁₆ saluter₁₆ sapor₁₆ sartor₁₆ saser₁₆ searer₁₆ seater₁₆ seer₁₆ serer₁₆ serr₁₆ slater₁₆ sleer₁₆ spaer₁₆ sparer₁₆ sparser₁₆ spearer₁₆ speer₁₆ spuer₁₆ spurter₁₆ suer₁₆ surer₁₆ sutor₁₆ sav₁₇ sov₁₇ salve₁₈ save₁₈ serre₁₈ serve₁₈ slave₁₈ sleave₁₈ sleeve₁₈ slove₁₈ sore₁₈ sparre₁₈ sperre₁₈ splore₁₈ spore₁₈ stere₁₈ sterve₁₈ store₁₈ stove₁₈ salary₁₉ salty₁₉ sassy₁₉ saury₁₉ savey₁₉ say₁₉ serry₁₉ sesey₁₉ sey₁₉ slatey₁₉ slaty₁₉ slavey₁₉ slay₁₉ sleety₁₉ sley₁₉ slurry₁₉ sly₁₉ soy₁₉ sparry₁₉ spay₁₉ speary₁₉ splay₁₉ spry₁₉ spurrey₁₉ spurry₁₉ spy₁₉ stey₁₉ storey₁₉ story₁₉ sty₁₉ suety₁₉ surety₁₉ surrey₁₉ survey₁₉ salvo₂₀ servo₂₀ stereo₂₀ sou₂₁ susu₂₁

Group 3: a₃ al₅ ae₆ ale₆ ape₆ aa₇ ala₇ aas₈ alas₈ ales₈ als₈ apes₈ as₈ alu₉ alar₁₀ aper₁₀ ar₁₀ alae₁₁ alee₁₁ alure₁₁ apse₁₁ are₁₁ aue₁₁ alert₁₂ alt₁₂ apart₁₂ apert₁₂ apt₁₂ aret₁₂ art₁₂ at₁₂ aero₁₃ also₁₃ alto₁₃ apo₁₃ apso₁₃ auto₁₃ aeros₁₄ alerts₁₄ altos₁₄ alts₁₄ alures₁₄ alus₁₄ apers₁₄ apos₁₄ apres₁₄ apses₁₄ apsos₁₄ apts₁₄ ares₁₄ arets₁₄ ars₁₄ arts₁₄ ass₁₄ ats₁₄ aures₁₄ autos₁₄ alate₁₅ aloe₁₅ arete₁₅ arose₁₅ arse₁₅ ate₁₅ alastor₁₆ alerter₁₆ alter₁₆ apter₁₆ aster₁₆ arere₁₈ ave₁₈ aery₁₉ alary₁₉ alay₁₉ aleatory₁₉ apay₁₉ apery₁₉ arsey₁₉ arsy₁₉ artery₁₉ artsy₁₉ arty₁₉ ary₁₉ ay₁₉ aloo₂₀ arvo₂₀ avo₂₀ ayu₂₁

Group 4: pe₆ pa₇ pea₇ plea₇ pas₈ peas₈ pes₈ pleas₈ plu₉ par₁₀ pear₁₀ per₁₀ pur₁₀ pare₁₁ pase₁₁ peare₁₁ pease₁₁ pee₁₁ pere₁₁ please₁₁ pleasure₁₁ plue₁₁ pre₁₁ pure₁₁ part₁₂ past₁₂ pat₁₂ peart₁₂ peat₁₂ pert₁₂ pest₁₂ pet₁₂ plast₁₂ plat₁₂ pleat₁₂ pst₁₂ put₁₂ pareo₁₃ paseo₁₃ peso₁₃ pesto₁₃ po₁₃ pro₁₃ pareos₁₄ pares₁₄ pars₁₄ parts₁₄ paseos₁₄ pases₁₄ pass₁₄ pasts₁₄ pats₁₄ peares₁₄ pears₁₄ peases₁₄ peats₁₄ pees₁₄ peres₁₄ perts₁₄ pesos₁₄ pestos₁₄ pests₁₄ pets₁₄ plats₁₄ pleases₁₄ pleasures₁₄ pleats₁₄ plues₁₄ plus₁₄ pos₁₄ pros₁₄ pures₁₄ purs₁₄ pus₁₄ puts₁₄ parse₁₅ passe₁₅ paste₁₅ pate₁₅ pause₁₅ perse₁₅ plaste₁₅ plate₁₅ pose₁₅ pree₁₅ prese₁₅ prose₁₅ puree₁₅ purse₁₅ parer₁₆ parr₁₆ parser₁₆ parter₁₆ passer₁₆ paster₁₆ pastor₁₆ pater₁₆ pauser₁₆ pearter₁₆ peer₁₆ perter₁₆ pester₁₆ peter₁₆ plaster₁₆ plater₁₆ pleaser₁₆ pleasurer₁₆ pleater₁₆ poser₁₆ pretor₁₆ proser₁₆ puer₁₆ purer₁₆ purr₁₆ purser₁₆ parev₁₇ pav₁₇ perv₁₇ pareve₁₈ parore₁₈ parve₁₈ passee₁₈ pave₁₈ peeve₁₈ perve₁₈ petre₁₈ pore₁₈ preeve₁₈ preserve₁₈ preve₁₈ prore₁₈ prove₁₈ parry₁₉ party₁₉ pastry₁₉ pasty₁₉ patsy₁₉ paty₁₉ pay₁₉ peatery₁₉ peaty₁₉ peavey₁₉ peavy₁₉ peeoy₁₉ peery₁₉ perry₁₉ pervy₁₉ pesty₁₉ plastery₁₉ platy₁₉ play₁₉ ploy₁₉ plurry₁₉ ply₁₉ pory₁₉ posey₁₉ posy₁₉ prey₁₉ prosy₁₉ pry₁₉ pursy₁₉ purty₁₉ purvey₁₉ puy₁₉ parvo₂₀ poo₂₀ proo₂₀ proso₂₀ pareu₂₁ patu₂₁ poyou₂₁

Group 5: la₇ lea₇ las₈ leas₈ les₈ leu₉ lar₁₀ lear₁₀ lur₁₀ lare₁₁ lase₁₁ leare₁₁ lease₁₁ leasure₁₁ lee₁₁ lere₁₁ lure₁₁ last₁₂ lat₁₂ least₁₂ leat₁₂ leet₁₂ lest₁₂ let₁₂ lo₁₃ lares₁₄ lars₁₄ lases₁₄ lass₁₄ lasts₁₄ lats₁₄ leares₁₄ lears₁₄ leases₁₄ leasts₁₄ leasures₁₄ leats₁₄ lees₁₄ leets₁₄ leres₁₄ leses₁₄ less₁₄ lests₁₄ lets₁₄ los₁₄ lues₁₄ lures₁₄ lurs₁₄ laree₁₅ late₁₅ leese₁₅ lose₁₅ lute₁₅ laer₁₆ laser₁₆ laster₁₆ later₁₆ leaser₁₆ leer₁₆ lesser₁₆ lor₁₆ loser₁₆ lurer₁₆ luser₁₆ luter₁₆ lav₁₇ lev₁₇ luv₁₇ lave₁₈ leave₁₈ lessee₁₈ leve₁₈ lore₁₈ love₁₈ lurve₁₈ lay₁₉ leary₁₉ leavy₁₉ leery₁₉ levy₁₉ ley₁₉ lory₁₉ lovey₁₉ loy₁₉ lurry₁₉ laevo₂₀ lasso₂₀ levo₂₀ loo₂₀ lassu₂₁ latu₂₁ lou₂₁

Group 6: ea₇ eas₈ es₈ eau₉ ear₁₀ er₁₀ ease₁₁ ee₁₁ ere₁₁ east₁₂ eat₁₂ est₁₂ et₁₂ euro₁₃ ears₁₄ eases₁₄ easts₁₄ eats₁₄ eaus₁₄ eres₁₄ eros₁₄ ers₁₄ eses₁₄ ess₁₄ ests₁₄ euros₁₄ erose₁₅ esse₁₅ easer₁₆ easter₁₆ eater₁₆ err₁₆ ester₁₆ erev₁₇ eave₁₈ eve₁₈ easy₁₉ eatery₁₉ eery₁₉ estro₂₀ evo₂₀

Group 7: a₇ as₈ ar₁₀ ae₁₁ are₁₁ aue₁₁ aret₁₂ art₁₂ at₁₂ auto₁₃ ares₁₄ arets₁₄ ars₁₄ arts₁₄ ass₁₄ ats₁₄ aures₁₄ autos₁₄ arete₁₅ arose₁₅ arse₁₅ ate₁₅ aster₁₆ arere₁₈ ave₁₈ aery₁₉ arsey₁₉ arsy₁₉ artery₁₉ artsy₁₉ arty₁₉ ary₁₉ ay₁₉ aero₂₀ arvo₂₀ avo₂₀ ayu₂₁

Group 8: sur₁₀ sue₁₁ sure₁₁ set₁₂ st₁₂ suet₁₂ so₁₃ sets₁₄ sos₁₄ sues₁₄ suets₁₄ sures₁₄ sus₁₄ see₁₅ sese₁₅ setose₁₅ seer₁₆ ser₁₆ suer₁₆ surer₁₆ sutor₁₆ sov₁₇ sere₁₈ serve₁₈ sore₁₈ stere₁₈ sterve₁₈ store₁₈ stove₁₈ sesey₁₉ sey₁₉ soy₁₉ stey₁₉ storey₁₉ story₁₉ sty₁₉ suety₁₉ surety₁₉ surrey₁₉ survey₁₉ servo₂₀ stereo₂₀ sou₂₁ susu₂₁

Group 9: ur₁₀ ure₁₁ ut₁₂ ures₁₄ us₁₄ uts₁₄ use₁₅ ute₁₅ ureter₁₆ user₁₆ uey₁₉ utu₂₁

Group 10: re₁₁ ret₁₂ reo₁₃ reos₁₄ res₁₄ rets₁₄ ree₁₅ rete₁₅ roe₁₅ rose₁₅ rev₁₇ reeve₁₈ resee₁₈ reserve₁₈ retore₁₈ rore₁₈ rove₁₈ retry₁₉ rory₁₉ rosery₁₉ rosy₁₉ retro₂₀ roo₂₀

Group 11: et₁₂ es₁₄ ee₁₅ er₁₆ ere₁₈ eve₁₈ eery₁₉ evo₂₀

Group 12: to₁₃ te₁₅ toe₁₅ tose₁₅ tor₁₆ tee₁₈ tore₁₈ toey₁₉ tory₁₉ toy₁₉ trey₁₉ try₁₉ too₂₀ toro₂₀ toyo₂₀

Group 13: o₁₃ os₁₄ oe₁₅ ose₁₅ or₁₆ ore₁₈ oy₁₉ oo₂₀ ou₂₁

Group 14: ser₁₆ see₁₈ sere₁₈ serve₁₈ sey₁₉ servo₂₀ so₂₀ sou₂₁

Group 15: er₁₆ ee₁₈ ere₁₈ eve₁₈ evo₂₀

Group 16: re₁₈ reo₂₀

Group 17:

Group 18:

Group 19: yo₂₀ you₂₁ yu₂₁

Group 20: o₂₀ ou₂₁

Group 21:

Naturally, I’ve tried out the code on a few other well-known phrases.

If Lynn and I had met at a different dining establishment, she might have found a straw with the statement, “It takes two hands to handle a Whopper.” There’s quite a diverse assortment of possible messages lurking in this text, with 1,154 unique foldable words and almost 2,000 word instances. Perhaps she would have chosen the upbeat “Inhale hope.” Or, in a darker mood, “I taste woe.”

If we had been folding dollar bills instead of straw wrappers, “In God We Trust” might have become the forward-looking proclamation, “I go west!” Horace Greeley’s marching order on the same theme, “Go west, young man,” gives us the enigmatic “O, wet yoga!” or, perhaps more aptly, “Gunman.”

Jumping forward from 1967 to 2021—from the Summer of Love to the Winter of COVID—I can turn “Wear a mask. Wash your hands.” into the plaintive, “We ask: Why us?” With “Maintain social distance,” the best I can do is “A nasal dance” or “A sad stance.”

And then there’s “Make America Great Again.” It yields “Meme rage.” Also “Make me ragtag.”

Appendix: The Word-List Problem.

In a project like this one, you might think that getting a suitable list of English words would be the easy part. In fact it seems to be the main trouble spot.

The Scrabble lexicon I’ve been relying on derives from a word list known as SOWPODS, compiled by two associations of Scrabble players starting in the 1980s. Current editions of the list are distributed by a commercial publisher, Collins Dictionaries. If I understand correctly, all versions of the list are subject to copyright (see discussion on Stack Exchange) and cannot legally be distributed without permission. But no one seems to be much bothered by that fact. Copies of the lists in plain-text format, with one word per line, are easy to find on the internet—and not just on dodgy sites that specialize in pirated material.

There are alternative lists without legal encumbrances. Indeed, there’s a good chance you already have one such list pre-installed on your computer. A file called words is included in most distributions of the Unix operating system, including MacOS; my copy of the file lives in /usr/share/dict/words. If you don’t have or can’t find the Unix words file, I suggest downloading the Natural Language Toolkit, a suite of data files and Python programs that includes a lexicon almost identical to Unix words, as well as many other linguistic resources.

The Scrabble list has one big advantage over words: It includes plurals and inflected forms of verbs—not just test but also tests, tested, and testing. [Bad example; see comments below.] The words file is more like a list of dictionary head words, with only the stem form explicitly included. On the other hand, words has an abundance of names and other proper nouns, as well as abbreviations, which are excluded from the Scrabble list since they are not legal plays in the board game.

How about combining the two word lists? Their union has just under 400,000 entries—quite a large lexicon. Using this augmented list for the analysis of “It’s a pleasure to serve you,” my program finds an additional 219 foldable words, beyond the 778 found with the Scrabble list alone. Here they are:

aaru aer aerose aes alares alaster alea alerse aleut alo alose alur aly ao apa apar aperu apus aro arry aru ase asor asse ast astor atry aueto aurore aus ausu aute e eastre eer erse esere estre eu ey iao ie ila islay ist isuret itala itea iter ito iyo l laet lao larry larve lastre lasty latro laur leo ler lester lete leto loro lu lue luo lut luteo lutose ly oer ory ovey p parsee parto passo pastose pato pau paut pavo pavy peasy perty peru pess peste pete peto petr plass platery pluto poe poy presee pretry pu purre purry puru r reve ro roer roey roy s sa saa salar salat salay saltee saltery salvy sao sapa saple sapo sare sart saur sauty sauve se seary seave seavy seesee sero sert sesuto sla slare slav slete sloo sluer soe sory soso spary spass spave spleet splet splurt spor spret sprose sput ssu stero steve stre strey stu sueve suto sutu suu t taa taar tal talao talose taluto tapeats tapete taplet tapuyo tarr tarse tartro tarve tasser tasu taur tave tavy teaer teaey teart teasy teaty teave teet teety tereu tess testor toru torve tosy tou treey tsere tst tu tue tur turr turse tute tutory u uro urs uru usee v vu y

Many of the proper nouns in this list are present in the vocabulary of most English speakers: Aleut, Peru, Pluto, Slav; the same is true of personal names such as Larry, Leo, Stu, Tess. But the rest of the words are very unlikely to turn up in the smalltalk of teenage sweethearts. Indeed, the list is full of letter sequences I simply don’t recognize as English words. Please define isuret, ovey, spleet, or sput.

There are even bigger word lists out there. In 2006 Google extracted 13.5 million unique English words from public web pages. (The sheer number implies a very liberal definition of English and word.) A good place to start exploring this archive is Peter Norvig’s website, which offers a file with the 333,333 most frequent words from the corpus. The list begins as you might expect: the, of, and, to, a, in, for…; but the weirdness creeps in early. The single letters c, e, s, and x are all listed among the 100 most common “words,” and the rest of the alphabet turns up soon after. By the time we get to the end of the file, it’s mostly typos (mepquest, halloweeb, scholarhips), run-together words (dietsdontwork, weightlossdrugs), and hundreds of letter strings that have some phonetic or orthographic resemblance to Google or Yahoo! or both (hoogol, googgl, yahhol, gofool, yogol). (I suspect that much of this rubbish was scraped not from the visible text of web pages but from metadata stuffed into headers for purposes of search-engine optimization.)

Applying the Google list to the search for foldable words more than doubles the volume of results, but it contributes almost nothing to the stock of words that might form interesting messages. I found 1,543 new words, beyond those that are also present in the union of the Scrabble and Unix lists. In alphabetical order, the additions begin: aae, aao, aaos, aar, aare, aaro, aars, aart, aarts, aase, aass, aast, aasu, aat, aats, aatsr, aau, aaus, aav, aave, aay, aea, aeae…. I’m not going to be folding up any straw wrappers with those words for my sweetheart.

What we really need, I begin to think, is not a longer word list but a shorter and more discriminating one.

12 Responses to Foldable Words

Jack Kennedy says:

9 February 2021 at 9:31 pm

Another great post.
Curious, my /usr/share/dict/words list does contain plurals and the rest. It has test, tests, tested, and testing. But I’m on Ubuntu, not Mac.
- Brian Hayes says:
  
  10 February 2021 at 1:29 pm
  
  Thanks so much for your comment. You have led the way to some interesting discoveries.
  
  I had thought (without actually thinking!) that the Unix word list would be essentially the same on all Unix-derived systems. I was quite wrong about that. It turns out the Ubuntu list has very little in common with the one distributed on Darwin (the MacOS kernel). The extent of the differences is made obvious just by counting the words: 99,171 for Ubuntu, 235,886 for Darwin. Even though Ubuntu has far fewer words overall, it has more proper nouns as well as words with accented letters and thousands of possessive forms and contractions. (The Darwin words are strictly [a-zA-Z].)
  
  The story gets more interesting: Just as there is no single Unix word list, there is no single Ubuntu word list. The list I’ve been looking at comes from a package called wamerican, but it’s just one of five lists for American English; there are also five each for British and Canadian English, as well as several packages for other languages.
  
  I have learned all this from ubuntu.com, which also reveals that the English-language lists come from a project called SCOWL, which has an abundance of other lists I’m eager to explore.
  
  I also hope to learn more about the history of these various word lists. I’m pretty sure that the Unix list began with the early spell utility developed at Bell Labs by Stephen Johnson and M. Douglas McIlroy. But that was more than 40 years ago.
  
  Finally, I must confess to a ridiculous error. The Darwin list omits tests, but it does include tested and testing, contrary to my assertion. My first draft of that passage stated that the Unix list “includes ape but not apes, aped, or aping,” and this was a true statement about the Darwin list. At some point I shifted to a different example word, and failed to follow through. Instead of checking the list directly, I looked at the output of the foldable-words program and noted the absence of tests, tested and testing. But the latter two could not possibly have been there because “It’s a pleasure to serve you” has no d, n, or g. An embarrassing lapse.
unekdoud says:

9 February 2021 at 11:34 pm

I consider “foldable” as the third of a chain of common inclusion-related concepts:
1. Prefix (e.g. I, It, It’s)
2. Substring (e.g. Plea, Sure)
3. Foldable (e.g. Tree, Pester, Sleeve)
4. Spellable (e.g. Atlasses, Stipulate, Pursuers)
5. Spellable with repeated letters (e.g. Peppery, Repetitive, Tattletale)

As binary relations on strings, each of these is transitive and reflexive. It’s possible to deduce from this, for example, that a substring of a foldable word is foldable.
- Brian Hayes says:
  
  10 February 2021 at 1:46 pm
  
  Nice taxonomy!
  
  But what is the rationale for giving prefix a category of its own, while leaving out suffix? Aren’t they both just slightly special cases of substring? Maybe it’s a matter of computational complexity? Enumerating all prefixes (or suffixes) is easier (I think) than finding all substrings. Perhaps that’s a sufficient reason for maintaining the distinction.
  
  What class is the hardest to enumerate? Classes 4 and 5 will have the largest outputs, but there are simple algorithms for picking all the qualifying entries out of a word list. Might it be the case that the foldable words are the hardest to identify?
  - unekdoud says:
    
    13 February 2021 at 5:28 am
    
    To maintain the pattern of the chain of relations, I had to choose between Prefix or Suffix. To me using prefix is more “natural”, but of course both are equally valid.
    
    Checking that a word is in the class should be equally easy for all of them, except that Prefix/Suffix lets you fail early. If we are very strict about not repeating strings, then I believe that Foldable is the hardest class to enumerate, or even draw randomly from.
    
    Even without repeated strings, the size of class 3 is still a few million strings. (It would be much smaller if there are long runs of repeated letters, but that wouldn’t be interesting at all!). So for classes 3 to 5, it’s intuitive that checking each word from the wordlist is faster than checking the whole class against the wordlist.
    
    I was going to conclude that class 3 is closest to the size of the English dictionary, but as you’ve pointed out it’s actually undesirable to use the whole dictionary! You can get better results using frequency lists (Wiktionary has an interesting collection), and it seems that they stop being useful after the most common 20,000 words, or an occurrence of 1 in 1-2 million.
Vlad-Stefan Harbuz says:

12 February 2021 at 6:47 am

Dear Brian,

Thank you for this post! I found it both genuinely sweet and a really interesting read, which I think is a testament to the quality of your writing.

Sending warm greetings from Switzerland,
Vlad
Alec Wilson says:

12 February 2021 at 1:32 pm

This was awesome! Thank you for writing this up!
John says:

20 February 2021 at 1:06 pm

a wonderful story. i’m 38. i can’t see 83. but, i’m taking care now, today, to love on my 3 kids, my wife of 16 years… one more and i’ll beat you..?

thank you for this.
Evangelos Sarmas says:

21 March 2021 at 1:22 pm

I suggest also the words_alpha.txt list in

https://github.com/dwyl/english-words

it seems to be the most complete of all for such experiments
(and many other uses)
- Evangelos Sarmas says:
  
  23 March 2021 at 5:29 am
  
  but not perfect… as I discovered recently
  for example “boyscout” and “astra” and others are missing
  the perfect word-list is still elusive today :D
Jonas Karlsson says:

27 March 2021 at 5:52 pm

This post was absurdly good and should be included in every anthology of mathematical writing.
- Brian Hayes says:
  
  28 March 2021 at 9:29 am
  
  Many thanks!