bit-player

AI and the end of programming

Brian Hayes — Wed, 13 Sep 2023 18:25:43 +0000

Earlier this year Matt Welsh announced the end of programming. He wrote, in Communications of the ACM:

I believe the conventional idea of “writing a program” is headed for extinction, and indeed, for all but very specialized applications, most software, as we know it, will be replaced by AI systems that are trained rather than programmed. In situations where one needs a “simple” program (after all, not everything should require a model of hundreds of billions of parameters running on a cluster of GPUs), those programs will, themselves, be generated by an AI rather than coded by hand.

A few weeks later, in an online talk, Welsh broadened his deathwatch. It’s not only the art of programming that’s doddering toward the grave; all of computer science is “doomed.” (The image below is a screen capture from the talk.)

The bearer of these sad tidings does not appear to be grief-stricken over the loss. Although Welsh has made his career as a teacher and practitioner of computer science (at Harvard, Google, Apple, and elsewhere), he seems eager to move on to the next thing. “Writing code sucks anyway!” he declares.

My own reaction to the prospect of a post-programming future is not so cheery. In the first place, I’m skeptical. I don’t believe we have yet crossed the threshold where machines learn to solve interesting computational problems on their own. I don’t believe we’re even close to that point, or necessarily heading in the right direction. Furthermore, if it turns out I’m wrong about all this, my impulse is not to acquiesce but to resist. I, for one, do not welcome our new AI overlords. Even if they prove to be better programmers than I am, I’ll hang onto my code editor and compiler all the same, thank you. Programming sucks? For me it has long been a source of pleasure and enlightenment. I find it is also a valuable tool for making sense of the world. I’m never sure I understand an idea until I can reduce it to code. To get the benefit of that learning experience, I have to actually write the program, not just say some magic words and summon a genie from Aladdin’s AI lamp.

Large Language Models

The idea that a programmable machine might write its own programs has deep roots in the history of computing. Charles Babbage hinted at the possibility as early as 1836, in discussing plans for his Analytical Engine. When Fortran was introduced in 1957, it was officially styled “The FORTRAN Automatic Coding System.” Its stated goal was to allow a computer “to code problems for itself and produce as good programs as human coders (but without the errors).”

Fortran did not abolish the craft of programming (or the errors), but it made the process less tedious. Later languages and other tools brought further improvements. And the dream of fully automatic programming has never died. Machines seem better suited to programming than most people are. Computers are methodical, rule-bound, finicky, and literal-minded—all traits that we associate (rightly or wrongly) with ace programmers.

Ironically, though, the AI systems now poised to take on programming chores are strangely uncomputerlike. Their personality is more Deanna Troi than Commander Data. Logical consistency, cause-and-effect reasoning, and careful attention to detail are not among their strengths. They have moments of uncanny brilliance, when they seem to be thinking deep thoughts, but they are also capable of spectacular failures—blatant and brazen lapses of rationality. They bring to mind the old quip: To err is human, to really foul things up requires a computer.

The latest cohort of AI systems are known as large language models (LLMs). Like most other recent AI inventions, they are built atop an artificial neural network, a multilayer structure inspired by the architecture of the brain. The nodes of the network are analogous to biological neurons, and the connections between nodes play the role of synapses, the junctions where signals pass from one neuron to another. Training the network adjusts the strength, or weight, of the connections. In a language model, the training is done by force-feeding the network huge volumes of text. When the process is complete, the connection weights encode detailed statistics on linguistic features of the training text. In the largest models the number of weights is 100 billion or more.

The term model in this context can be misleading. The word is not used in the sense of a scale model, or miniature, like a model airplane. It refers instead to a predictive model, like the mathematical models common in the sciences. Just as a model of the atmosphere predicts tomorrow’s weather, a language model predicts the next word in a sentence.

The best known LLM is ChatGPT, which was released to the public last fall, to great fanfare. The initials GPT stand for Generative Pre-trained Transformer; the Chat version of the system comes equipped with a conversational human interface. (Gee Pee Tee: My tongue is constantly tripping over those three rhyming syllables. Other AI products have cute names like Bard, Claude, and Llama; I wish I could rename GPT in the same spirit. I would call it Geppetto, which echoes the pattern of consonants. And, given GPT’s dodgy relations with the truth, it seems appropriate to invoke the creator of Pinocchio.) ChatGPT was developed by OpenAI, a firm founded in 2015 with the aim of liberating artificial intelligence from the grip of a few wealthy tech companies. OpenAI has succeeded in this mission, so much so that it has become a wealthy tech company.

ChatGPT elicits both admiration and alarm for its way with words, its gift of gab, its easy fluency in English and other languages. The chatbot can imitate famous authors, tell jokes, compose love letters, translate poetry, write junk mail, “help” students with their homework assignments, and fabricate misinformation for political mischief. For better or worse, these linguistic abilities represent a startling technical advance. Computers that once struggled to frame a single intelligible sentence have suddenly become glib wordsmiths. What GPT says may or may not be true, but it’s almost always well-phrased.

Soon after ChatGPT was announced, I was surprised to learn that its linguistic mastery extends to programming languages. It seems the training set for the model included not just multiple natural languages but also quantities of program source code from public repositories such as GitHub. Based on this resource, GPT was able to write new programs on command. I found this surprising because computers are so fussy and unforgiving about their inputs. A human reader will strive to make sense of an utterance despite small errors such as misspellings, but a computer will barf if it’s given a program with even one stray comma or mismatched parenthesis. A language model, with its underlying statistical or probabilistic nature, seemed unlikely to sustain the required precision for more than a few lines.

I was wrong about this. A key innovation in the new LLMs, known as the attention mechanism, addresses this very issue. When I began my own experiments with ChatGPT, I soon found that it can indeed produce programs free of careless syntax errors.

But other problems cropped up.

Climbing the word ladder

When you sit down to chat with a machine, you immediately face the awkward question: “What shall we talk about?” I was looking for a subject that would serve as a fair measure of ChatGPT’s programming prowess. I wanted a problem that can be solved by computational means, but that doesn’t call for doing much arithmetic, which is known to be one of the weak points of LLMs. I settled on a word puzzle invented 150 years ago by Lewis Carroll and analyzed in depth by Donald E. Knuth in the 1990s.

In the transcript below, my side of each exchange is labeled BR; the rosette, which is an OpenAI logo, designates ChatGPT’s responses.

As I watched these sentences unfurl on the screen—the chatbot types them out letter by letter, somewhat erratically, as if pausing to gather its thoughts—I was immediately struck by the system’s facility with the English language. In plain, sturdy prose, GPT catalogs all the essential features of a word ladder: It’s a game or puzzle, you proceed from word to word by changing one letter at a time, each rung of the ladder must be an English word, the aim is to find the shortest possible sequence from the starting word to the target word. I couldn’t explain it better myself. Most helpful of all is the worked example of COLD -> WARM.

It’s not just the individual sentences that give an impression of linguistic competence. The sentences are organized in paragraphs, and the paragraphs are strung together to form a coherent discourse. Bravo!

Also remarkable is the bot’s ability to handle vague and slapdash inputs. My initial query was phrased as a yes-or-no question, but ChatGPT correctly interpreted it as a request: “Tell me what you know about word ladders.” My second instruction neglects to include any typographic hints showing that lead and gold are to be understood as words, not metals. The chatbot would have been within its rights to offer me an alchemical recipe, but instead it supplied the missing quotation marks.

Setting aside all this linguistic and rhetorical sophistication, however, what I really wanted to test was the program’s ability to solve word-ladder problems. The two examples in the transcript above can both be found on the web, and so they might well have appeared in ChatGPT’s training data. In other words, the LLM may have memorized the solutions rather than constructed them. So I submitted a tougher assignment:

At first glance, it seems ChatGPT has triumphed again, solving an instance of the puzzle that I’m pretty sure it has not seen previously. But look closer. MARSH -> MARIS requires two letter substitutions, and so does PARIL -> APRIL. The status of MARIS and PARIL as “valid words” might also be questioned. I lodged a complaint:

Whoa! The bot offers unctuous confessions and apologies, but the “corrected” word ladder is crazier than ever. It looks like we’re playing Scrabble with Humpty Dumpty, who declares “APRCHI is a word if I say it is!” and then scatters all the tiles.

This is not an isolated, one-of-a-kind failure. All of my attempts to solve word ladders with ChatGPT have gone off the rails, although not always in the same way. In one case I asked for a ladder from REACH to GRASP. The AI savant proposed this solution:

REACH -> TEACH -> PEACH -> PEACE -> PACES -> PARES -> PARSE ->
PARSE -> PARSE -> PARSE -> PARKS -> PARKS -> PARKS -> PARKS ->
PARES -> GRASP.

And there was this one:

SWORD -> SWARD -> REWARD -> REDRAW -> REDREW ->
REDREW -> REDREW -> REDREW -> REDRAW -> REPAID ->
REPAIR -> PEACE

Now we are babbling like a young child just learning to count: “One, two, three, four, three, four, four, four, seven, blue, ten!”

All of the results I have shown so far were recorded with version 3.5 of ChatGPT. I also tried the new-and-improved version 4.0, which was introduced in March. The updated bot exudes the same genial self-assurance, but I’m afraid it also has the same tendency to drift away into oblivious incoherence:

The ladder starts out well, with four steps that follow all the rules. But then the AI mind wanders. From PLAGE to PAGES requires four letter substitutions. Then comes PASES, which is not a word (as far as I know), and in any case is not needed here, since the sequence could go directly from PAGES to PARES. More goofiness ensues. Still, I do appreciate the informative note about PLAGE.
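The rule violations I have been pointing out by eye are easy to check mechanically. Here is a minimal sketch of such a checker; the function names are my own inventions for illustration, and the check covers only the one-letter rule, not whether each rung is a real word (that would need a dictionary).

```python
def letter_differences(a: str, b: str) -> int:
    """Count the positions at which two words differ.

    A difference in length is counted as that many extra mismatches.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def check_ladder(ladder: list[str]) -> list[tuple[str, str, int]]:
    """Return every step that does not change exactly one letter."""
    bad_steps = []
    for prev, curr in zip(ladder, ladder[1:]):
        diff = letter_differences(prev, curr)
        if diff != 1:
            bad_steps.append((prev, curr, diff))
    return bad_steps


# Steps discussed in the text.
print(check_ladder(["MARSH", "MARIS"]))            # [('MARSH', 'MARIS', 2)]
print(check_ladder(["PLAGE", "PAGES", "PARES"]))   # [('PLAGE', 'PAGES', 4)]
```

A step that repeats a word (PARSE -> PARSE) is reported with a difference of zero, so the stuttering ladders shown earlier would be flagged as well.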

Recently I’ve also had a chance to try out Llama 2, an LLM published by Meta (the Facebook people). Although this model was developed independently of GPT, it seems to share some of the same mental quirks, such as laying down rules and then blithely ignoring them. When I asked for a ladder connecting REACH to GRASP, this is what Llama 2 proposed:

REACH -> TEACH -> DEACH -> LEACH -> SPEECH -> SEAT ->
FEET -> GRASP

What can I say? I guess a bot’s REACH should exceed its GRASP.

Oracles and code monkeys

Matt Welsh mentions two modes of operation for a computing system built atop a large language model. So far we’ve been working in what I’ll call oracle mode, where you ask a question and the computer returns an answer. You supply a pair of words, and the system finds a ladder connecting them, doing whatever computing is needed to get there. You deliver a shoebox full of financial records, and the computer fills out your Form 1040. You compile historical climate data, and the computer predicts the global average temperature in 2050.

The alternative to an AI oracle is an AI code monkey. In this second mode, the machine does not directly answer your question or perform your calculation; instead it creates a program that you can then run on a conventional computer. What you get back from the bot is not a word ladder but a program to generate word ladders, written in the programming language of your choice. Instead of a completed tax return you get tax-preparation software; instead of a temperature prediction, a climate model.

Let’s give it a try with ChatGPT 3.5:

Again, a cursory glance at the output suggests a successful performance. ChatGPT seems to be just as fluent in JavaScript as it is in English. It knows the syntax for if, while, and for statements, as well as all the finicky punctuation and bracketing rules. The machine-generated program appears to bring all these components together to accomplish the specified task. Also note the generous sprinkling of explanatory comments, which are surely meant for our benefit, not for its. Likewise the descriptive variable names (currentWord, newWord, ladder).

ChatGPT has also, on its own initiative, included instructions for running the program on a specific example (MARCH to APRIL), and it has printed out a result, which matches the answer it gave in our earlier exchange. Was that output generated by actually running the program? ChatGPT does not say so explicitly, but it does make the claim that if you run the program as instructed, you will get the result shown (in all its nonsensical glory).

We can test this claim by loading the program into a web browser or some other JavaScript execution environment. The verdict: Busted! The program does run, but it doesn’t yield the stated result. The program’s true output is: MARCH -> AARCH -> APRCH -> APRIH -> APRIL. This sequence is a little less bizarre, in that it abides by the rule of changing only one letter at a time, and all the “words” have exactly five letters. On the other hand, none of the intermediate “words” are to be found in an English dictionary.

There’s an easy algorithm for generating the sequence MARCH -> AARCH -> APRCH -> APRIH -> APRIL. Just step through the start word from left to right, changing the letter at each position so that it matches the corresponding letter in the goal word. Following this rule, any pair of five-letter words can be laddered in no more than five steps. MARCH -> APRIL takes only four steps because the R in the middle doesn’t need to be changed. I can’t imagine an easier way to make word ladders—assuming, of course, that you’re willing to let any and every mishmash of letters count as a word.
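The left-to-right rule just described fits in a few lines. This is my own sketch in Python, not the JavaScript that ChatGPT produced, and the function name is hypothetical.

```python
def naive_ladder(start: str, goal: str) -> list[str]:
    """Walk through the start word, fixing one position at a time.

    No dictionary is consulted, so the intermediate rungs are
    usually not English words.
    """
    if len(start) != len(goal):
        raise ValueError("words must have the same length")
    ladder = [start]
    current = list(start)
    for i, target_letter in enumerate(goal):
        if current[i] != target_letter:
            current[i] = target_letter
            ladder.append("".join(current))
    return ladder


print(" -> ".join(naive_ladder("MARCH", "APRIL")))
# MARCH -> AARCH -> APRCH -> APRIH -> APRIL
```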

The program created by ChatGPT could use this quick-and-dirty routine, but instead it does something far more tedious: It constructs all possible ladders whose first rung is the start word, and continues extending those ladders until it stumbles on one that includes the target word. This is a brute-force algorithm of breathtaking wastefulness. Each letter of the start word can be altered in 25 ways. Thus a five-letter word has 125 possible successors. By the time you get to a five-rung ladder, there are 190 million possibilities. (The examples I have presented here, such as MARCH -> APRIL and REACH -> GRASP, have one unchanging letter, and so the solutions require only four steps. Trying to compute full five-step solutions exhausted my patience.)
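To make the branching factor concrete, here is a small sketch that enumerates every one-letter variant of a word. The helper name is hypothetical, and I assume an alphabet of the 26 uppercase letters.

```python
import string


def successors(word: str) -> list[str]:
    """List every string that differs from `word` at exactly one position."""
    result = []
    for i, original in enumerate(word):
        for letter in string.ascii_uppercase:
            if letter != original:
                result.append(word[:i] + letter + word[i + 1:])
    return result


# 5 positions, each with 25 alternative letters
print(len(successors("MARCH")))  # 125
```

Nearly all of those 125 strings are gibberish, and a program that never consults a word list has no way of telling the difference.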

Version 4 as code monkey

Let’s try the same code-writing exercise with ChatGPT 4. Given the identical prompt, here’s how the new bot responded:

The program has the same overall structure (a while loop with two nested for loops inside), and it adopts the same algorithmic strategy (generating all strings that differ from a given word at exactly one position). But there’s a big novelty in the GPT-4 version: acknowledgment that a word list is essential. And with this change we finally have some hope of generating a ladder made up of genuine words.
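The strategy described here can be sketched as follows. This is my own Python reconstruction of the general approach (a while loop over a queue, with two nested for loops generating variants), not the code GPT-4 wrote; the function name and the tiny word list are assumptions made for illustration.

```python
from collections import deque
import string


def find_ladder(start, goal, words):
    """Breadth-first search over one-letter variants, keeping only listed words."""
    valid = set(words)
    queue = deque([[start]])
    seen = {start}
    while queue:
        ladder = queue.popleft()
        current = ladder[-1]
        if current == goal:
            return ladder
        for i in range(len(current)):                 # each position
            for letter in string.ascii_lowercase:     # each replacement letter
                candidate = current[:i] + letter + current[i + 1:]
                if candidate in valid and candidate not in seen:
                    seen.add(candidate)
                    queue.append(ladder + [candidate])
    return None  # no ladder exists within this word list


# A toy vocabulary, invented for this example.
toy_words = ["cold", "cord", "card", "ward", "warm", "word", "worm", "corm"]
print(find_ladder("cold", "warm", toy_words))
```

With a vocabulary that omits the necessary intermediate words, the function returns None, which corresponds to the "no ladder exists" report.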

Although GPT-4 recognizes the need for a list, it supplies only a placeholder, namely the 10-word sequence it confected for the REACH -> GRASP example given above. This stub of a word list isn’t good for much—not even for regenerating the bogus REACH-to-GRASP ladder. If you try to do that, the program will report that no ladder exists. This outcome is not incorrect, since the 10 given words do not form a valid pathway that changes just one letter on each step.

Even if the words in the list were well-chosen, a vocabulary of size 10 is awfully puny. Producing a bigger list of words seems like a task that would be easy for a language model. After all, the LLM was trained on a vast corpus of text, in which nearly all English words are likely to appear at least once, and common words would turn up millions of times. Can’t the bot just extract a representative sample of those words? The answer, apparently, is no. Although GPT might be said to have “read” all that text, it has not stored the words in any readily accessible form. (The same is true of a human reader. Can you, by thinking back over a lifetime of reading, list the 10 most common five-letter words in your vocabulary?)

When I asked ChatGPT 4 to produce a word list, it demurred apologetically: “I’m sorry for the confusion, but as an AI developed by OpenAI, I don’t have direct access to a database of words or the ability to fetch data from external sources….” So I tried a bit of trickery, asking the bot to write a 1,000-word story and then sort the story’s words in order of frequency. This ruse worked, but the sample was too small to be of much use. With persistence, I could probably coax an acceptable list from GPT, but instead I took a shortcut. After all, I am not an AI developed by OpenAI, and I have access to external resources. I appropriated the list of 5,757 five-letter English words that Knuth compiled for his experiments with word ladders. Equipped with this list, the program written by GPT-4 finds the following nine-step ladder:

REACH -> PEACH -> PEACE -> PLACE -> PLANE ->
PLANS -> GLANS -> GLASS -> GRASS -> GRASP

This result exactly matches the output of Knuth’s own word-ladder program, which he published 30 years ago in The Stanford GraphBase.

At this point I must concede that ChatGPT, with a little outside help, has finally done what I asked of it. It has written a program that can construct a valid word ladder. But I still have reservations. Although the programs written by GPT-4 and by Knuth produce the same output, the programs themselves are not equivalent, or even similar.

Knuth approaches the problem from the opposite direction, starting not with the set of all possible five-letter strings—there are not quite 12 million of them—but with his much smaller list of 5,757 common English words. He then constructs a graph (or network) in which each word is a node, and two nodes are connected by an edge if and only if the corresponding words differ by a single letter. The diagram below shows a fragment of such a graph.

A fragment of a graph links 16 words according to the word-ladder rule: If one word can be converted into another word by changing a single letter, then those two words are connected by an edge in the graph. A word ladder is a path along the edges of the graph, such as leash -> leach -> reach -> retch. Donald E. Knuth’s graph for five-letter words has 5,757 nodes and 14,135 edges. The largest connected component of the graph includes 4,493 words.

In the graph, a word ladder is a sequence of edges leading from a start node to a goal node. The optimal ladder is the shortest such path, traversing the smallest number of edges. For example, the optimal path from leash to retch is leash -> leach -> reach -> retch, but there are also longer paths, such as leash -> leach -> beach -> peach -> reach -> retch. For finding shortest paths, Knuth adopts an algorithm devised in the 1950s by Edsger W. Dijkstra.
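A rough sketch of the graph-based approach follows. It is not Knuth's program: the graph is built by comparing every pair of words, and because every edge has the same weight I substitute a plain breadth-first search for Dijkstra's algorithm (the two find paths of the same length on such a graph). The function names are hypothetical, and the six-word list is just the handful of words mentioned above.

```python
from collections import deque
from itertools import combinations


def differs_by_one(a, b):
    """True if the words have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def build_graph(words):
    """Map each word to the set of words one letter away from it."""
    graph = {w: set() for w in words}
    for a, b in combinations(words, 2):
        if differs_by_one(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_ladder(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start not in graph or goal not in graph:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:      # walk back through the parents
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


words = ["leash", "leach", "reach", "retch", "beach", "peach"]
graph = build_graph(words)
print(" -> ".join(shortest_ladder(graph, "leash", "retch")))
# leash -> leach -> reach -> retch
```

The graph is built once and can then answer any number of ladder queries, and the search never touches a string that is not in the word list.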

Knuth’s word-ladder program requires an up-front investment to convert a simple list of words into a graph. On the other hand, it avoids the wasteful generation of thousands or millions of five-letter strings that cannot possibly be elements of a word ladder. In the course of solving the REACH -> GRASP problem, the GPT-4 program produces 219,180 such strings; only 2,792 of them (a little more than 1 percent) are real words.

If the various word-ladder programs I’ve been describing were student submissions, I would give a failing grade to the versions that have no word list. The GPT-4 program with a list would pass, but I’d give top marks only to the Knuth program, on grounds of both efficiency and elegance.

Why do the chatbots favor the inferior algorithm? You can get a clue just by googling “word-ladder program.” Almost all the results at the top of the list come from websites such as Leetcode, GeeksForGeeks and RosettaCode. These sites, which apparently cater to job applicants and competitors in programming contests, feature solutions that call for generating all 125 single-letter variants of each word, as in the GPT programs. Because sites like these are numerous—there seem to be hundreds of them—they outweigh other sources, such as Knuth’s book (if, indeed, such texts even appear in the training set). Does that mean we should blame the poor choice of algorithm not on GPT but on Leetcode? I would point instead to the unavoidable weakness of a protocol in which the most frequent answer is, by default, the right answer.

A further, related concern nags at me whenever I think about a world in which LLMs are writing all our software. Where will new algorithms come from? An LLM might get creative in remixing elements of existing programs, but I don’t see any way it would invent something wholly new and better.

Enough with the word ladders already!

I admit I’ve gone overboard, torturing ChatGPT with too many variations of one peculiar (and inconsequential) problem. Maybe the LLMs would perform better on other computational tasks. I have tried several, with mixed results. I want to discuss just one of them, where I find ChatGPT’s efforts rather poignant.

Working with ChatGPT 3.5, I ask for the value of the 100th Fibonacci number. Note that my question is phrased in oracle mode; I ask for the number, not for a program to calculate it. Nevertheless, ChatGPT volunteers to write a Fibonacci program, and then presents what purports to be the program’s output.

The algorithm implemented by this program is mathematically correct; it follows directly from the definition of a Fibonacci number as a member of the sequence beginning {0, 1}, with every subsequent element equal to the sum of the previous two terms. The answer given is also correct: 354224848179261915075 is indeed the 100th Fibonacci number. So what’s the problem? It’s the middle statement: “When you run this code, it will output the 100th Fibonacci number.” That’s not true. If you run the code, what you’ll get back is the incorrect value 354224848179262000000. The source of this anomaly is JavaScript’s use of floating-point arithmetic, even with integer values. Under the IEEE standard for floating point, the largest integer that can be represented without loss of precision is $2^{53} - 1$; the 100th Fibonacci number is roughly $2^{68}$. (Recent versions of JavaScript provide a BigInt data type that solves this problem, but BigInts must be specified explicitly, and the ChatGPT program does not do so.) This is what I mean by poignant: ChatGPT gives the right answer, but the method it claims to use to compute that answer fails to supply it. The bot must have found the correct value in some other way, which it does not disclose.
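The precision loss can be reproduced without JavaScript. Python floats are the same IEEE double-precision values, so running the iterative algorithm once with integers and once with floats shows the discrepancy. This is a sketch of my own, not the program ChatGPT wrote.

```python
def fib(n, zero, one):
    """Iterate the Fibonacci recurrence n times from the given seed values."""
    a, b = zero, one
    for _ in range(n):
        a, b = b, a + b
    return a


exact = fib(100, 0, 1)         # Python integers have arbitrary precision
approx = fib(100, 0.0, 1.0)    # doubles, as in JavaScript's Number type

print(exact)                                 # 354224848179261915075
print(int(approx) == exact)                  # False
print(float(2**53) == float(2**53 + 1))      # True: 2**53 + 1 has no exact double
```

Near $2^{68}$ adjacent doubles are 65,536 apart, so an odd number such as the true answer cannot be represented at all.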

Giving the same task to ChatGPT 4.0 takes us on an even stranger journey. In the following interaction I had activated Code Interpreter, a ChatGPT plugin that allows the system to test and run some of the code it writes. Apparently making use of this facility, the bot first proposes a program that fails for unknown reasons:

Here ChatGPT is writing in Python, which is the main programming language supported by Code Interpreter. The first attempt at writing a program is based on exponentiation of the Fibonacci matrix:

$$\mathrm{F}_1=\left[\begin{array}{ll}
1 & 1 \\
1 & 0
\end{array}\right].$$

This is a well-known and efficient method, and the program implements it correctly. For mysterious reasons, however, Code Interpreter fails to execute the program. (The code runs just fine in a standard Python environment, and returns correct answers.)
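For readers who have not met the matrix method, here is a minimal sketch. It relies on the identity that the $n$th power of the matrix above has $F_n$ as its off-diagonal entry, and it uses repeated squaring so that only about $\log_2 n$ multiplications are needed. This is my own illustration, not the code ChatGPT produced.

```python
def mat_mul(x, y):
    """Multiply two 2x2 matrices given as nested tuples."""
    (a, b), (c, d) = x
    (e, f), (g, h) = y
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def mat_pow(m, n):
    """Raise a 2x2 matrix to the nth power by repeated squaring."""
    result = ((1, 0), (0, 1))  # identity matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        n >>= 1
    return result


def fib_matrix(n):
    """Return F(n), read from the off-diagonal entry of [[1, 1], [1, 0]]**n."""
    return mat_pow(((1, 1), (1, 0)), n)[0][1]


print(fib_matrix(100))  # 354224848179261915075
```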

At this point the bot pivots and takes off in a wholly new direction, proposing to calculate the desired Fibonacci value via the mathematical identity known as Binet’s formula. It gets as far as writing out the mathematical expression, but then has another change of mind. It correctly foresees problems with numerical precision: The formula would yield an exact result if it were given an exact value for the square root of 5, but that’s not feasible.
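The precision problem is easy to demonstrate. The sketch below uses the rounded form of Binet's formula, $F_n = \mathrm{round}(\varphi^n / \sqrt{5})$ with $\varphi = (1 + \sqrt{5})/2$, evaluated in ordinary floating point, and compares it against exact integer arithmetic. The function names are mine.

```python
import math


def fib_binet(n):
    """Binet's formula in double precision; exact only for small n."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)


def fib_exact(n):
    """Reference value computed with arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(fib_binet(10) == fib_exact(10))      # True
print(fib_binet(100) == fib_exact(100))    # False

# Smallest n at which the floating-point version goes wrong.
first_bad = next(n for n in range(200) if fib_binet(n) != fib_exact(n))
print(first_bad)
```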

And so now ChatGPT takes yet another tack, resorting to the same iterative algorithm used by version 3.5. This time we get a correct answer, because Python (unlike JavaScript) has no trouble coping with large integers.

I’m impressed by this performance, not just by the correct answer but also by the system’s courageous perseverance. ChatGPT soldiers on even as it flails about, puzzled by unexpected difficulties but refusing to give up. “Hmm, that matrix method should have worked. But, no matter, we’ll try Binet’s formula instead… Oh, wait, I forgot… Anyway, there’s no need to be so fancy-pants about this. Let’s just do it the obvious, plodding way.” This strikes me as a very human approach to problem-solving. It’s weird to see such behavior in a machine.

Scoring successes and failures

My little experiments have left me skeptical of claims that AI oracles and AI code monkeys are about to elbow aside human programmers. I’ve seen some successes, but more failures. And this dismal record was compiled on relatively simple computational tasks for which solutions are well known and widely published.

Others have conducted much broader and deeper evaluations of LLM code generation. In the bibliography at the end of this article I’ve listed five such studies. I’d like to briefly summarize a few of the results they report.

Two years ago, Mark Chen and a roster of 50+ colleagues at OpenAI devoted much effort to measuring the accuracy of Codex, an offshoot of GPT-3 specialized for writing program code. (Codex has since become the engine driving GitHub Copilot, a “programmer’s assistant.”) Chen et al. created a suite of 164 tasks to be accomplished by writing Python programs. The tasks were mainly of the sort found in textbook exercises, programming contests, and the (strangely voluminous) literature on how to ace an interview for a coding job. Most of the tasks can be accomplished with just a few lines of code. Examples: Count the vowels in a given word, determine whether an integer is prime or composite.

The Chen group also gave some thought to defining criteria of success and failure. Because the LLM process is nondeterministic—word choices are based on probabilities—the model might generate a defective program on the first try but eventually come up with a correct response if it’s allowed to keep trying. A parameter called temperature controls the amount of nondeterminism. At zero temperature the model invariably chooses the most probable word at every step; as the temperature rises, randomness is introduced, allowing less probable words to be selected. Chen et al. took account of this potential for variation by adopting three benchmarks of success:

pass@1:	the LLM generates a correct program on the first try
pass@10:	at least one of 10 generated programs is correct
pass@100:	at least one of 100 generated programs is correct
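One way to put a number on these benchmarks is to generate $n$ programs for a task, count the $c$ that pass, and ask how likely it is that a random batch of $k$ of them contains at least one success: $1 - \binom{n-c}{k} / \binom{n}{k}$. As I understand it, Chen et al. describe an estimator of this form. The sketch below is my own, with a hypothetical function name and made-up counts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that k programs drawn from n (c of them correct) include a correct one."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented example: 200 samples for one task, 20 of which pass the tests.
for k in (1, 10, 100):
    print(k, pass_at_k(200, 20, k))
```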

Pass@1 tests were conducted at zero temperature, so that the model would always give its best-guess result. The pass@10 and pass@100 trials were done at higher temperatures, allowing the system to explore a wider variety of potential solutions.
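The role of temperature can be sketched in a few lines. I assume here that the model assigns a raw score to each candidate word and that the scores are divided by the temperature before being converted to probabilities, which is the usual scheme; the candidate words and their scores are invented for the example.

```python
import math
import random


def sample_with_temperature(scores, temperature, rng):
    """Pick an index from `scores`; zero temperature always takes the top score."""
    if temperature == 0:
        return max(range(len(scores)), key=lambda i: scores[i])
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                        # subtract the peak for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


candidates = ["tomorrow", "soon", "later"]
scores = [2.0, 1.0, 0.5]                      # hypothetical model scores
rng = random.Random(0)

for temperature in (0, 0.5, 5.0):
    picks = [candidates[sample_with_temperature(scores, temperature, rng)]
             for _ in range(1000)]
    print(temperature, {w: picks.count(w) for w in candidates})
```

At zero temperature every pick is the top-scoring word; as the temperature rises, the counts spread out toward a uniform distribution.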

The authors evaluated several versions of Codex on all 164 tasks. For the largest and most capable version of Codex, the pass@1 rate was about 29 percent, the pass@10 rate 47 percent, and pass@100 reached 72 percent. Looking at these numbers, should we be impressed or appalled? Is it cause for celebration that Codex gets it right on the first try almost a third of the time (when the temperature is set to zero)? Or that the success ratio climbs to almost three-fourths if you’re willing to sift through 100 proposed programs looking for one that’s correct? My personal view is this: If you view the current generation of LLMs as pioneering efforts in a long-term research program, then the results are encouraging. But if you consider this technology as an immediate replacement for hand-coded software, it’s quite hopeless. We’re nowhere near the necessary level of reliability.

Other studies have yielded broadly similar results. Federico Cassano et al. assess the performance of several LLMs generating code in a variety of programming languages; they report a wide range of pass@1 rates, but only two exceed 50 percent. Alessio Buscemi tested ChatGPT 3.5 on 40 coding tasks, asking for programs written in 10 languages, and repeating each query 10 times. Of the 4,000 trials, 1,833 resulted in code that could be compiled and executed. Zhijie Liu et al. based their evaluation of ChatGPT on problems published on the Leetcode website. The results were judged by submitting the generated code to the automated Leetcode grading process. The acceptance rate, averaged across all problems, ranged from 31 percent for programs written in C to 50 percent for Python programs. Liu et al. make another interesting observation: ChatGPT scores much worse on problems published after September 2021, which was the cutoff date for the GPT training set. They conjecture that the bot may do better with the earlier problems because it already saw the solutions during training.

A very recent paper by Li Zhong and Zilong Wang goes beyond the basic question of program correctness to consider robustness and reliability. Can the generated program respond appropriately to malformed input or to external errors, as when trying to open a file that doesn’t exist? Even when the prompt to the LLM included an example showing how to handle such issues correctly, Zhong and Wang found that the generated code failed to do so in 30 to 50 percent of cases.

Beyond these discouraging results, I have further misgivings of my own. Almost all the tests were conducted with short code snippets. An LLM that stumbles when writing a 10-line program is likely to have much greater difficulty with 100 lines or 1,000. Also, simple pass/fail grading is a very crude measure of code quality. Consider the primality test that was part of the Chen group’s benchmarking suite. Here is one of the Codex-written programs:

def is_prime(n):
    prime = True 
    if n==1:
        return False
    for i in range(2, n):
        if n % i == 0: 
            prime = False
    return prime

This code is scored as correct—as it should be, since it will never misclassify a prime as composite, or vice versa. However, you may not have the patience—or the lifespan—to wait for a verdict when $n$ is large. The algorithm attempts to divide $n$ by every integer from 2 to $n-1$.
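For comparison, here is a sketch of a standard refinement: stop at the first divisor found, skip even numbers, and test candidates only up to $\sqrt{n}$, since any composite $n$ must have a factor no larger than its square root. This is my own version, not part of the Codex output, and it is still trial division rather than a serious primality test.

```python
from math import isqrt


def is_prime_fast(n: int) -> bool:
    """Trial division by odd numbers up to the integer square root of n."""
    if n < 2:
        return False
    if n < 4:
        return True            # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for i in range(3, isqrt(n) + 1, 2):
        if n % i == 0:
            return False       # stop at the first divisor
    return True


print(is_prime_fast(2**31 - 1))   # True
```

For $n = 2^{31} - 1$ this loop runs about 23,000 times; the Codex version would need more than two billion iterations.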

The unreasonable effectiveness of LLMs

It’s still early days for large language models. ChatGPT was published less than a year ago; the underlying technology is only about six years old. Although I feel pretty sure of myself declaring that LLMs are not yet ready to conquer the world of coding, I can’t be quite so confident predicting that they never will be. The models will surely improve, and we’ll get better at using them. Already there is a budding industry offering guidance on “prompt engineering” as a way to get the most out of every query.

Another way to bolster the performance of an LLM might be to form a hybrid with another computational system, one equipped with tools for logic and reasoning rather than purely verbal analysis. Doug Lenat, just before his recent death, proposed combining an LLM with Cyc, a huge database of common knowledge that he had labored to build over four decades. And Stephen Wolfram is working to integrate ChatGPT into Wolfram|Alpha, an online collection of curated data and algorithms.

Still, some of the obstacles that trip up LLM program generators look difficult to overcome.

Language models work their magic through simple means: In the course of composing a sentence or a paragraph, the LLM chooses its next word based on the words that have come before. It’s like writing a text message on your phone: You type “I’ll see you…” and the software suggests a few alternative continuations: “tomorrow,” “soon,” “later.” In the LLM each candidate is assigned a probability, calculated from an analysis of all the text in the model’s training set.

The idea of text generation through this kind of statistical analysis was first explored more than a century ago by the Russian mathematician A. A. Markov. His procedure is now known as an n-gram model, where n is the number of words (or characters, or other symbols) to be considered in choosing the next element of the sequence. I have long been fascinated by the n-gram process, though mostly for its comedic possibilities. (In an essay published 40 years ago, I referred to it as “the fine art of turning literature into drivel.”)
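The n-gram procedure can be sketched in a few lines. This is a toy illustration under simple assumptions (whitespace-separated words, a tiny made-up source text, and hypothetical function names), not a reconstruction of Markov's method or of any particular program.

```python
import random
from collections import defaultdict

def build_ngram_table(words, n):
    """Map each run of n consecutive words to the words seen following it."""
    table = defaultdict(list)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        table[context].append(words[i + n])
    return table

def generate(table, start, length, rng):
    """Extend `start` one word at a time, sampling by observed frequency."""
    out = list(start)
    n = len(start)
    while len(out) < length:
        followers = table.get(tuple(out[-n:]))
        if not followers:
            break            # this context never appeared in the source text
        out.append(rng.choice(followers))
    return out

text = "the cat sat on the mat and the cat ate the rat".split()
table = build_ngram_table(text, 1)
sample = generate(table, ("the",), 8, random.Random(0))
print(" ".join(sample))
```

Because repeated followers are kept in the list, a word that often follows a given context is proportionally more likely to be chosen, which is all the "probability" this simple model has.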

Of course ChatGPT and the other recent LLMs are not merely n-gram models. Their neural networks capture statistical features of language that go well beyond a sequence of n contiguous symbols. Of particular importance is the attention mechanism, which is capable of tracking dependencies between selected symbols at arbitrary distances. In natural languages this device is useful for maintaining agreement of subject and verb, or for associating a pronoun with its referent. In a programming language, the attention mechanism ensures the integrity of multipart syntactic constructs such as if... then... else, and it keeps brackets properly paired and nested.

ChatGPT and other LLMs also benefit from reinforcement learning supervised by human readers. When a reader rates the quality and accuracy of the model’s output, the positive or negative feedback helps shape future responses.

Even with these refinements, however, an LLM remains, at bottom, a device for constructing a new text based on probabilities of word occurrence in existing texts. To my way of thinking, that’s not thinking. It’s something shallower, focused on words rather than ideas. Given this crude mechanism, I am both amazed and perplexed at how much the LLMs can accomplish.

For decades, architects of AI believed that true intelligence (whether natural or artificial) requires a mental model of the world. To make sense of what’s going on around you (and inside you), you need intuition about how things work, how they fit together, what happens next, cause and effect. Lenat insisted that the most important kinds of knowledge are those you acquire long before you start reading books. You learn about gravity by falling down. You learn about entropy when you find that a tower of blocks is easy to knock over but harder to rebuild. You learn about pain and fear and hunger and love—all this in infancy, before language begins to take root. Experiences of this kind are unavailable to a brain in a box, with no direct access to the physical or the social universe.

LLMs appear to be the refutation of these ideas. After all, they are models of language, not models of the world. They have no embodiment, no physical presence that would allow them to learn via the school of hard knocks. Ignorant of everything but mere words, how do they manage to sound so smart, so worldly?

On this point opinions differ. Critics of the technology say it’s all fakery and illusion. A celebrated (or notorious?) paper by Emily Bender, Timnit Gebru, and others dubs the models “stochastic parrots.” Although an LLM may speak clearly and fluently, the critics say, it has no idea what it’s talking about. Unlike a child, who learns mappings between words and things—Look! a cow, a cat, a car—the LLM can only associate words with other words. During the training process it observes that umbrella often appears in the same context as rain, but it has no experience of getting wet. The model’s modus operandi is akin to the formalist approach in mathematics, where you push symbols around on the page, moving them from one side of the equation to the other, without ever asking what the symbols symbolize. To paraphrase Saul Gorn: A formalist can’t understand a theory unless it’s meaningless.

The Jacquet-Droz automaton called “The Writer.” Image credit: Wikimedia user Rama

Two hundred fifty years ago the Swiss watchmaker Pierre Jacquet-Droz built a mechanical automaton that could write with a quill pen. The clockwork device, with hundreds of cams and gears, was dressed up as a small boy seated on a stool. When activated, the boy dipped the pen in ink and wrote out a brief message—most famously the Cartesian epigram “I think, therefore I am.” How droll! But even in the 18th century, no one believed the scribbling doll was really thinking. LLM skeptics would put ChatGPT in the same category.

Proponents of LLMs see it differently. One of their stronger arguments is that language models learn syntax, or grammar, in much the same way that young children do. No one teaches them explicit rules for conjugating verbs, but through exposure to correctly framed sentences they absorb those rules and many others, and thereafter they produce grammatically acceptable sentences of their own composition. (The mechanism may be similar, but children acquire language from a much smaller training set—perhaps thousands of sentences rather than billions.) If this process works for syntax, why not semantics too? Perhaps words alone, without the aid of sight and sound and smell, are enough to teach the difference between a cow and a cat. Even for human learners, direct experience is not the only path to understanding. We have imagination. Dante wrote vivid descriptions of heaven and hell—places he had never visited. Shakespeare wrote 10 plays about kings of England, even though he had never been a king, or even met one. Is it so hard to believe that ChatGPT might have some sense of what a thundershower looks and feels like?

Ilya Sutskever, the chief scientist at OpenAI, made this point forcefully in a conversation with Jensen Huang of Nvidia:

When we train a large neural network to accurately predict the next word, in lots of different texts from the internet, what we are doing is that we are learning a world model… It may look on the surface that we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world. There is a world out there, and it has a projection on this text. And so what the neural network is learning is more and more aspects of the world, of people, of the human conditions, their hopes, dreams and motivations, their interactions and situations that we are in, and the neural network learns a compressed, abstract, usable representation of that.

Am I going to tell you which of these contrasting theories of LLM mentality is correct? I am not. Neither alternative appeals to me. If Bender et al. are right, then we must face the fact that a gadget with no capacity to reason or feel, no experience of the material universe or social interactions, no sense of self, can do a passable job of writing college essays, composing rap songs, and giving advice to the lovelorn. Knowledge and logic and emotion count for nothing; glibness is all. It’s a subversive proposition. If ChatGPT can fool us with this mindless showmanship, perhaps we too are impostors whose sound and fury signifies nothing.

On the other hand, if Sutskever is right, then much of what we prize as the human experience—the sense of personhood that slowly evolves as we grow up and make our way through life—can be acquired just by reading gobs of text scraped from the internet. If that’s true, then I didn’t actually have to endure the unspeakable agony that is junior high school; I didn’t have to make all those idiotic blunders that caused such heartache and hardship; there was no need to bruise my ego bumping up against the world. I could have just read about all those things, in the comfort of my armchair; mere words would have brought me to a state of clear-eyed maturity without all the stumbling and suffering through the vale of soul-making.

Both the critics and the defenders of LLMs tend to focus on natural-language communication between the machine and a human conversational partner—the chatty part of ChatGPT. In trying to figure out what’s happening inside the neural network, it might be helpful to pay more attention to the LLM in the role of programmer, where both parties to the conversation are computers. I can offer three arguments in support of this suggestion. First, people are too easy to fool. We make allowances, fill in blanks, silently correct mistakes, supply missing context, and generally bend over backwards to make sense of an utterance, even when it makes no sense. Computers, in contrast, are stern judges, with no tolerance for bullshit.

Second, if we are searching for evidence of a mental model inside the machine, it’s worth noting that a model of a digital computer ought to be simpler than a model of the whole universe. Because a computer is an artifact we design and build to our own specifications, there’s not much controversy about how it works. Mental models of the universe, in contrast, are all over the map. Some of them begin with a big bang followed by cosmic inflation; some are ruled by deities and demons; some feature an epic battle between east and west, Jedi and Sith, red and blue, us and them. Which of these models should we expect to find imprinted on the great matrix of weights inside GPT? They could all be there, mixed up willy-nilly.

Third, programming languages have an unusual linguistic property that binds them tightly to actions inside the computer. The British philosopher J. L. Austin called attention to a special class of words he designated performatives. These are words that don’t just declare or describe or request; they actually do something when uttered. Austin’s canonical example is the statement “I do,” which, when spoken in the right context, changes your marital status. In natural language, performative words are very rare, but in computer programs they are the norm. Writing x = x + 1, in the right context, actually causes the value of x to be incremented. That direct connection between words and actions might be helpful when you’re testing whether a conceptual model matches reality.

I remain of two minds (or maybe more than two!) about the status and the implications of large language models for computer science. The AI enthusiasts could be right. The models may take over programming, along with many other kinds of working and learning. Or they may fizzle, as other promising AI innovations have done. I don’t think we’ll have too long to wait before answers begin to emerge.

Transformers and Large Language Models

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. https://arxiv.org/abs/1706.03762.

Amatriain, Xavier. 2023. Transformer models: an introduction and catalog. https://arxiv.org/abs/2302.07730.

Goldberg, Yoav. 2015. A Primer on neural network models for natural language processing. https://arxiv.org/abs/1510.00726.

Word ladders

Carroll, Lewis. 1879. Doublets: A Word-Puzzle. London: Macmillan and Co. (Available online at Google Books.)

Dewdney, A. K. 1987. Computer Recreations: Word ladders and a tower of Babel lead to computational heights defying assault. Scientific American 257(2):108–111.

Gardner, Martin. 1994. Word Ladders: Lewis Carroll’s Doublets. Math Horizons, November, 1994, pp. 18–19.

Knox, John. 1927. Word-Change Puzzles. Chicago: Laird and Lee. (Available online at archive.org.)

Knuth, Donald E. 1993. The Stanford Graphbase: A Platform for Combinatorial Computing. New York: ACM Press (Addison-Wesley Publishing).

Nabokov, Vladimir. 1962. Pale Fire. New York: G. P. Putnam’s Sons. [See the index under "word golf."]

LLMs for Programming

Dowdell, Thomas, and Hongyu Zhang. 2020. Language modelling for source code with Transformer-XL. https://arxiv.org/abs/2007.15813.

Karpathy, Andrej. 2017. Software 2.0. Medium.

Zhang, Shun, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. 2023. Planning with large language models for code generation. https://arxiv.org/abs/2303.05510.

Evaluations of LLMs as program generators

Buscemi, Alessio. 2023. A comparative study of code generation using ChatGPT 3.5 across 10 programming languages. https://arxiv.org/abs/2308.04477.

Cassano, Federico, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, Abhinav Jangda. 2022. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. https://arxiv.org/abs/2208.08227

Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. 2021. Evaluating large language models trained on code. https://arxiv.org/abs/2107.03374.

Liu, Zhijie, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023. No need to lift a finger anymore? Assessing the quality of code generation by ChatGPT. https://arxiv.org/abs/2308.04838.

Zhong, Li, and Zilong Wang. 2023. A study on robustness and reliability of large language model code generation. https://arxiv.org/abs/2308.10335.

Do They Know and Think?

Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: early experiments with GPT-4. https://arxiv.org/abs/2303.12712.

Bayless, Jacob. 2023. It’s not just statistics: GPT does reason. https://jbconsulting.substack.com/p/its-not-just-statistics-gpt-4-does

Arkoudas, Konstantine. 2023. GPT-4 can’t reason. https://arxiv.org/abs/2308.03762.

Yiu, Eunice, Eliza Kosoy, and Alison Gopnik. 2023. Imitation versus innovation: What children can do that large language and language-and-vision models cannot (yet)? https://arxiv.org/abs/2305.07666.

Piantadosi, Steven T., and Felix Hill. 2022. Meaning without reference in large language models. https://arxiv.org/abs/2208.02957.

Wong, Lionel, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. 2023. From word models to world models: Translating from natural language to the probabilistic language of thought. https://arxiv.org/abs/2306.12672.

The Middle of the Square

Brian Hayes — Mon, 08 Aug 2022 20:03:33 +0000

John von Neumann was a prodigy and a polymath. He made notable contributions in pure mathematics, physics, game theory, economics, and the design of computers. He also came up with the first algorithm for generating pseudorandom numbers with a digital computer. That last invention, however, is seldom counted among his most brilliant accomplishments. To the extent his random-number work is remembered at all, it is taken as a lesson in what not to do. (For example, Don Knuth leads off Vol. 2 of The Art of Computer Programming with a cautionary tale about the von Neumann scheme. I did the same in my first column for American Scientist, 30 years ago.)

The von Neumann algorithm is known as the middle-square method. You start with an n-digit number called the seed, which becomes the first element of the pseudorandom sequence, designated $x_0$. Squaring $x_0$ yields a number with 2n digits (possibly including leading zeros). Now select the middle n digits of the squared value, and call the resulting number $x_1$. The process can be repeated indefinitely, each time squaring $x_i$ and taking the middle digits as the value of $x_{i+1}$. If all goes well, the sequence of $x$ values looks random, with no obvious pattern or ordering principle. The numbers are called pseudorandom because there is in fact a hidden pattern: The sequence is completely predictable if you know the algorithm and the seed value.
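The procedure just described can be sketched as follows. This is an illustrative Python version, not the source code of the interactive program on this page; the function names are my own, and the arithmetic assumes the square is padded with leading zeros to $2n$ digits.

```python
def middle_square(x, n=4):
    """Return the middle n digits of x squared, viewed as a 2n-digit number."""
    assert n % 2 == 0, "the middle is ambiguous for odd widths"
    square = x * x
    # Discard the n/2 low-order digits, then keep the next n digits.
    return (square // 10 ** (n // 2)) % 10 ** n

def sequence(seed, count, n=4):
    """List the first `count` values x_0, x_1, ... starting from the seed."""
    values = []
    x = seed
    for _ in range(count):
        values.append(x)
        x = middle_square(x, n)
    return values

print(sequence(1234, 4))   # [1234, 5227, 3215, 3362]
```

With seed 1234, the square is 01522756, whose middle four digits are 5227; squaring 5227 gives 27321529, and so on.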

Program 1, below, shows the middle-square method in action, in the case of $n = 4$. Type a four-digit number into the seed box at the top of the panel, press Go, and you’ll see a sequence of eight-digit numbers, with the middle four digits highlighted. (If you press the Go button without entering a seed, the program will choose a random seed for you—using a pseudorandom-number generator other than the one being demonstrated here!)

Program 1.
(Source code on GitHub.)

Program 1 also reveals the principal failing of the middle-square method. If you’re lucky in your choice of a seed value, you’ll see 40 or 50 or maybe even 80 numbers scroll by. The sequence of four-digit numbers highlighted in blue should provide a reasonably convincing imitation of pure randomness, bouncing around willy-nilly among the integers in the range from 0000 to 9999, with no obvious correlation between a number and its nearby neighbors in the list. But this chaotic-seeming behavior cannot last. Sooner or later—and all too often it’s sooner—the sequence becomes repetitive, cycling through some small set of integers, such as 6100, 2100, 4100, 8100, and back to 6100. Sometimes the cycle consists of a single fixed point, such as 0000 or 2500, repeated monotonously. Once the algorithm has fallen into such a trap, it can never escape. In Program 1, the repeating elements are flagged in red, and the program stops after the first full cycle.

The emergence of cycles and fixed points in this process should not be surprising; as a matter of fact, it’s inevitable. The middle-square function maps a finite set of numbers into itself. The set of four-digit decimal numbers has just $10^4 = 10{,}000$ elements, so the algorithm cannot possibly run for more than 10,000 steps without repeating itself. What’s disappointing about Program 1 is not that it becomes repetitive but that it does so prematurely, never coming anywhere near the limit of 10,000 iterations. The longest run before a value repeats is 111 steps (enter 6239 in Program 1 to see this sequence); the median run length (the value for which half the runs are shorter and half are longer, measured over all possible seeds) is 45.
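Run lengths of this kind can be measured with a short loop. The sketch below counts the distinct values produced before the first repeat; that counting convention is an assumption on my part, so its results may differ by one from the figures quoted above, which come from the author's own program.

```python
from statistics import median

def middle_square(x, n=4):
    """Middle n digits of x squared, viewed as a 2n-digit number."""
    return (x * x // 10 ** (n // 2)) % 10 ** n

def run_length(seed, n=4):
    """Count the distinct values generated before any value repeats."""
    seen = set()
    x = seed
    while x not in seen:
        seen.add(x)
        x = middle_square(x, n)
    return len(seen)

# Survey every four-digit seed, 0000 through 9999.
lengths = [run_length(seed) for seed in range(10_000)]
print(max(lengths), median(lengths))
```

The loop is guaranteed to terminate because the set `seen` can never hold more than $10^n$ values.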

The middle-square system of Program 1 has three cycles, each of length four, as well as five fixed points. Figure 1 reveals the fates of all 10,000 seed values.

Figure 1.

Each of the numbers printed in a blue lozenge is a terminal value of the middle-square process. Whenever the system reaches one of these numbers, it will thereafter remain trapped forever in a repeating pattern. The arrays of red dots show how many of the four-digit seed patterns feed into each of the terminal numbers. In effect, the diagram divides the space of 10,000 seed values into 17 watersheds, one for each terminal value. Once you know which watershed a seed belongs to, you know where it’s going to wind up.

The watersheds vary in size. The area draining into terminal value 6100 includes more than 3,100 seeds, and the fixed point 0000 gathers contributions from almost 2,000 seeds. At the other end of the scale, the fixed point 3792 has no tributaries at all; it is the successor of itself but of no other number.
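A tally like the one in Figure 1 can be approximated with the sketch below. It is not the code behind the figure; it assumes that a seed's watershed is identified by the first value its sequence visits twice, which is the point where the sequence enters a cycle or fixed point.

```python
from collections import Counter

def middle_square(x):
    """Middle four digits of x squared, viewed as an eight-digit number."""
    return (x * x // 100) % 10_000

def entry_point(seed):
    """Return the first value that the sequence from `seed` visits twice."""
    seen = set()
    x = seed
    while x not in seen:
        seen.add(x)
        x = middle_square(x)
    return x

# Count how many of the 10,000 seeds drain into each terminal value.
sizes = Counter(entry_point(seed) for seed in range(10_000))
for terminal, count in sizes.most_common():
    print(f"{terminal:04d}  {count}")
```

Every member of a cycle is its own entry point, so each terminal value appears in the tally at least once.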

For a clearer picture of what’s going on in the watersheds, we need to zoom in close enough to see the individual rills and streams and rivers that drain into each terminal value. Figure 2 provides such a view for the smallest of the three watersheds that terminate in cycles.

Figure 2.

The four terminal values forming the cycle itself are again shown as blue lozenges; all the numbers draining into this loop are reddish brown. The entire set of numbers forms a directed graph, with arrows that lead from each node to its middle-square successor. Apart from the blue cycle, the graph is treelike: A node can have multiple incoming arrows but only one outgoing arrow, so that streams can merge but never split. Thus from any node there is only one path leading to the cycle. This structure has another consequence: Since some nodes have multiple inputs, other nodes must have none at all. They are shown here with a heavy outline.

The ramified nature of this graph helps explain why the middle-square algorithm tends to yield such short run lengths. The component of the graph shown in Figure 2 has 86 nodes (including the four members of the repeating cycle). But the longest pathways following arrows through the graph cover just 15 nodes before repetition sets in. The median run length is just 10 nodes.

A four-digit random number generator is little more than a toy; even if it provided the full run length of 10,000 numbers, it would be too small for modern applications. Working with “wider” numerals allows the middle-square algorithm to generate longer runs, but the rate of increase is still disappointing. For example, with eight decimal digits, the maximum potential cycle length is 100 million, but the actual median run length is only about 2,700. More generally, if the maximum run length is $N$, the observed median run length seems to be $c \sqrt{N}$, where $c$ is a constant less than 1. This relationship is plotted in Figure 3, on the left for decimal numbers of 4 to 12 digits and on the right for binary numbers of 8 to 40 bits. (Only even widths are considered. With an odd number of digits, the concept of “middle” becomes ambiguous.)

Figure 3.

The graphs make clear that the middle-square algorithm yields only a puny supply of random numbers compared to the size of the set from which those numbers are drawn. In the case of 12-digit decimal numbers, for example, the pool of possibilities has $10^{12}$ elements, yet a typical 12-digit seed yields only about 300,000 random numbers before getting stuck in a cycle. That’s only a third of a millionth of the available supply. It seems we’re investing considerable effort for a paltry return. (Close examination of Figure 3 suggests that the formula $c \sqrt{N}$ is only an approximation. It seems the constant $c$ isn’t really constant; it ranges from less than 0.25 to more than 0.5. I don’t know what causes the variations. They are not just statistical fluctuations.)
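The implied value of $c$ can be backed out from the median run lengths quoted in the text for decimal widths of 4, 8, and 12 digits. Those medians are rounded figures, so the results are only indicative:

$$
c \approx \frac{\text{median run length}}{\sqrt{N}}: \qquad
\frac{45}{\sqrt{10^{4}}} = 0.45, \qquad
\frac{2{,}700}{\sqrt{10^{8}}} = 0.27, \qquad
\frac{300{,}000}{\sqrt{10^{12}}} = 0.30
$$

All three values are less than 1, but they are clearly not equal to one another.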

Von Neumann came up with the middle-square algorithm circa 1947, while planning the first computer simulations of nuclear fission. This work was part of a postwar effort to build ever-more-ghastly nuclear weapons. Random numbers were needed for a simulation of neutrons diffusing through various components of a bomb assembly. When a neutron strikes an atomic nucleus, it can either bounce off or be absorbed; in some cases absorption triggers a fission event, splitting the nucleus and releasing more neutrons. The bomb explodes only if the population of neutrons grows rapidly enough. The simulation technique, in which random numbers determined the fates of thousands of individual neutrons, came to be known as the Monte Carlo method.

The simulations were run on ENIAC, the pioneering vacuum-tube computer built at the University of Pennsylvania and installed at the Army’s Aberdeen Proving Ground in Maryland. The machine had just been converted to a new mode of operation. ENIAC was originally programmed by plugging patch cords into switchboard-like panels; the updated hardware allowed a sequence of instructions to be read into computer memory from punched cards, and then executed.

The first version of the Monte Carlo program, run in the spring of 1948, used middle-square random numbers with eight decimal digits; as noted above, this scheme yields a median run length of about 2,700. A second series of simulations, later that year, had a 10-digit pseudorandom generator, for which the median run length is about 30,000.

When I first read about the middle-square algorithm, I assumed it was used for these early ENIAC experiments and then supplanted as soon as something better came along. I was wrong about this.

One of ENIAC’s successors was a machine at the Los Alamos Laboratory called MANIAC. Such funny guys, those bomb builders. At bitsavers.org I stumbled on a manual for programmers and operators of MANIAC, written by John B. Jackson and Nicholas Metropolis, denizens of Los Alamos who were among the designers of the machine. Toward the end of the manual, Jackson and Metropolis discuss some pre-coded subroutines made available for use in other programs. Subroutine S-251.1 is a pseudorandom number generator. It implements the middle-square algorithm. The manual was first issued in 1951, but the version available online was revised in 1954. Thus it appears the middle-square method was still in use six years after its introduction.

Whereas ENIAC did decimal arithmetic, MANIAC was a binary machine. Each of its internal registers and memory cells could accommodate 40 binary digits, which Jackson and Metropolis called bigits. (The term bit had been introduced in the late 1940s by John Tukey and Claude Shannon, but evidently it had not yet vanquished all competing alternatives.) The middle-square program produced random values of 38 bigits; the squares of these numbers were 76 bigits wide, and had to be split across two registers.

Here is the program listing for the middle-square subroutine:

The entries in the leftmost column are line numbers; the next column shows machine instructions in symbolic form, with arrows signifying transfers of information from one place to another. The items in the third column are memory addresses or constants associated with the instructions, and the notations to the right are explanatory comments. In line 3 the number $x_i$ is squared, with the two parts of the product stored in registers R2 and R4. Then a sequence of left and right shifts (encoded L(1), R(22), and L(1)) isolates the middle bigits, numbered 20 through 57, which become the value of $x_{i+1}$.
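The net effect of the subroutine can be imitated in a few lines of Python. This is not a transcription of S-251.1: the register-level shifts are replaced by a single shift and mask on Python's arbitrary-precision integers, and the code assumes that bigits 20 through 57 of the 76-bigit product are the ones left after discarding 19 at each end. The names `WIDTH`, `MASK`, and `middle_square_binary` are my own.

```python
WIDTH = 38                    # bigits in each random value, per the manual
MASK = (1 << WIDTH) - 1       # 38 one-bits

def middle_square_binary(x):
    """Square a 38-bit value and keep the middle 38 of the 76 product bits."""
    assert 0 <= x <= MASK
    product = x * x                       # up to 76 bits wide
    return (product >> (WIDTH // 2)) & MASK   # drop 19 low bits, keep next 38

# Generate a few iterates from an arbitrary (hypothetical) seed.
x = 123_456_789_012 & MASK
for _ in range(5):
    x = middle_square_binary(x)
    print(f"{x:038b}")
```

The binary version shares the weakness of the decimal one. For example, $2^{19}$ is a fixed point, since its square is $2^{38}$ and shifting right by 19 places returns $2^{19}$.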

MANIAC’s standard subroutine was probably still in use at Los Alamos as late as 1957. In that year the laboratory published A Practical Manual on the Monte Carlo Method for Random Walk Problems by E. D. Cashwell and C. J. Everett, which describes a 38-bit middle-square random number generator that sounds very much like the same program. Cashwell and Everett indicate that standard practice was always to use a particular seed known to yield a run length of “about 750,000.” My tests show a run length of 717,728.

Another publication from 1957 also mentions a middle-square program in use at Los Alamos. The authors were W. W. Wood and F. R. Parker, who were simulating the motion of molecules confined to a small volume; this was one of the first applications of a new form of the Monte Carlo protocol, known as Markov chain Monte Carlo. Wood and Parker note: “The pseudo-random numbers were obtained as appropriate portions of 70 bit numbers generated by the middle square process.” This work was done on IBM 701 and 704 computers, which means the algorithm must have been implemented on yet another machine architecture.

John W. Mauchly, one of ENIAC’s principal designers, also spread the word of the middle-square method. In a 1949 talk at a meeting of the American Statistical Association he presented a middle-square variant for the BINAC, a one-off predecessor of the UNIVAC line of computers. I have also seen vague, anecdotal hints suggesting that some form of the middle-square algorithm was used on the UNIVAC 1 in the early 1950s at the Lawrence Livermore laboratory. The hints come from an interview with George A. Michael of Livermore. See pp. 111–112. There are even sketchier intimations in a conversation between Michael and Bob Abbott.

I have often wondered how it came about that one of the great mathematical minds of the 20th century conceived and promoted such a lame idea. It’s even more perplexing that the middle-square algorithm, whose main flaw was recognized from the start, remained in the computational toolbox for at least a decade. How did they even manage to get the job done with such a scanty supply of randomness?

With the benefit of further reading, I think I can offer some plausible guesses in answer to these questions. It helps to keep in mind the well-worn adage of L. P. Hartley: “The past is a foreign country. They do things differently there.”

If you are writing a Monte Carlo program today, you can take for granted a reliable and convenient source of high-quality random numbers. Whenever you need one, you simply call the function random(). The syntax differs from one programming language to another, but nearly every modern language has such a function built in. Moreover, it can be counted on to produce vast quantities of randomness—cosmic cornucopias of the stuff. A generator called the Mersenne Twister, popular since the 1990s and the default choice in several programming languages, promises $2^{19937} - 1$ values before repeating itself.

Of course no such software infrastructure existed in 1948. In those days, the place to find numbers of all kinds—logarithms, sines and cosines, primes, binomial coefficients—was a lookup table. Compiling such tables was a major mathematical industry. Indeed, the original purpose of ENIAC was to automate the production of tables for aiming artillery. The reliance on precomputed tables extended to random numbers. In the 1930s two pairs of British statisticians created tables of 15,000 and 100,000 random decimal digits; a decade later the RAND Corporation launched an even larger endeavor, generating a million random digits. In 1949 the RAND table became available on punched cards (50 digits to a card, 20,000 cards total); in 1955 it was published in book form. (The exciting opening paragraphs are reproduced in Figure 4.)

Figure 4.

It would have been natural for the von Neumann group to adapt one of the existing tables to meet their needs. They avoided doing so because there was no room to store the table in computer memory; they would have to read a punched card every time a number was needed, which would have been intolerably slow. What they did instead was to use the middle-square procedure as if it were a table of a few thousand random numbers. The scheme is explained in a document titled “Actual Running of the Monte Carlo Problems on the ENIAC”, written mainly by Klara von Neumann (spouse of John), who spent weeks in Aberdeen preparing the programs and tending the machine. (This document has been brought to light by Thomas Haigh, Mark Priestley, and Crispin Rope in their recent book ENIAC in Action.)

In the preliminary stages of the project, the group chose a specific seed value (not disclosed in any documents I’ve seen), then generated 2,000 random iterates from this seed. They checked to be sure the program had not entered a cycle, and they examined the numbers to confirm that they satisfied various tests of randomness. At the start of a Monte Carlo run, the chosen seed was installed in the middle-square routine, which then supplied random numbers as needed until the 2,000th request was satisfied. At that point the algorithm was reset and restarted with the same seed. In this way the simulation could continue indefinitely, cycling through the same set of numbers in the same order, time after time. The second round of simulations worked much the same way, but with a block of 3,000 numbers from the 10-digit generator.

Reading about this reuse of a single, small block of numbers, I could not suppress a tiny gasp of horror. Even if those 2,000 numbers pass basic tests of randomness, the concatenation of the sequence with multiple copies of itself is surely ill-advised. Suppose you are using the random numbers to estimate the value of $\pi$ by choosing random points in a square and determining whether they fall inside an inscribed circle. Let each random number specify the two coordinates of a single point. After running through the 2,000 middle-square values you will have achieved a certain accuracy. But repeating the same values is pointless: It will never improve the accuracy because you will simply be revisiting all the same points.
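A minimal sketch of why recycling the block is pointless. The 2,000 points here come from JavaScript's built-in Math.random rather than from the middle-square routine, and a quarter circle in the unit square stands in for the inscribed circle (the area ratio is the same); both are my substitutions, made only to keep the example short.

```javascript
// Estimate pi from points in the unit square: the fraction that lands
// inside the quarter circle x^2 + y^2 <= 1 approximates pi / 4.
function estimatePi(points) {
  let hits = 0;
  for (const [x, y] of points) {
    if (x * x + y * y <= 1) hits += 1;
  }
  return (4 * hits) / points.length;
}

// A fixed block of 2,000 points, standing in for the stored block of values.
const block = Array.from({ length: 2000 }, () => [Math.random(), Math.random()]);

// The same block concatenated with four more copies of itself.
const recycled = [];
for (let pass = 0; pass < 5; pass++) recycled.push(...block);

const once = estimatePi(block);
const fiveTimes = estimatePi(recycled);
console.log(once, fiveTimes); // the two estimates are identical
```

Five passes multiply the hit count and the point count by the same factor, so the ratio cannot change.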

We can learn more about the decisions that led to the middle-square algorithm from the proceedings of a 1949 symposium on Monte Carlo methods. The symposium was convened in Los Angeles by a branch of the National Bureau of Standards. It served as the public unveiling of the work that had been going on in the closed world of the weapons labs. Von Neumann’s contribution was titled “Various Techniques Used in Connection With Random Digits.” The heading of the published version reads “By John von Neumann. Summary written by George E. Forsythe.” I don’t know why von Neumann’s own manuscript could not be published, or what Forsythe’s role in the composition might have been. Forsythe was then a mathematician with the Bureau of Standards; he later founded Stanford’s department of computer science.

Von Neumann makes clear that the middle-square algorithm was not a spur-of-the-moment, off-the-cuff choice; he considered several alternatives. Hardware was one option. “We . . . could build a physical instrument to feed random digits directly into a high-speed computing machine,” he wrote. “The real objection to this procedure is the practical need for checking computations. If we suspect that a calculation is wrong, almost any reasonable check involves repeating something done before. . . . I think that the direct use of a physical supply of random digits is absolutely inacceptable for this reason and for this reason alone.”

The use of precomputed tables of random numbers would have solved the replication problem, but, as noted above, von Neumann dismissed it as too slow if the numbers had to be supplied by punch card. Such a scheme, he said, would rely on “the weakest portion of presently designed machines—the reading organ.”

Von Neumann also mentions one other purely arithmetic procedure, originally suggested by Stanislaw Ulam: iterating the function $f(x) = 4x(1 - x)$, with the initial value $x_0$ lying between 0 and 1. This formula is a special case of a function called the logistic map, which 25 years later would become the jumping off point for the field of study known as chaos theory. The connection with chaos suggests that the function might well be a good source of randomness, and indeed if $x_0$ is an irrational real number, the successive $x_i$ values are guaranteed to be uniformly distributed on the interval (0, 1), and will never repeat. But that guarantee applies only if the computations are performed with infinite precision. In a finite machine, von Neumann observes, “one is really only testing the random properties of the round-off error,” a phenomenon he described as “very obscure, very imperfectly understood.” A 2012 review of logistic-map random number generators by K. J. Persohn and R. J. Povinelli is no more enthusiastic.
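A minimal sketch of the iteration in ordinary double-precision arithmetic, which is the finite-machine setting von Neumann was worried about. The function name and the seeds are mine, chosen only for illustration.

```javascript
// Iterate f(x) = 4x(1 - x) in double-precision floating point and
// return the successive values x_1, x_2, ..., x_steps.
function logisticOrbit(x0, steps) {
  const orbit = [];
  let x = x0;
  for (let i = 0; i < steps; i++) {
    x = 4 * x * (1 - x);
    orbit.push(x);
  }
  return orbit;
}

// Seeds that are exactly representable can collapse at once:
// 0.5 goes to 1 and then sticks at 0, and 0.75 never moves.
console.log(logisticOrbit(0.5, 3));
console.log(logisticOrbit(0.75, 3));
console.log(logisticOrbit(0.3, 5));
```

Every value a machine can hold is rational, so the irrational-seed guarantee never applies; what the program produces is shaped by rounding.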

Two other talks at the Los Angeles symposium also touched on sources of randomness. Preston Hammer reported on an evaluation of the middle-square algorithm performed by a group at Los Alamos, using IBM punch-card equipment. Starting from the ten-digit (decimal) seed value 1111111111, they produced 10,000 random values, checking along the way for signs of cyclic repetition; they found none. (The sequence runs for 17,579 steps before descending into the fixed point at 0.) They also applied a few statistical tests to the first 3,284 numbers in the sequence. “These tests indicated nothing seriously amiss,” Hammer concluded—a rather lukewarm endorsement.

Another analysis was described by George Forsythe (the amanuensis for von Neumann’s paper). His group at the Bureau of Standards studied the four-digit version of the middle-square algorithm (the same as Program 1), tracing 16 trajectories. Twelve of the sequences wound up in the 6100-2100-4100-8100 loop. The run lengths ranged from 11 to 104, with an average of 52. All of these results are consistent with those shown in Figure 1. Presumably, the group stopped after investigating just 16 sequences because the computational process—done with punch cards and desk calculators—was too arduous for a larger sample. (With a modern laptop, it takes about 40 milliseconds to determine the fates of all 10,000 four-digit seeds.)
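A minimal sketch of the exhaustive survey that is now cheap to run. Here I count a run as the number of distinct values visited, seed included, before the first repeat; that definition is my assumption and may differ from the one Forsythe's group used.

```javascript
// Four-digit decimal middle-square step, done arithmetically.
function mid4(x) {
  return Math.floor(((x * x) % 1000000) / 100);
}

// Number of distinct values visited (seed included) before the first repeat.
function runLength(seed) {
  const seen = new Set();
  let x = seed;
  while (!seen.has(x)) {
    seen.add(x);
    x = mid4(x);
  }
  return seen.size;
}

// Trace every one of the 10,000 possible seeds.
const lengths = [];
for (let seed = 0; seed < 10000; seed++) lengths.push(runLength(seed));

const longest = Math.max(...lengths);
const mean = lengths.reduce((sum, n) => sum + n, 0) / lengths.length;
console.log({ longest, mean });
```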

Forsythe and his colleagues also looked at a variant of the middle-square process. Instead of setting $x_{i+1}$ to the middle digits of $x_i^2$, the variant sets $x_{i+2}$ to the middle digits of $x_i \times x_{i+1}$. One might call this the middle-product method. In a variant of the variant, the two factors that form the product are of different sizes. Runs are somewhat longer for these routines than for the squaring version, but Forsythe was less than enthusiastic about the results of statistical tests of uniform distribution.
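A minimal sketch of the middle-product idea for four-digit decimal values, with both factors the same size. The function names and the starting pair are mine; the mixed-size variant is not attempted.

```javascript
// Middle-product step: the middle four digits of the eight-digit
// product of the two most recent values.
function middleProduct(prev, curr) {
  return Math.floor(((prev * curr) % 1000000) / 100);
}

// Build a sequence of `count` values from two starting values.
function middleProductSequence(x0, x1, count) {
  const seq = [x0, x1];
  while (seq.length < count) {
    seq.push(middleProduct(seq[seq.length - 2], seq[seq.length - 1]));
  }
  return seq;
}

console.log(middleProductSequence(1234, 5678, 8));
```

When both arguments are the same number the step reduces to the ordinary middle-square rule.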

Von Neumann was undeterred by these ambivalent reports. Indeed, he argued that a common failure mode of the middle-square algorithm—“the appearance of self-perpetuating zeros at the ends of the numbers $x_i$”—was not a bug but a feature. Because this event was easy to detect, it might guard the Monte Carlo process against more insidious forms of corruption.

Amid all these criticisms of the middle-square method, I wish to insert a gripe of my own that no one else seems to mention. The middle-square algorithm mixes up numbers and numerals in a way that I find disagreeable. In part this is merely a matter of taste, but the conflating of categories does have consequences when you sit down to convert the algorithm into a computer program.

I first encountered the distinction between numbers and numerals at a tender age; it was the subject of chapter 1 in my seventh-grade “New Math” textbook. I learned that a number is an abstraction that counts the elements of a set or assigns a position within a sequence. A numeral is a written symbol denoting a number. The decimal numeral 6 and the binary numeral 110 both denote the same number; so do the Roman numeral VI and the Peano numeral S(S(S(S(S(S(0)))))). Mathematical operations apply to numbers, and they give the same results no matter how those numbers are instantiated as numerals. For example, if $a + b = c$ is a true statement mathematically, then it will be true in any system of numerals: decimal 4 + 2 = 6, binary 100 + 10 = 110, Roman IV + II = VI.

In the middle-square process, squaring is an ordinary mathematical operation: $n^2$ or $n \times n$ should produce the same number no matter what kind of numeral you choose for representing $n$. But “extract the middle digits” is a different kind of procedure. There is no mathematical operation for accessing the individual digits of a number, because numbers don’t have digits. Digits are components of numerals, and the meaning of “middle digits” depends on the nature of those numerals. Even if we set aside exotica such as Roman numerals and consider only ordered sequences of digits written in place-value notation, the outcome of the middle-digits process depends on the radix, or base. The decimal numeral 12345678 and the binary numeral 101111000110000101001110 denote the same number, but their respective middle digits 3456 and 000110000101 are not equal.
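A minimal sketch of the radix dependence, using the same number as the example above. The helper name middleHalf is mine, and it assumes the numeral's length is a multiple of four.

```javascript
// Write n as a numeral in the given radix and keep the middle half of
// its digits.
function middleHalf(n, radix) {
  const numeral = n.toString(radix);
  const quarter = Math.floor(numeral.length / 4);
  return numeral.slice(quarter, numeral.length - quarter);
}

const n = 12345678;
const decimalMiddle = middleHalf(n, 10); // digits of the decimal numeral
const binaryMiddle = middleHalf(n, 2);   // digits of the binary numeral

// Read back as numbers, the two "middles" disagree.
console.log(decimalMiddle, parseInt(decimalMiddle, 10));
console.log(binaryMiddle, parseInt(binaryMiddle, 2));
```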

It appears that von Neumann was conscious of the number-numeral distinction. In a 1948 letter to Alston Householder he referred to the products of the middle-square procedure as “digital aggregates.” What he meant by this, I believe, is that we should not think of the $x_i$ as magnitudes or positions along the number line but rather as mere collections of digits, like letters in a written word.

This nice philosophical distinction has practical consequences when we write a computer program to implement the middle-square method. What is the most suitable data type for the $x_i$? The squaring operation makes numbers an obvious choice. Multiplication and raising to integer powers are basic functions offered by almost any programming language—but they apply only to numbers. For extracting the middle digits, other data structures would be more convenient. If we collect a sequence of digits in an array or a character string—an embodiment of von Neumann’s “digital aggregate”—we can then easily select the middle elements of the sequence by their position, or index. But squaring a string or an array is not something computer systems know how to do.

The JavaScript code that drives Program 1 resolves this conflict by planting one foot in each world. An $x$ value is initially represented as a string of four characters, drawn from the digits 0 through 9. When it comes time to calculate the square of $x$, the program invokes a built-in JavaScript procedure, parseInt, to convert the string into a number. Then $x^2$ is converted back into a string (with the toString method) and padded on the left with as many 0 characters as needed to make the length of the string eight digits. Extracting the middle digits of the string is easy, using a substring method. The heart of the program looks like this:

function mid4(xStr) {
	const xNum = parseInt(xStr, 10);       // convert string to number
	const sqrNum = xNum * xNum;            // square the number
	let sqrStr = sqrNum.toString(10);      // back to decimal string
	sqrStr = sqrStr.padStart(8, "0");      // pad left to 8 digits
	const midStr = sqrStr.substring(2, 6); // select middle 4 digits
	return midStr;
}

The program continually shuttles back and forth between arithmetic operations and methods usually associated with textual data. In effect, we abduct a number from the world of mathematics, do some extraterrestrial hocus-pocus on it, and set it back where it came from.

I don’t mean to suggest it’s impossible to implement the middle-square rule with purely mathematical operations. Here is a version of the function written in the Julia programming language:

function mid4(x)
	return (x^2 % 1000000) ÷ 100
end

In Julia the % operator computes the remainder after integer division; for example, 12345678 % 1000000 yields the result 345678. The ÷ operator returns the quotient from integer division: 345678 ÷ 100 is 3456. Thus we have extracted the middle digits just by doing some funky grade-school arithmetic.

We can even create a version of the procedure that works with numerals of any width and any radix, although those parameters have to be specified explicitly (the procedure as written assumes the width is an even number; allowing odd widths is doable but messy):

function midSqr(x, radix, width)            # width must be even
	modulus = radix^((3 * width) ÷ 2)
	divisor = radix^(width ÷ 2)
	return (x^2 % modulus) ÷ divisor
end

This program is short, efficient, and versatile—but hardly a model of clarity. If I encountered it out of context, I would have a hard time figuring out what it does.

There are still more ways to accomplish the task of plucking out the middle digits of a numeral. The mirror image of the all-arithmetic version is a program that eschews numbers entirely and performs the computation in the realm of digit sequences. To make that work, we need to write a multiplication procedure for indexable strings or arrays of digits.
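A minimal sketch of that mirror-image version, with the value held as an array of digits, most significant first. The multiplication is the schoolbook method; single-digit products and carries still use machine arithmetic, standing in for a memorized times table. The function names are mine.

```javascript
// Schoolbook multiplication on digit arrays (most significant digit first).
// The result always has a.length + b.length digits, with leading zeros.
function multiplyDigits(a, b, radix = 10) {
  const product = new Array(a.length + b.length).fill(0);
  for (let i = a.length - 1; i >= 0; i--) {
    let carry = 0;
    for (let j = b.length - 1; j >= 0; j--) {
      const pos = i + j + 1;
      const t = product[pos] + a[i] * b[j] + carry;
      product[pos] = t % radix;
      carry = Math.floor(t / radix);
    }
    product[i] += carry; // this slot is still zero when the row finishes
  }
  return product;
}

// Square the digit sequence, then select the middle digits by index.
function midSquareDigits(digits, radix = 10) {
  const square = multiplyDigits(digits, digits, radix);
  const start = digits.length / 2; // assumes an even number of digits
  return square.slice(start, start + digits.length);
}

console.log(midSquareDigits([8, 6, 5, 3])); // middle digits of 74874409
```

Because the product array is already zero-padded to twice the width, no separate padding step is needed.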

The MANIAC program shown above takes yet another approach, using a single data type that can be treated as a number one moment (during multiplication) and as a sequence of ones and zeros the next (in the left-shift and right-shift operations).
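A minimal sketch of the shift-based idea for eight-bit values. It is not a transcription of the MANIAC code; the 16-bit mask is my way of imitating a fixed-width register in JavaScript, where shifts operate on 32-bit integers.

```javascript
// Eight-bit binary middle-square step using shifts instead of division.
function mid8bits(x) {
  const square = x * x;                       // fits in 16 bits for x < 256
  const shiftedLeft = (square << 4) & 0xffff; // left shift pushes the top 4 bits out
  return shiftedLeft >>> 8;                   // right shift drops the low bits, leaving 8
}

console.log(mid8bits(165)); // 165 is a fixed point of the eight-bit generator
```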

Let us return to von Neumann’s talk at the 1949 Los Angeles symposium. Tucked into his remarks is the most famous of all pronouncements on pseudorandom number generators:

Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number—there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.

Today we live under a new dispensation. Faking randomness with deterministic algorithms no longer seems quite so naughty, and the concept of pseudorandomness has even shed its aura of paradox and contradiction. Now we view devices such as the middle-square algorithm not as generators of de novo randomness but as amplifiers of pre-existing randomness. All the genuine randomness resides in the seed you supply; the machinations of the algorithm merely stir up that nubbin of entropy and spread it out over a long sequence of numbers (or numerals).

When von Neumann and his colleagues were designing the first Monte Carlo programs, they were entirely on their own in the search for an algorithmic source of randomness. There was no body of work, either practical or theoretical, to guide them. But that soon changed. Just three months after the Los Angeles symposium, a larger meeting on computation in the sciences was held at Harvard. (The proceedings, published in 1951, are available online at bitsavers.org.) Ulam spoke on the Monte Carlo method but did not mention random-number generators. That subject was left to Derrick H. Lehmer, an ingenious second-generation number theorist from Berkeley. Lehmer introduced the idea of linear congruential generators, which remain an important family today.

After reviewing the drawbacks of the middle-square method, Lehmer proposed the following simple function for generating a stream of pseudorandom numbers, each of eight decimal digits. (A note for the nitpicky: One possible value of this sequence has nine decimal digits. A machine strictly limited to eight digits will need some provision to handle that case.)

\[x_{i+1} = 23x_i \bmod 100000001\]

For any seed value in the range $0 \lt x_0 \lt 100000001$ this formula generates a cyclic pseudorandom sequence with a period of 5,882,352, more than 2,000 times the median length of eight-digit middle-square sequences. And there are advantages apart from the longer runs. The middle-square algorithm produces lollipop loops, with a stem traversed just once before the system enters a short loop or becomes stuck at a fixed point. Only the stem portion is useful for generating random numbers. Lehmer’s algorithm, in contrast, yields long, simple, stemless loops; the system follows the entire length of 5,882,352 numbers before returning to its starting point.
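A minimal sketch of Lehmer's generator, taking the modulus to be $10^8 + 1$ (that is, 100000001), the value consistent with eight-digit output and a single nine-digit exception. The names are mine. The largest intermediate product is about $2.3 \times 10^9$, well within the range JavaScript numbers represent exactly.

```javascript
const LEHMER_MULTIPLIER = 23;
const LEHMER_MODULUS = 100000001; // 10^8 + 1 (assumed; see the note above)

// One step of the generator.
function lehmerNext(x) {
  return (LEHMER_MULTIPLIER * x) % LEHMER_MODULUS;
}

// The first `count` values that follow the seed.
function lehmerStream(seed, count) {
  const out = [];
  let x = seed;
  for (let i = 0; i < count; i++) {
    x = lehmerNext(x);
    out.push(x);
  }
  return out;
}

console.log(lehmerStream(1, 6));
```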

Figure 5.

Simple loops also make it far easier to detect when the system has in fact begun to repeat itself. All you have to do is compare each $x_i$ with the seed value, $x_0$, since that initial value will always be the first to repeat. With the stem-and-loop topology, cycle-finding is tricky. The best-known technique, attributed to Robert W. Floyd, is the hare-and-tortoise algorithm. It requires $3n$ evaluations of the number-generating function to identify a cycle with run length $n$.
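A minimal sketch of the hare-and-tortoise technique, applied to the four-digit middle-square function. It follows the usual textbook formulation of Floyd's method; the names and return format are mine.

```javascript
// Four-digit decimal middle-square step.
function mid4(x) {
  return Math.floor(((x * x) % 1000000) / 100);
}

// Floyd's cycle finding for an iterated function f started at x0.
// Returns the stem length and the loop length, using constant memory.
function findCycle(f, x0) {
  // Phase 1: the hare moves two steps per tortoise step until they meet.
  let tortoise = f(x0);
  let hare = f(f(x0));
  while (tortoise !== hare) {
    tortoise = f(tortoise);
    hare = f(f(hare));
  }
  // Phase 2: restart the tortoise at x0; moving both one step at a
  // time, they meet at the first value on the loop.
  let stem = 0;
  tortoise = x0;
  while (tortoise !== hare) {
    tortoise = f(tortoise);
    hare = f(hare);
    stem += 1;
  }
  // Phase 3: walk once around the loop to measure it.
  let loop = 1;
  hare = f(tortoise);
  while (tortoise !== hare) {
    hare = f(hare);
    loop += 1;
  }
  return { stem, loop };
}

console.log(findCycle(mid4, 50)); // 50 -> 25 -> 6 -> 0 -> 0 ...
```

For a stemless generator such as Lehmer's, none of this is needed: comparing each output with the seed is enough.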

Lehmer’s algorithm is an example of a linear congruential generator, with the general formula:

\[x_{i+1} = (ax_i + b) \bmod m.\]

Linear congruential algorithms have the singular virtue of being governed by well-understood principles of number theory. Just by inspecting the constants in the formula, you can know the complete structure of cycles in the output of the generator. Lehmer’s specific recommendation ($a = 23$, $b = 0$, $m = 100000001$) is no longer considered best practice, but other members of the family are still in use. With a suitable choice of constants, the generator achieves the maximum possible period of $m - 1$, so that the algorithm marches through the entire range of numbers in a single long cycle.
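A minimal sketch of the general formula, with a brute-force period count to show how much the choice of constants matters. The toy constants ($m = 7$ with $a = 3$ or $a = 2$) are mine, picked so the cycles can be checked by hand.

```javascript
// Build the step function x -> (a*x + b) mod m.
// Assumes a*x + b stays below 2^53 so the arithmetic is exact.
function makeLcg(a, b, m) {
  return (x) => (a * x + b) % m;
}

// Count steps until the seed reappears. Gives up after `limit` steps,
// since a seed that sits on a stem would never come back.
function periodFromSeed(step, seed, limit = 1e7) {
  let x = step(seed);
  let n = 1;
  while (x !== seed) {
    if (n >= limit) return null;
    x = step(x);
    n += 1;
  }
  return n;
}

// With m = 7 and b = 0, the multiplier 3 visits all six nonzero
// residues, while the multiplier 2 visits only three of them.
console.log(periodFromSeed(makeLcg(3, 0, 7), 1)); // 1, 3, 2, 6, 4, 5, 1
console.log(periodFromSeed(makeLcg(2, 0, 7), 1)); // 1, 2, 4, 1
```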

At this point we come face to face with the big question: If the middle-square method is so feeble and unreliable, why did some very smart scientists persist in using it, even after the flaws had been pointed out and alternatives were on offer? Part of the answer, surely, is habit and inertia. For the veterans of the early Monte Carlo studies, the middle-square method was a known quantity, a path of least resistance. For those at Los Alamos, and perhaps a few other places, the algorithm became an off-the-shelf component, pretested and optimized for efficiency. People tend to go with the flow; even today, most programmers who need random numbers accept whatever function their programming language provides. That’s certainly my own practice, and I can offer an excuse beyond mere laziness: The default generator may not be ideal, but it’s almost certainly better than what I would cobble together on my own.

Another part of the answer is that the middle-square method was probably good enough to meet the needs of the computations in which it was used. After all, the bombs they were building did blow up.

Many simulations work just fine even with a crude randomness generator. In 2005 Richard Procassini and Bret Beck of the Livermore lab tested this assertion on Monte Carlo problems of the same general type as the neutron-diffusion studies done on ENIAC in the 1940s. They didn’t include the middle-square algorithm in their tests; all of their generators were of the linear congruential type. But their main conclusion was that “periods as low as $m = 2^{16} = 65{,}536$ are sufficient to produce unbiased results.”

Not all simulations are so easy-going, and it’s not always easy to tell which ones are finicky and which are tolerant. Simple tasks can have stringent requirements, as with the estimation of $\pi$ mentioned above. And even without repetitions, subtle biases or correlations can skew the results of Monte Carlo studies in ways that are hard to foresee. Thirty years ago an incident that came to be known as the Ferrenberg affair revealed flaws in random number generators that were then considered the best of their class. So I’m not leading a movement to revive the middle-square method.

I would like to close with two oddities.

First, in discussing Figure 1, I mentioned the special status of the number 3792, which is an isolated fixed point of the middle-square function. Like any other fixed point, 3792 is its own successor: $3792^2$ yields 14,379,264, whose four middle digits are 3792 again. Unlike other fixed points, 3792 is not the successor of any other number among the 10,000 possible seeds. The only way it will ever appear in the output of the process is if you supply it as the input—the seed value.

In 1954 Nicholas Metropolis encountered the same phenomenon when he plotted the fates of all 256 seeds in the eight-bit binary middle-square generator. Here is his graph of the trajectories:

Figure 6.

At the lower left is the isolated fixed point 165. In binary notation this number is 10100101, and its square is 0110101001011001, so it is indeed a fixed point. Metropolis comments: “There is one number, $x = 165$, which, when expressed in binary form and iterated, reproduces itself and is never reproduced by any other number in the series. Such a number is called ‘samoan.’” The diagram also includes this label.

The mystery is: Why “samoan”? Metropolis offers no explanation. I suppose the obvious hypothesis is that an isolated fixed point is like an island, but if that’s the intended sense, why not pick some lonely island off by itself in the middle of an ocean? St. Helena, maybe. Samoa is a group of islands, and not particularly isolated. I have scoured the internets for other uses of “samoan” in a mathematical context, and come up empty. I’ve looked into whether it might refer not to the island nation but to the Girl Scout cookies called Samoas (but they weren’t introduced until the 1970s). I’ve searched for some connection with Margaret Mead’s famous book Coming of Age in Samoa. No clues. Please share if you know anything!

On a related note, when I learned that the eight-bit binary and the four-digit decimal systems each have a single “samoan,” I began to wonder if this was more than coincidence. I found that the two-digit decimal function also has a single isolated fixed point, namely $x_0 = 50$. Perhaps all middle-square functions share this property? Wouldn’t that be a cute discovery? Alas, it’s not so. The six-digit decimal system is a counterexample.
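A minimal sketch of how such a search can be run for any radix and any even width. The step function is a JavaScript port of the arithmetic approach; the names are mine. A fixed point is treated as isolated when it is its own only predecessor.

```javascript
// Middle-square step for numerals of even `width` in the given radix.
function midSqr(x, radix, width) {
  const modulus = radix ** ((3 * width) / 2);
  const divisor = radix ** (width / 2);
  return Math.floor(((x * x) % modulus) / divisor);
}

// Fixed points that are not the successor of any other seed.
function isolatedFixedPoints(radix, width) {
  const size = radix ** width;
  const predecessors = new Array(size).fill(0);
  for (let x = 0; x < size; x++) {
    predecessors[midSqr(x, radix, width)] += 1;
  }
  const found = [];
  for (let x = 0; x < size; x++) {
    // A fixed point counts itself once; isolated means nothing else maps to it.
    if (midSqr(x, radix, width) === x && predecessors[x] === 1) found.push(x);
  }
  return found;
}

console.log(isolatedFixedPoints(10, 2));
console.log(isolatedFixedPoints(10, 4));
console.log(isolatedFixedPoints(2, 8));
```

Calling isolatedFixedPoints(10, 6) runs the same search over the million six-digit seeds.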

The second oddity concerns a supposed challenge to von Neumann’s priority as the inventor of the middle-square method. In The Broken Dice, published in 1993, the French mathematician Ivar Ekeland tells the story of a 13th-century monk, “a certain Brother Edvin, from the Franciscan monastery of Tautra in Norway, of whom nothing else is known to us.” Ekeland reports that Brother Edvin came up with the following procedure for settling disputes or making decisions randomly, as an alternative to physical devices such as dice or coins:

The player chooses a four-digit number and squares it. He thereby obtains a seven- or eight-digit number of which he eliminates the two last and the first one or two digits, so as to once again obtain a four-digit number. If he starts with 8,653, he squares this number (74,874,409) and keeps the four middle digits (8,744). Repeating the operation, he successively obtains:

8,653 8,744 4,575 9,306 6,016

Thus we have a version of the middle-square method 700 years before von Neumann, and only 50 years after the introduction of Indo-Arabic numerals into Europe. The source of the tale, according to Ekeland, is a manuscript that has been lost to the world, but not before a copy was made in the Vatican archives at the direction of Jorge Luis Borges, who passed it on to Ekeland.

The story of Brother Edvin has not yet made its way into textbooks or the scholarly literature, as far as I know, but it has taken a couple of small steps in that direction. In 2010 Bill Gasarch retold the tale in a blog post titled “The First pseudorandom generator- probably.” And the Wikipedia entry for the middle-square method presents Ekeland’s account as factual—or at least without any disclaimer expressing doubts about its veracity.

To come along now and insist that Brother Edvin’s algorithm is a literary invention, a fable, seems rather unsporting. It’s like debunking the Tooth Fairy. So I won’t do that. You can make up your own mind. But I wonder if von Neumann might have welcomed this opportunity to shuck off responsibility for the algorithm.

Jotto

Brian Hayes — Wed, 15 Jun 2022 17:50:45 +0000

A day or two after publishing my TL;DR on Wordle algorithms, I stumbled on a remarkable paper that neatly summarizes all the main ideas. The remarkable part is that the paper was written 50 years before Wordle was invented!

The paper is “Information Theory and the Game of Jotto,” issued in August of 1971 as Artificial Intelligence Memo No. 28 from the AI Lab at MIT. The author was Michael D. Beeler, known to me mainly as one of the three principal authors of HAKMEM (the others were Bill Gosper and Rich Schroeppel). Beeler later worked at Bolt, Beranek, and Newman, an MIT spinoff.

Wikipedia tells me that Jotto was invented in 1955 by Morton M. Rosenfeld as a game for two players. As in Wordle, you try to discover a secret word by submitting guess words and getting feedback about how close you have come to the target. The big difference is that JOTTO’s feedback offers only a crude measure of closeness. You learn the number of letters in your guess word that match one of the letters in the target word. You get no indication of which letters match, or whether they are in the correct positions.

The unit of measure for closeness is the jot. Beeler gives the example of playing GLASS against SMILE, which earns a closeness score of two jots, since there are matches for the letter L and for one S. Unlike the Wordle feedback rule, this scoring scheme is symmetric: The score remains the same if you switch the roles of guess and target word.
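A minimal sketch of the scoring rule. The treatment of repeated letters (each letter of the target can be matched only once) is my reading of the GLASS and SMILE example, not something taken from Beeler's program.

```javascript
// Count the letters of `guess` that can be paired with letters of
// `target`, using each target letter at most once.
function jots(guess, target) {
  const remaining = new Map();
  for (const ch of target) {
    remaining.set(ch, (remaining.get(ch) || 0) + 1);
  }
  let score = 0;
  for (const ch of guess) {
    const left = remaining.get(ch) || 0;
    if (left > 0) {
      score += 1;
      remaining.set(ch, left - 1);
    }
  }
  return score;
}

console.log(jots("GLASS", "SMILE"), jots("SMILE", "GLASS")); // same either way
```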

A defect of the game, in my view, is that you can max out the score at five jots and still not know the target word. For example, when a five-jot score tells you that the letters of the target are {A, E, G, L, R}, the word could be GLARE, LAGER, LARGE, or REGAL. Your only way to pin down the answer is to guess them in sequence.

Beeler’s main topic is not how the game proceeds between human players but how a computer can be programmed to take the role of a player. He reports that “A JOTTO program has existed for a couple of years at MIT’s A.I. Lab,” meaning it was created sometime in the late 1960s. He says nothing about who wrote this program. I’m going to make the wild surmise that Beeler himself might have been the author, particularly given his intimate knowledge of the program’s innards.

Here’s the crucial passage, lifted directly from the memo:

The strategy described here—maximizing the information gain from each guess—is exactly what’s recommended for Wordle. But where Wordle divvies up the potential target words into $3^5 = 243$ subsets, the JOTTO scoring rule defines only six categories (0 through 5 jots). As a result, the maximum possible information gain is only about 2.6 bits in JOTTO, compared with almost 8 bits in Wordle.
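A minimal sketch of the quantity being maximized: the entropy, in bits, of the way a guess splits the remaining candidates into feedback classes. It assumes every candidate is equally likely; the scoring function is passed in, so the same code serves for jots or for Wordle colors. The names are mine.

```javascript
// Expected information (in bits) from playing `guess`, given the list
// of still-possible targets and a feedback function score(guess, target).
function expectedInformation(guess, candidates, score) {
  const counts = new Map();
  for (const target of candidates) {
    const key = String(score(guess, target));
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let bits = 0;
  for (const count of counts.values()) {
    const p = count / candidates.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// The ceiling is reached only when all feedback classes are equally
// likely: log2 of the number of classes.
console.log(Math.log2(6));   // Jotto: six possible scores
console.log(Math.log2(243)); // Wordle: 3^5 color patterns
```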

Beeler also recognized a limitation of this “greedy” strategy. “It is conceivable that the test word with the highest expectation at the current point in the game has a good chance of getting us to a point where we will NOT have any particularly good test words available . . . I am indebted to Bill Gosper for pointing out this possibility; the computation required, however, is impractical, and besides, the program seems to do acceptably as is.”

The JOTTO program was written in the assembly language of the PDP-6 and PDP-10 family of machines from the Digital Equipment Corporation, which were much loved in that era at MIT. (Beeler praises the instruction set as “very symmetrical, complete, powerful and easy to think in.”) But however elegant the architecture, physical resources were cramped, with a maximum memory capacity of about one megabyte. Nevertheless, Beeler found room for the program itself, for a dictionary of about 7,000 words, and for tables of precomputed responses to the first two or three guesses.

Humbling.

Words for the Wordle-Weary

Brian Hayes — Wed, 01 Jun 2022 19:19:48 +0000

When the Wordle wave washed over the world some months ago, I played along like everybody else, once a day collecting my rows of gray and gold and green letters. But my main interest was not in testing my linguistic intuitions; I wanted to write a computer program to solve the puzzle. Could I create something that would play a stronger game than I do? It’s now clear the answer to that question is yes, but I can’t say whether it’s because I’m such a hotshot programmer or such a mediocre Wordler.

I hasten to add that my motivation in this project is not to cheat. Josh Wardle, the creator of Wordle, took all the fun out of cheating by making it way too easy. Anybody can post impressive results like this one:

That lonely row of green squares indicates that you’ve solved the puzzle with a single brilliant guess. The Wordle app calls that performance an act of “Genius,” but if it happens more than once every few years, your friends may suspect another explanation.

The software I’ve written will not solve today’s puzzle for you. You can’t run the program unless you already know the answer. Thus the program won’t tell you, in advance, how to play; but it may tell you, retrospectively, how you should have played.

The aim in Wordle is to identify a hidden five-letter word. When you make a guess, the Wordle app gives you feedback by coloring the letter tiles. A green tile indicates that you have put the right letter in the right place. Gold means the letter is present in the target word, but not where you put it. Letters absent from the target word are marked with a gray tile. The rules are not quite as simple as they seem, as I’ll explain below.

Figure 1.

The sequence of grids above records the progress of a game I played several weeks ago. The initial state of the puzzle is a blank grid with space for six words of five letters each. My first guess, CRATE (far left), revealed that the letter T is present in the target word, but not in the fourth position; the other letters, CRAE, are absent. I then played BUILT and learned where the T goes, and that the letters I and L are also present. The third guess, LIMIT, got three letters in their correct positions, which was enough information to rule out all but one possibility. The five green tiles in row four confirm that LIGHT is the target word.

The official Wordle app gives you six chances to guess the Wordle-of-the-day. If you fill the grid with six wrong answers, the game ends and the app reveals the word you failed to find.

It’s important to bear in mind that Wordle is played in a closed universe. The target words are drawn from a list of 2,315 possibilities, which are meant to be words familiar to anyone with a broad English vocabulary. All of these target words are also valid guess words. An additional 10,657 words are acceptable as guesses but will never appear as the Wordle-of-the-day. Many of these latter words are quite obscure, familiar only to the most serious Scrabble players. I call them the arcana. (The word lists used throughout this essay come from the original version of the game published by Josh Wardle at powerlanguage.co.uk. In February, when the game moved to nytimes.com, about two dozen words were shuffled around or removed. Note 1 has further discussion of the word lists and the Times’s revisions.)

Both of the word lists are downloaded to your web browser whenever you play the game, and you can examine or copy them by peeking at the JavaScript source code with your browser’s developer tools.

Conceptually, a Wordle-playing program conducts a dialogue between two computational agents: the Umpire and the Player. The Umpire knows the target word, and responds to submitted guess words in much the same way the Wordle app does. It marks each of the five letters in the guess as either correct, present, or absent, corresponding to the Wordle color codes of green, gold, and gray.
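A minimal sketch of an Umpire. The two-pass handling of repeated letters (exact matches first, then at most as many "present" marks as there are unmatched copies in the target) is my assumption about how the feedback should behave; it is not copied from the program discussed here.

```javascript
// Mark each letter of `guess` as "correct", "present", or "absent"
// relative to `target`. Both words must have the same length.
function umpire(guess, target) {
  const marks = new Array(guess.length).fill("absent");
  const unmatched = new Map();
  // Pass 1: exact matches; tally the target letters left over.
  for (let i = 0; i < guess.length; i++) {
    if (guess[i] === target[i]) {
      marks[i] = "correct";
    } else {
      unmatched.set(target[i], (unmatched.get(target[i]) || 0) + 1);
    }
  }
  // Pass 2: a letter is "present" only while an unmatched copy remains.
  for (let i = 0; i < guess.length; i++) {
    if (marks[i] === "correct") continue;
    const left = unmatched.get(guess[i]) || 0;
    if (left > 0) {
      marks[i] = "present";
      unmatched.set(guess[i], left - 1);
    }
  }
  return marks;
}

console.log(umpire("CRATE", "LIGHT"));
console.log(umpire("BUILT", "LIGHT"));
console.log(umpire("LIMIT", "LIGHT"));
```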

The Player component of the program does not know the target word, but it has access to the lists of common and arcane words. On each turn, the Player selects a word from one of these lists, submits it to the Umpire, and receives feedback classifying each of the letters as correct, present, or absent. This information can then be used to help choose the next guess word. The guessing process continues until the Player identifies the target word or exhausts its six turns.

Here’s a version of such a program. Type in a start word and a target word, then press Go. Briefly displayed in gray letters are the words the program evaluates as possible next guesses. The most promising candidate is then submitted to the Umpire and displayed with the green-gold-gray color code.

Program 1.

Feel free to play with the buttons below the grid. The “?” button will pop up a brief explanation. If you’d like to peek behind the curtain, the source code for the project is available on GitHub. Also, a standalone version of the program may be more convenient if you’re reading on a small screen.

This program is surely not the best possible Wordler, but it’s pretty good. Most puzzle instances are solved in three or four guesses. Failures to finish within six guesses are uncommon if you choose a sensible starting word. The algorithms behind Program 1 will be the main subject of this essay, but before diving into them I would like to consider some simpler approaches to playing the game.

When I first set out to write a Wordler program, I chose the easiest scheme I could think of. At the start of a new game, the Player chooses a word at random from the list of 2,315 potential target words. (The arcana will not be needed in this program.) After submitting this randomly chosen word as a first guess, the Player takes the Umpire’s feedback report and uses it to sift through the list of potential targets, retaining those words that are still viable candidates and setting aside all those that conflict in some way with the feedback. The Player then selects another word at random from among the surviving candidates, and repeats the process. On each round, the list of candidates shrinks further.

The winnowing of the candidate list is the heart of the algorithm. Suppose the first guess word is COULD, and the Umpire’s evaluation is equivalent to this Wordle markup: gray C, gold O, gray U, green L, gray D. You can now go through the list of candidate words and discard all those that have a letter other than L in the fourth position. You can also eliminate all words that include any instance of the letters C, U, or D. The gold letter O rules out all words in two disjoint classes: those that don’t have an O anywhere in their spelling, and those that do have an O in the second position. After this winnowing process, only seven words remain as viable candidates.

One aspect of these rules often trips me up when I’m playing. Wordle is not Wheel of Fortune: The response to a guess word might reveal that the target word has an L, but it doesn’t necessarily show you all the Ls. If you play COULD and get the feedback displayed above, you should not assume that the green L in the fourth position is the only L. The target word could be ATOLL, HELLO, KNOLL, or TROLL. (When both the guess and the target words have multiple copies of the same letter, the rules get even trickier. If you want to know the gory details, see Note 2.)
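The three winnowing rules for the COULD example can be sketched as a filter. This is an illustration, not the program's actual code: the name survives and the small word pool are made up for the demonstration, and the gray rule as written is safe only when the guess has no repeated letters, which is true of COULD.

```javascript
// Sketch of the winnowing rules for a guess with no repeated letters.
// marks[i] is "correct", "present", or "absent".
function survives(candidate, guess, marks) {
  for (let i = 0; i < guess.length; i++) {
    const letter = guess[i];
    if (marks[i] === "correct") {
      if (candidate[i] !== letter) return false;      // must match in place
    } else if (marks[i] === "present") {
      if (candidate[i] === letter) return false;      // not in this position
      if (!candidate.includes(letter)) return false;  // but somewhere else
    } else {
      if (candidate.includes(letter)) return false;   // letter is ruled out
    }
  }
  return true;
}

const marks = ["absent", "present", "absent", "correct", "absent"];
const pool = ["ATOLL", "HELLO", "KNOLL", "TROLL", "WORLD", "LOYAL", "SHALL", "CLOUD"];
const kept = pool.filter(w => survives(w, "COULD", marks));
console.log(kept);
```

Note that the filter never checks how many Ls a candidate has, so words with a second L survive, as they should.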

The list-winnowing process is highly effective. Even with a randomly chosen starter word, the first guess typically eliminates more than 90 percent of the target words, leaving only about 220 viable candidates, on average. In subsequent rounds the shrinkage is not as rapid, but it’s usually enough to identify a unique solution within the allotted six guesses.
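Here is a minimal sketch of the random player, under a few stated assumptions. The names score and playRandom are hypothetical, feedback is encoded as a string of digits (0 gray, 1 gold, 2 green), repeated letters get the usual two-pass treatment, and the target is assumed to be on the list. Winnowing is done the general way: a candidate survives if it would have drawn the same feedback for the guess just played.

```javascript
// Compact Umpire: returns five digits, 0 = gray, 1 = gold, 2 = green.
function score(guess, target) {
  const out = ["0", "0", "0", "0", "0"];
  const left = {};                       // unmatched target letters
  for (let i = 0; i < 5; i++) {
    if (guess[i] === target[i]) out[i] = "2";
    else left[target[i]] = (left[target[i]] || 0) + 1;
  }
  for (let i = 0; i < 5; i++) {
    if (out[i] === "0" && left[guess[i]] > 0) {
      out[i] = "1";
      left[guess[i]] -= 1;
    }
  }
  return out.join("");
}

// Random player: guess any surviving candidate, then keep only the
// candidates consistent with the feedback. No six-guess limit is imposed.
function playRandom(target, targets, rng = Math.random) {
  let candidates = targets.slice();
  const guesses = [];
  while (candidates.length > 0) {
    const guess = candidates[Math.floor(rng() * candidates.length)];
    guesses.push(guess);
    if (guess === target) break;
    const seen = score(guess, target);
    candidates = candidates.filter(c => score(guess, c) === seen);
  }
  return guesses;
}

const words = ["ATOLL", "HELLO", "KNOLL", "TROLL", "COULD", "BEGIN", "BENCH"];
console.log(playRandom("KNOLL", words));
```

Each wrong guess eliminates at least itself, so the loop always ends with the target as the final guess.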

I wrote the random Wordler as a kind of warmup exercise, and I didn’t expect much in the way of performance. Plucking words willy-nilly from the set of all viable candidates doesn’t sound like the shrewdest strategy. However, it works surprisingly well. The graph below shows the result of running the program on each of the 2,315 target words, with the experiment repeated 10,000 times to reduce statistical noise.

Figure 2. Guess criterion: random
Guess pool: remaining target words
Start word: randomly chosen

The average number of guesses needed to find the solution is 4.11, and only 2 percent of the trials end in failure (requiring seven or more guesses).

Incidentally, this result is quite robust, in the sense that the outcome doesn’t depend at all on the composition of the words on the target list. If you replace the official Wordle list with 2,315 strings of random letters, the graph looks the same.

The surprising strength of a random player might be taken as a sign that Wordle isn’t really a very hard game. If you are guided by a single, simple rule—play any word that hasn’t already been excluded by feedback from prior guesses—you will win most games. From another point of view the news is not so cheering: If a totally mindless strategy can often solve the puzzle in four moves, you may have to work really hard to do substantially better.

Still another lesson might be drawn from the success of the random player: The computer’s complete ignorance of English word patterns is not a major handicap and might even be an advantage. A human player’s judgment is biased by ingrained knowledge of differences in word frequency. If I see the partial solution BE _ _ _, I’m likely to think first of common words such as BEGIN and BENCH, rather than the rarer BELLE and BEZEL. In the Wordle list of potential targets, however, each of these words occurs with exactly the same frequency, namely 1/2315. A policy favoring common words over rare ones is not helpful in this game.

But a case can be made for a slightly different strategy, based on a preference not for common words but for common letters. A guess word made up of high-frequency letters should elicit more information than a guess consisting of rare letters. By definition, the high-frequency letters are more likely to be present in the target word. Even when they are absent, that fact becomes valuable knowledge. If you make JIFFY your first guess and thereby learn that the target word does not contain a J, you eliminate 27 candidates out of 2,315. Playing EDICT and learning that the target has no E rules out 1,056 words.

The table below records the number of occurrences of each letter from A to Z in all the Wordle target words, broken down according to position within the word. For example, A is the first letter of 141 target words, and Z is the last letter of four words.

Figure 3.

The same information is conveyed in the heatmap below, where lighter colors indicate more common letters.

Figure 4.

These data form the basis of another Wordle-playing algorithm. Given a list of candidate words, we compute the sum of each candidate’s letter frequencies, and select the word with the highest score. For example, the word SKUNK gets 366 points for the initial S, then 10 points for the K in the second position, and so on through the rest of the letters for an aggregate score of 366 + 10 + 165 + 182 + 113 = 836. SKUNK would win over KOALA, which has a score of 832, but lose to PIGGY (851).
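The scoring scheme can be sketched in a few lines. The function names are hypothetical, and the five-word sample stands in for the real target list, so the counts it produces are not the ones in Figure 3. Whether the table should be rebuilt from the surviving candidates on each turn is left open; this sketch builds it from whatever list it is given.

```javascript
// Count how often each letter occupies each of the five positions.
function positionCounts(words) {
  const counts = Array.from({ length: 5 }, () => ({}));
  for (const w of words) {
    for (let i = 0; i < 5; i++) {
      counts[i][w[i]] = (counts[i][w[i]] || 0) + 1;
    }
  }
  return counts;
}

// A word's score is the sum of its five position-specific letter counts.
function letterScore(word, counts) {
  let total = 0;
  for (let i = 0; i < 5; i++) total += counts[i][word[i]] || 0;
  return total;
}

// Pick the highest-scoring candidate (the earliest one wins a tie).
function bestByLetterFrequency(candidates, counts) {
  let best = candidates[0];
  for (const w of candidates) {
    if (letterScore(w, counts) > letterScore(best, counts)) best = w;
  }
  return best;
}

const sample = ["SKUNK", "SLATE", "STALE", "PIGGY", "KOALA"];
const counts = positionCounts(sample);
console.log(letterScore("SLATE", counts), bestByLetterFrequency(sample, counts));
```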

Testing the letter-frequency algorithm against all 2,315 target words yields this distribution of results:

Figure 5. Guess criterion: maximize letter frequency
Guess pool: remaining target words
Start word: determined by algorithm (SLATE)

The mean number of guesses is 3.83, noticeably better than the random-choice average of 4.11. The failure rate—words that aren’t guessed within six turns—is pushed down to 1.1 percent (28 out of 2,315).

There’s surely room for improvement in the letter-frequency algorithm. One weakness is that it treats the five positions in each word as independent variables, whereas in reality there are strong correlations between each letter and its neighbors. For example, the groups CH, SH, TH, and WH are common in English words, and Q is constantly clinging to U. Other pairs such as FH and LH are almost never seen. These attractions and repulsions between letters play a major role in human approaches to solving the Wordle puzzle (or at least they do in my approach). The letter-frequency program could be revised to take advantage of such knowledge, perhaps by tabulating the frequencies of bigrams (two-letter combinations). I have not attempted anything along these lines. Other ideas hijacked my attention.

Like other guessing games, such as Bulls and Cows, Mastermind, and Twenty Questions, Wordle is all about acquiring the information needed to unmask a hidden answer. Thus if you’re looking for guidance on playing the game, an obvious place to turn is the body of knowledge known as information theory, formulated 75 years ago by Claude Shannon.

Shannon’s most important contribution (in my opinion) was to establish that information is a quantifiable, measurable substance. The unit of measure is the bit, defined as the amount of information needed to distinguish between two equally likely outcomes. If you flip a fair coin and learn that it came up heads, you have acquired one bit of information.

This scheme is easy to apply to the game of Twenty Questions, where the questions are required to have just two possible answers—yes or no. If each answer conveys one bit of information, 20 questions yield 20 bits, which is enough to distinguish one item among $2^{20} \approx 1$ million equally likely possibilities. In general, if a collection of things has $N$ members, then the quantity of information needed to distinguish one of its members is the base-2 logarithm of $N$, written $\log_2 N$.

In Wordle we ask no more than six questions, but answers are not just yes or no. Each guess word is a query submitted to the Umpire, who answers by coloring the letter tiles. There are more than two possible answers; in fact, there are 243 of them. As a result, a Wordle guess has the potential to yield much more than one bit of information. But there’s also the possibility of getting less than one bit.

Where does that curious number 243 come from? In the Umpire’s response to a guess, each letter is assigned one of three colors, and there are five letters in all. Hence the set of all possible responses has $3^5 = 243$ members. Here they are, in all their polychrome glory:

Figure 6.

These color codes represent every feedback message you could ever possibly receive when playing Wordle. (And then some! The five codes outlined in pink can never occur. Note 2 explains why.) Each color pattern can be represented as a five-digit numeral in ternary (base-3) notation, with the digit 0 signifying gray or absent, 1 indicating gold or present, and 2 corresponding to green or correct. Because these are five-digit numbers, I’ve taken to calling them Zip codes. The all-gray 00000 code at the upper left appears in the Wordle grid when your guess has no letters in common with the target word. A solid-green 22222, at the bottom right, marks the successful conclusion of a game. In the middle of the grid is the all-gold 11111, which is reserved for “deranged anagrams”: pairs of words composed of the same letters but all in different positions, such as BOWEL and ELBOW, or BRUTE and TUBER. (Also OCEAN and CANOE, REGAL and GLARE, and NIGHT and THING.)
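The ternary encoding translates directly into JavaScript's built-in radix conversions. The helper names zipToIndex and indexToZip are made up for this sketch; the digit assignments are the ones given above, with the leftmost tile as the most significant digit.

```javascript
// Convert between a color pattern and its position in the 0..242 range.
// Digits: 0 = gray/absent, 1 = gold/present, 2 = green/correct.
function zipToIndex(zip) {
  return parseInt(zip, 3);                        // read the string as base 3
}

function indexToZip(index) {
  return index.toString(3).padStart(5, "0");      // write base 3, keep 5 digits
}

console.log(zipToIndex("00000"), zipToIndex("11111"), zipToIndex("22222"));
console.log(indexToZip(33));
```

The all-gold code lands at index 121, exactly halfway between 0 and 242, which is why it sits in the middle of the grid.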

When you play a guess word in Wordle, you can’t know in advance which of the 243 Zip codes you’ll receive as feedback, but you can know the spectrum of possibilities for that particular word. Program 2, below, displays the spectrum graphically. (Source code on GitHub.) When you enter a five-letter guess word, the program will pair it with each of the 2,315 target words and show how many of the pairings fall into each of the Zip-code categories.

Program 2.

Each of the slender bars that grow from the baseline of the chart represents a single Zip code, from all-gray at the far left, through all-gold in the middle, and on to all-green at the extreme right. The height of each bar indicates (on a logarithmic scale) how often the guess receives the corresponding color code. If you hover the mouse pointer over a bar, the five letters in the tiles at the top of the display will be colored accordingly. (The meaning of the notations that appear in the “Statistics” box will be explained below.)

(The coloring of the bars interpolates between gray, gold, and green, based on the number of tiles with the corresponding color, but ignoring the position of those tiles within the word. Thus codes that contain the same colors in a different order get the same color in the bar chart.)

The pattern of short and tall bars in Program 2 is something like an atomic spectrum for Wordle queries. Each guess word generates a unique pattern. A word like MOMMY or JIFFY yields a sparse distribution, with a few very tall bars and wide empty spaces between them. The sparsest of all is QAJAQ (a word on the arcane list that seems to be a variant spelling of KAYAK): All but 18 of the 243 Zip codes are empty. Words such as SLATE, CRANE, TRACE, and RAISE, on the other hand, produce a dense array of shorter bars, like a lush carpet of grass, where most of the categories are occupied by at least one target word.
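Computing a spectrum amounts to scoring the guess against every target and tallying the results. The sketch below is an assumption-laden stand-in for what Program 2 does: the names score and spectrumOf are hypothetical, the Umpire returns the Zip code as an integer from 0 to 242, repeated letters get the usual two-pass treatment, and the six-word target list is a toy.

```javascript
// Umpire returning the Zip code as an integer 0..242 (base 3, with the
// leftmost tile as the most significant digit).
function score(guess, target) {
  const digits = [0, 0, 0, 0, 0];
  const left = {};                       // unmatched target letters
  for (let i = 0; i < 5; i++) {
    if (guess[i] === target[i]) digits[i] = 2;
    else left[target[i]] = (left[target[i]] || 0) + 1;
  }
  for (let i = 0; i < 5; i++) {
    if (digits[i] === 0 && left[guess[i]] > 0) {
      digits[i] = 1;
      left[guess[i]] -= 1;
    }
  }
  return digits.reduce((acc, d) => acc * 3 + d, 0);
}

// Spectrum of a guess: how many targets fall into each of the 243 Zip codes.
function spectrumOf(guess, targets) {
  const spectrum = new Array(243).fill(0);
  for (const t of targets) spectrum[score(guess, t)] += 1;
  return spectrum;
}

const targets = ["ATOLL", "HELLO", "KNOLL", "TROLL", "COULD", "BEGIN"];
const spectrum = spectrumOf("COULD", targets);
console.log(spectrum[33], spectrum[242], spectrum[0]);
```

In this toy list four words share Zip code 33 (ternary 01020), so COULD cannot tell them apart; a flatter spectrum would separate them.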

Can these spectra help us solve Wordle puzzles? Indeed they can! The guiding principle is simple: Choose a guess word that has a “flat” spectrum, distributing the target words uniformly across the 243 Zip codes. The ideal is to divide the set of candidate words into 243 equal-size subsets. Then, when the Umpire’s feedback singles out one of these categories for further attention, the selected Zip code is sure to have relatively few occupants, making it easier to determine which of those occupants is the target word. In this way we maximize the expected information gain from each guess.

This advice to spread the target words broadly and thinly across the space of Zip codes may seem obvious and in no need of further justification. If you feel that way, read on. Those who need persuading (as I did) should consult Note 3.

The ideal of distributing words with perfect uniformity across the Wordle spectrum is, lamentably, unattainable. None of the 12,972 available query words achieves this goal. The spectra all have lumps and bumps and bare spots. The best we can do is choose the guess that comes closest to the ideal. But how are we to gauge this kind of closeness? What mathematical criterion should we adopt to compare those thousands of spiky spectra?

Information theory again waves its hand to volunteer an answer, promising to measure the number of bits of information one can expect to gain from any given Wordle spectrum. The tool for this task is an equation that appeared in Shannon’s 1948 paper “A Mathematical Theory of Communication.” I have taken some liberties with the form of the equation, adapting it to the peculiarities of the Wordle problem and also trying to make its meaning more transparent. For the mathematical details see Note 4.

Here’s the equation in my preferred format:

\[H_w = \sum_{\substack{i = 0\\n_i \ne 0}}^{242} \frac{n_i}{m} (\log_2 m - \log_2 n_i) .\]

$H_w$, the quantity we are computing, is the amount of information we can expect to acquire by playing word $w$ as our next guess. Shannon named this quantity entropy, in analogy with the measure of disorder in physics. A Wordle spectrum has higher entropy when the target words are more broadly dispersed among the Zip codes.

(According to an oft-told tale, Shannon adopted the term entropy at the suggestion of John von Neumann, who pointed out that “no one knows what entropy really is, so in a debate you will always have the advantage.” The letter $H$, which might be a Greek Eta, was introduced by Ludwig Boltzmann.)

On the righthand side of the equation we sum contributions from all the occupied Zip codes in $w$’s Wordle spectrum. The index $i$ on the summation symbol $\Sigma$ runs from 0 to 242 (which is 00000 to 22222 in ternary notation), enumerating all the Zip codes in the spectrum. The variable $n_i$ is the number of target words assigned to Zip code $i$, and $m$ is the total number of target words included in the spectrum. At the start of the game, $m = 2{,}315$, but after each guess it gets smaller.

Now let’s turn to the expression in parentheses. Here $\log_2 m$ is the amount of information—the number of bits—needed to distinguish the target word among all $m$ candidates. Likewise $\log_2 n_i$ is the number of bits needed to pick out a single word from among the $n_i$ words in Zip code $i$. The difference between these quantities, $\log_2 m - \log_2 n_i$, is the amount of information we gain if Zip code $i$ turns out to be the correct choice—if it harbors the target word and is therefore selected by the Umpire.

Perhaps a numerical example will help make all this clearer. If $m = 64$, we have $\log_2 m = 6$: It would take 6 bits of information to single out the target among all 64 candidates. If the target is found in Zip code $i$ and $n_i = 4$, we still need $\log_2 4 = 2$ bits of information to pin down its identity. Thus in going from $m = 64$ to $n_i = 4$ we have gained $6 - 2 = 4$ bits of information.

So much for what we stand to gain if the target turns out to live in Zip code $i$. We also have to consider the probability of that event. Intuitively, if there are more words allocated to a Zip code, the target word is more likely to be among them. The probability is simply $n_i / m$, the fraction of all $m$ words that land in code $i$. Hence $n_i / m$ and $\log_2 m - \log_2 n_i$ act in opposition. Piling more words into Zip code $i$ increases the probability of finding the target there, but it also increases the difficulty of isolating the target among the $n_i$ candidates.

Each Zip code makes a contribution to the total expected information gain; the contribution of Zip code $i$ is equal to the product of $n_i / m$ and $\log_2 m - \log_2 n_i$. Summing the contributions coming from all 243 categories yields $H_w$, the amount of information we can expect to gain from playing word $w$.

In case you find programming code more digestible than either equations or words, here is a JavaScript function for computing $H_w$. The argument spectrum is an array of 243 numbers representing counts of words assigned to all the Zip codes.

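function entropy(spectrum) {
  // Drop empty Zip codes so that we never take the logarithm of zero.
  const nzspectrum = spectrum.filter(x => x > 0);
  // m is the total number of target words in the spectrum.
  const m = nzspectrum.reduce((a, b) => a + b, 0);
  let H = 0;
  for (const n of nzspectrum) {
    H += n / m * (Math.log2(m) - Math.log2(n));
  }
  return H;
}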

In Program 2, this code calculates the entropy value labeled $H$ in the Statistics panel. The same function is invoked in Program 1 when the “Maximize entropy” algorithm is selected.

Before leaving this topic behind, it’s worth pausing to consider a few special cases. If $n_i = 1$ (i.e., there’s just a single word in Zip code $i$), then $\log_2 n_i = 0$, and the information gain is simply $\log_2 m$; we have acquired all the information we need to solve the puzzle. This result makes sense: If we have narrowed the choice down to a single word, it has to be the target. Now suppose $n_i = m$, meaning that all the candidate words have congregated in a single Zip code. Then we have $\log_2 m - \log_2 n_i = 0$, and we gain nothing by choosing word $w$. We had $m$ candidates before the guess, and we still have $m$ candidates afterward.

What if $n_i = 0$—that is, Zip code $i$ is empty? That’s trouble, because the logarithm of 0 is undefined, and attempting to calculate it in a computer program will raise an error signal or return a non-numeric result. (In JavaScript, Math.log2(0) returns -Infinity, and multiplying that by zero gives NaN.) The base-2 logarithm is defined by the equation $2^{\lambda} = N$, where $\lambda$ is the logarithm of $N$, but there is no number $\lambda$ such that $2^{\lambda} = 0$. In the entropy equation the subscript $n_i \ne 0$ excludes all empty Zip codes from the summation. In the JavaScript function the expression spectrum.filter(x => x > 0) does the same thing. This exclusion does no harm because if $n_i$ is zero, then $n_i / m$ is also zero, meaning there’s no chance that category $i$ holds the winner. (If a Zip code has no words at all, it can’t have the winning word.)

$H_w$ attains its maximum value when all the $n_i$ are equal. As mentioned above, there’s no word $w$ that achieves this ideal, but we can certainly calculate how many bits such a word would produce if it did exist. In the case of a first guess, each $n_i$ must equal $2315 / 243 \approx 9.5$, and the total information gain is $\log_2 2315 - \log_2 9.5 \approx 7.9$ bits. This is an upper bound for a Wordle first guess; it’s the most we could possibly get out of the process, even if we were allowed to play any arbitrary string of five letters as a guess. As we’ll see below, no real guess gains as much as six bits.
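As a check on that figure, the uniform case can be worked through the $H_w$ equation directly. This is a sketch under the idealized assumption that every Zip code holds exactly $m/243$ words; the label $H_{\max}$ is introduced here and is not the essay's notation.

\[H_{\max} = \sum_{i=0}^{242} \frac{m/243}{m} \left(\log_2 m - \log_2 \frac{m}{243}\right) = 243 \cdot \frac{1}{243} \cdot \log_2 243 = 5 \log_2 3 \approx 7.92 \text{ bits}.\]

The $m$ cancels, so under this idealization the ceiling is set by the number of Zip codes rather than by the length of the word list.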

These mathematical tools of information theory suggest a simple recipe for a Wordle-playing computer program. At any stage of the game the word to play is the one that elicits the most information about the target. We can estimate the information yield of each potential guess by computing its Wordle spectrum and applying the $H_w$ equation to the resulting sequence of numbers. That’s what the JavaScript code below does.

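function maxentropy(guesswordlist, targetwordlist) {
	let maxH = 0;
	let bestguessword = "";
	for (const g of guesswordlist) {            // outer loop
		// One counter per Zip code, all starting at zero.
		const spectrum = new Array(243).fill(0);
		for (const t of targetwordlist) {       // inner loop
			// score() is assumed to return the Zip code as an integer, 0 to 242.
			const zipcode = score(g, t);
			spectrum[zipcode] += 1;
		}
		const H = entropy(spectrum);
		if (H > maxH) {
			maxH = H;
			bestguessword = g;
		}
	}
	return bestguessword;
}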

In this function the outer loop visits all the guess words; then for each of these words the inner loop sifts through all the potential target words, constructing a spectrum. When the spectrum for guess word g is complete, the entropy procedure computes the information H to be gained by playing word g. The maxH and bestguessword variables keep track of the best results seen so far, and ultimately bestguessword is returned as the result of the function.

This procedure can be applied at any stage of the Wordling process, from the first guess to the last. When choosing the first guess—the word to be entered in the blank grid at the start of the game—all 2,315 common words are equally likely potential targets. We also have 12,972 guess words to consider, drawn from both the common and arcane lists. Calculating the information gain for each such starter word reveals that the highest score goes to SOARE, at 5.886 bits. (SOARE is apparently either a variant spelling of SORREL, a reddish-brown color, or an obsolete British term for a young hawk.) ROATE, a variant of ROTE, and RAISE are not far behind. At the bottom of the list is the notorious QAJAQ, providing just 1.892 bits of information.

Adopting SOARE as a starter word, we can then play complete games of Wordle with this algorithm, recording the number of guesses required to find all 2,315 target words. The result is a substantial improvement over the random and the letter-frequency methods.

Figure 7. Guess criterion: maximize entropy
Guess pool: common and arcane
Start word: SOARE

The average number of guesses per game is down to about 3.5, and almost 96 percent of all games are won in either three or four guesses. Only one target word requires six guesses, and there are no lost games (those requiring seven or more guesses). As a device for Wordling, Shannon’s information theory is quite a success!

Shannon’s entropy equation was explicitly designed for the function it performs here—finding the distribution with maximum entropy. But if we consider the task more broadly as looking for the most widely dispersed and most nearly uniform distribution, other approaches come to mind. For example, a statistician might suggest variance or standard deviation as an alternative to Shannon entropy. The standard deviation is defined as:

\[\sigma = \sqrt{\frac{\sum_i (n_i - \mu)^2}{N}},\]

where $\mu$ is the average of the $n_i$ values and $N$ is the number of values. In other words, we are measuring how far the individual elements of the spectrum differ from the average value. If the target words were distributed uniformly across all the Zip codes, every $n_i$ would be equal to $\mu$, and the standard deviation $\sigma$ would be zero. A large $\sigma$ indicates that many $n_i$ differ greatly from the mean; some Zip codes must be densely populated and others empty or nearly so. Our goal is to find the guess word whose spectrum has the smallest possible standard deviation.

In the Wordling program, it’s an easy matter to substitute minimum standard deviation for maximum entropy as the criterion for choosing guess words. In this case we don’t exclude empty Zip codes. Doing so would badly skew the results. A spectrum with all words crowded into a single Zip code would have $\sigma = 0$, making it seem the most—rather than the least—desirable configuration. The JavaScript code looks like this:

function stddev(spectrum) {
  // Average over all 243 Zip codes, empty ones included.
  const mu = spectrum.reduce((a, b) => a + b, 0) / 243;
  const diffs = spectrum.map(x => x - mu);
  const variance = diffs.reduce((acc, d) => acc + d * d, 0) / 243;
  return Math.sqrt(variance);
}

In Program 1, you can see the standard deviation algorithm in action by selecting the button labeled “Minimum std dev.” In Program 2, standard deviation values are labeled $\sigma$ in the Statistics panel.

Testing the algorithm with all possible combinations of a starting word and a target word reveals that the smallest standard deviation is 22.02, and this figure is attained only by the spectrum of the word ROATE. Not far behind are RAISE, RAILE, and SOARE. At the bottom of the list, the worst choice by this criterion is IMMIX, with $\sigma = 95.5$.

Using ROATE as the steady starter word, I ran a full set of complete games, surveying the standard-deviation program’s performance across all 2,315 target words. I was surprised at the outcome. Although the chart looks somewhat different—more 4s, fewer 3s—the average number of guesses came within a whisker of equalling the max-entropy result: 3.54 vs. 3.53. Of particular note, there are fewer words requiring five guesses, and a few more are solved with just two guesses.

Figure 8. Guess criterion: minimize standard deviation
Guess pool: common and arcane
Start word: ROATE

Why was I surprised by the strength of the standard-deviation algorithm? Well, as I said, Shannon’s $H$ equation is a tool designed specifically for this job. Its mathematical underpinnings assert that no other rule can extract information with greater efficiency. That property seems like it ought to promise superior performance in Wordle. Standard deviation, on the other hand, is adapted to problems that take a different form. In particular, it is meant to measure dispersion in distributions with a normal or Gaussian shape. There’s no obvious reason to expect Wordle spectra to follow the normal law. Nevertheless, the $\sigma$ rule is just as successful in choosing winners.

Following up on this hint that the max-entropy algorithm is not the only Wordle wiz, I was inspired to try a rule even simpler than standard deviation. In this algorithm, which I call “max scatter,” we choose the guess word whose spectrum has the largest number of occupied Zip codes. In other words, we count the bars that sprout up in Program 2, but we ignore their height. In the Statistics panel of Program 2, the max-scatter results are designated by the letter $\chi$ (Greek chi), which I chose by analogy with the indicator function of set theory. In Program 1, choose “Maximize scatter” to Wordle this way.
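The scatter measure needs even less code than the other two. In this sketch the names scatter and maxScatter are hypothetical, the object that maps guess words to spectra is an assumed layout, and the two toy spectra are invented purely to exercise the functions.

```javascript
// Scatter (chi): the number of Zip codes holding at least one target word.
function scatter(spectrum) {
  return spectrum.filter(n => n > 0).length;
}

// Choose the guess whose spectrum occupies the most Zip codes.
// `spectra` maps each guess word to its 243-entry spectrum (assumed layout).
function maxScatter(spectra) {
  let best = "";
  let bestChi = 0;
  for (const [word, spectrum] of Object.entries(spectra)) {
    const chi = scatter(spectrum);
    if (chi > bestChi) {
      bestChi = chi;
      best = word;
    }
  }
  return best;
}

// Two invented spectra, each covering six target words.
const lumpy = new Array(243).fill(0);
lumpy[0] = 5; lumpy[242] = 1;
const spread = new Array(243).fill(0);
spread[0] = 2; spread[9] = 2; spread[33] = 1; spread[242] = 1;

console.log(scatter(lumpy), scatter(spread));
console.log(maxScatter({ GUESS_A: lumpy, GUESS_B: spread }));
```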

If we adopt the $\chi$ standard, the best first guess in Wordle is TRACE, which scatters target words over 150 of the 243 Zip codes. Other strong choices are CRATE and SALET (148 codes each) and SLATE and REAST (147). The bottom of the heap is good ole QAJAQ, with 18.

Using TRACE as the start word and averaging over all target words, the performance of the scatter algorithm actually exceeds that of the max-entropy program. The mean number of guesses is 3.49. There are no failures requiring 7+ guesses, and only one target word requires six guesses.

Figure 9. Guess criterion: maximize scatter
Guess pool: common and arcane
Start word: TRACE

What I find most noteworthy about these results is how closely the programs are matched, as measured by the average number of guesses needed to finish a game. It really looks as if the three criteria are all equally good, and it’s a matter of indifference which one you choose. This (tentative) conclusion is supported by another series of experiments. Instead of starting every game with the word that appears best for each algorithm, I tried generating random pairs of start words and target words, and measured each program’s performance for 10,000 such pairings. Having traded good start words for randomly selected ones, it’s no surprise that performance is somewhat weaker across the board, with the average number of guesses hovering near 3.7 instead of 3.5. But all three algorithms continue to be closely aligned, with figures for average outcome within 1 percent. And in this case it’s not just the averages that line up; the three graphs look very similar, with a tall peak at four guesses per game.

Figure 10.

As I looked at these results, it occurred to me that the programs might be so nearly interchangeable for a trivial reason: because they are playing identical games most of the time. Perhaps the three criteria are similar enough that they often settle on the same sequence of guess words to solve a given puzzle. This does happen: There are word pairs that elicit exactly the same response from all three programs. It’s not a rare occurrence. But there are also plenty of examples like the trio of game grids in Figure 11, where each program discovered a unique pathway from FALSE to VOWEL.

Figure 11.

A few further experiments show that the three programs arrive at three distinct solutions in about 35 percent of random games; in the other 65 percent, at least two of the three solutions are identical. In 20 percent of the cases all three are alike. Figure 12 shows the relevant statistics for a sample of 5,000 games with random pairings of start and target words. (The notation $H =\sigma =\chi$ refers to outcomes in which all programs yield the same result. In $H \ne\sigma \ne\chi$ the three solutions are all distinct. The other three bars count individual pairwise matches.) The result for $\sigma = \chi$ is a curiosity I don’t understand. It seems those two programs almost never agree unless $H$ also concurs.

Figure 12.

I puzzled over these observations for some time. If the algorithms discover wholly different paths through the maze of words, why are those paths so often the same length? I now have a clue to an answer, but it may not be the whole story, and I would welcome alternative explanations.

My mental model of what goes on inside a Wordling program runs like this: The program computes the spectrum of each word to be considered as a potential guess, then computes some function—$H$, $\sigma$, or $\chi$—that reduces the spectrum to a single number. The number estimates the expected quality or efficiency of the word if it is taken as the next guess in the game. Finally we choose the word that either maximizes or minimizes this figure of merit (the word with largest value of $H$ or $\chi$, or the smallest value of $\sigma$).

So far so good, but there’s an unstated assumption in that scheme: I take it for granted that evaluating the spectra will always yield a single, unique extreme value of $H$, $\sigma$, or $\chi$. What happens if two words are tied for first place? One could argue, of course, that if the words earn the same score, they must be equally good choices, and we should pick one of them at random or by some arbitrary rule. Even if there are three or four or a dozen co-leaders, the same reasoning should apply. But when there are two thousand words all tied for first place, I’m not so sure.

Can such massive traffic jams at the finish line actually happen in a real game of Wordle? At first I didn’t even consider the possibility. When I rated all possible starting words for the Shannon max-entropy algorithm, the ranking turned out to be a total order: In the list of 12,972 words there were no ties—no pairs of words whose spectra have the same $H$ value. Hence there’s no ambiguity about which word is best, as measured by this criterion.

But this analysis applies only to the opening play—the choice of a first guess, when all the common words are equally likely to be the target. In the endgame the situation is totally different. Suppose the list of 2,315 common words has been whittled down to just five viable candidates for Wordle-of-the-day. When the five target words are sorted into Zip-code categories and the entropy of these patterns is calculated (excluding empty Zip codes), there are only seven possible outcomes, as shown in Figure 13. The patterns correspond to the seven ways of partitioning the number 5 (namely 5, 4+1, 3+2, 3+1+1, 2+2+1, 2+1+1+1, and 1+1+1+1+1).

Figure 13. The bars in these graphs could be rearranged in various ways, but the $H$, $\sigma$, and $\chi$ measures give the same result for all permutations.

In this circumstance, the 12,972 guess words cannot all have unique values of $H$. On the contrary, with only seven distinct values to go around, there must be thousands of tie scores in the ranking of the words. Thus the max-entropy algorithm cannot pick a unique winner; instead it assembles a large class of words that, from the program’s point of view, are all equally good choices. Which of those words ultimately becomes the next guess is more or less arbitrary. For the most part, my programs pick the one that comes first in alphabetical order.
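The counting argument can be checked directly. The Python sketch below enumerates the ways five candidates can be spread over nonempty Zip codes and computes $H$ for each arrangement. The function names `partitions` and `entropy` are illustrative and are not taken from the programs described here.

```python
from math import log2

def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def entropy(sizes):
    """Shannon entropy, in bits, of a spectrum given as the sizes of its nonempty classes."""
    total = sum(sizes)
    return sum((s / total) * log2(total / s) for s in sizes)

# Each partition of 5 is one possible shape for the spectrum of a guess word.
for p in partitions(5):
    print(p, round(entropy(p), 3))
```

The seven partitions give seven distinct values of $H$, so with 12,972 guess words to rank, ties are unavoidable.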

The same arguments apply to the minimum-standard-deviation algorithm. As for the max-scatter function, that has numerous tied scores even when the number of candidates is large. Because the variable $\chi$ takes on integer values in the range from 1 to 243 (and in practice never strays outside the narrower range 18 to 150), there’s no avoiding an abundance of ties.

The presence of all these co-leaders offers an innocent explanation of how the three algorithms might arrive at solutions that are different in detail but generally equal in quality. Although the programs pick different words, those words come from the same equivalence class, and so they yield equally good outcomes.

But a doubt persists. Can it really be true that hundreds or thousands of words are all equally good guesses in some Wordle positions? Can you choose any one of those words and expect to finish the game in the same number of plays?

Let’s go back to Figure 11, where the maximum-entropy program follows the opening play of FALSE with DETER and then BINGO. Some digging through the entrails of the program reveals that BINGO is not the uniquely favored guess at this point in the progress of the game; VENOM, VINYL, and VIXEN are tied with BINGO at $H = 3.122$ bits. The program chooses BINGO simply because it comes first alphabetically. As it happens, the choice is an unfortunate one. Any of the other three words would have concluded the game more quickly, in four guesses rather than five.

Does that mean we now have unequivocal evidence that the program is suboptimal? Not really. At the stage of the game where BINGO was chosen, there were still 10 viable possibilities for the target word. The true target might have been LIBEL, for example, in which case BINGO would have been superior to VENOM or VIXEN.

Human players see Wordle as a word game. What else could it be? To solve the puzzle you scour the dusty corners of your vocabulary, searching for words that satisfy certain constraints or fit a given template. You look for patterns of vowels and consonants and other curious aspects of English orthography.

For the algorithmic player, on the other hand, Wordle is a numbers game. What counts is maximizing or minimizing some mathematical function, such as entropy or standard deviation. The words and letters all but disappear.

The link between words and numbers is the Umpire’s scoring rule, with the spectrum of Zip codes that comes out of it. Every possible combination of a guess word and a target word gets reduced to a five-digit ternary number. Instead of computing each of these numbers as the need arises, we can precompute the entire set of Zip codes and store it in a matrix. The content of the matrix is determined by the letters of the words we began with, but once all the numbers have been filled in, we can dispense with the words themselves. Operations on the matrix depend only on the numeric indices of the columns and rows. We can retrieve any Zip code by a simple table lookup, without having to think about the coloring of letter tiles.

In exploring this matrix, let’s set aside the arcana for the time being and work only with common words. With 2,315 possible guesses and the same number of possible targets, we have $2{,}315^2 \approx 5.4$ million pairings of a guess and a target. Each such pairing gets a Zip code, which can be represented by a decimal integer in the range from 0 to 242. Because these small numbers fit in a single byte of computer memory, the full matrix occupies about five-and-a-half megabytes.
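As a rough illustration of the precomputation, here is a Python sketch. The scorer follows the refined rule described in Note 2, and the digits are read with the leftmost tile as the most significant, which is consistent with 10010 (base 3) being 84. The names `score_guess`, `zip_code`, and `build_matrix`, and the three-word list, are assumptions made for the example.

```python
def score_guess(guess, target):
    """Tile scores (2 = green, 1 = gold, 0 = gray) under the refined rule:
    greens are assigned first, then golds from left to right, and each
    target letter can be matched at most once."""
    score = [0] * 5
    used = [False] * 5                  # target letters already matched
    for i in range(5):
        if guess[i] == target[i]:
            score[i] = 2
            used[i] = True
    for j in range(5):
        if score[j] == 2:
            continue
        for k in range(5):
            if not used[k] and guess[j] == target[k]:
                score[j] = 1
                used[k] = True
                break
    return score

def zip_code(score):
    """Read five ternary digits, leftmost most significant, as an integer 0..242."""
    n = 0
    for digit in score:
        n = 3 * n + digit
    return n

def build_matrix(guesses, targets):
    """One row per target, one column per guess; each entry fits in one byte."""
    return [bytes(zip_code(score_guess(g, t)) for g in guesses) for t in targets]

words = ["FIRST", "SLATE", "TROUT"]     # toy list in alphabetical order
matrix = build_matrix(words, words)
print(matrix[0][1])                     # guess SLATE against target FIRST
```

With the full list of 2,315 common words the same construction yields 2,315 rows of 2,315 bytes each.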

Figure 14 is a graphic representation of the matrix. Each column corresponds to a guess word, and each row to a target word. The words are arranged in alphabetical order from left to right and top to bottom. The matrix element where a column intersects a row holds the Zip code for that combination. The color scheme is the same as in Program 2. The main diagonal (upper left to lower right) is a bright green stripe because each word when played against itself gets a score of 22222, equivalent to five green tiles. The blocks of lighter green all along the diagonal mark regions where guess words and target words share the same first letter and hence have scores with at least one green tile. The largest of these blocks is for words beginning with the letter S. A few other blocks are quite tiny, reflecting the scarcity of words beginning with J, K, Q, Y, and Z. A box for X is lacking entirely; the Wordle common list has no words beginning with X.

Figure 14.

Gazing deeply into the tweedy texture of Figure 14 reveals other curious structures, but some features are misleading. The matrix appears to be symmetric across the main diagonal: The point at (a, b) always has the same color as the point at (b, a). But the symmetry is an artifact of the graphic presentation, where colors are assigned based only on the number of gray, gold, and green tiles, ignoring their positions within a word. The underlying mathematical matrix is not symmetric.

I first built this matrix as a mere optimization, a way to avoid continually recomputing the same Zip codes in the course of a long series of Wordle games. But I soon realized that the matrix is more than a technical speed boost. It encapsulates pretty much everything one might want to know about the entire game of Wordle.

Figure 15 presents a step-by-step account of how matrix methods lead to a Wordle solution. In the first stage (upper left) the player submits an initial guess of SLATE, which singles out column 1778 in the matrix (1778 being the position of SLATE in the alphabetized list of common words). The second stage (upper right) reveals the Umpire’s feedback, coloring the tiles as follows: gold S, gray L, gray A, gold T, gray E. To the human solver this message would mean: “Look for words that have an S (but not in front) and a T (but not in fourth position).” To the computer program the message says: “Look for rows in the matrix whose 1778th entry is equal to 10010 (base 3) or 84 (base 10).” It turns out there are 24 rows meeting this condition, which are highlighted in blue. (Some lines are too closely spaced to be distinguished.) I have labeled the highlighted rows with the words they represent, but the labels are for human consumption only; the program has no need of them.

Figure 15.

In the third panel (lower left), the program has chosen a second guess word, TROUT, which earns the feedback gray T, gold R, gray O, gray U, green T. I should mention that this is not a word I would have considered playing at this point in the game. The O and U make sense, because the first guess eliminated A and E and left us in need of a vowel; but the presence of two Ts seems wasteful. We already know the target word has a T, and pinning down where it is seems less important than identifying the target word’s other constituent letters. I might have tried PROUD here. Yet TROUT is a brilliant move! It eliminates 7 words that start with T, 12 words that don’t have an R, 17 words that have either an O or a U or both, and 4 words that don’t end in T. In the end only one word remains in contention: FIRST.

All this analysis of letter positions goes on only in my mind, not in the algorithm. The computational process is simpler. The guess TROUT designates column 2118 of the matrix. Among the 24 rows identified by the first guess, SLATE, only row 743 has the Zip code value 01002 (or 29 decimal) at its intersection with column 2118. Row 743 is the row for FIRST. Because it is the only remaining viable candidate, it must be the target word, and this is confirmed when it is played as the third guess (lower right).
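The purely numeric character of this step can be sketched in a few lines of Python. The helper names `to_ternary` and `surviving_rows` are hypothetical, and the small matrix is made-up toy data standing in for the real 2,315-row table.

```python
def to_ternary(code):
    """Five-digit base-3 string for a Zip code in the range 0..242."""
    digits = []
    for _ in range(5):
        code, digit = divmod(code, 3)
        digits.append(str(digit))
    return "".join(reversed(digits))

def surviving_rows(matrix, rows, column, feedback):
    """Rows (targets) whose entry in the guess column equals the Umpire's feedback."""
    return [r for r in rows if matrix[r][column] == feedback]

# Toy matrix: three target rows, two guess columns (values invented for the demo).
toy = [
    [84, 29],
    [84, 11],
    [3, 29],
]

rows = [0, 1, 2]
rows = surviving_rows(toy, rows, column=0, feedback=84)   # after the first guess
rows = surviving_rows(toy, rows, column=1, feedback=29)   # after the second guess
print(to_ternary(84), to_ternary(29), rows)
```

Nothing in the filtering step refers to letters or words; it compares small integers at given row and column indices.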

There’s something disconcerting—maybe even uncanny—about a program that so featly wins a word game while ignoring the words. Indeed, you can replace all the words in Wordle with utter gibberish—with random five-letter strings like SCVMI, AHKZB, and BOZPA—and the algorithm will go on working just the same. It has to work a little harder—the average number of guesses is near 4—but a human solver would find the task almost impossible.

The square matrix of Figure 14 includes only the common words that serve as both guesses and targets in Wordle. Including the arcane words expands the matrix to 12,972 columns, with a little more than 30 million elements. At low resolution this wide matrix looks like this:

Figure 16.

If you’d like to see all 30 million pixels up close and personal, download the 90-megabyte TIFF image.

Commentators on Wordle have given much attention to the choice of a first guess: the start word, the sequence of letters you type when facing a blank grid. At this point in the game you have no clues of any kind about the composition of the target word; all you know is that it’s one of 2,315 candidates. Because this list of candidates is the same in every game, it seems there might be one universal best starter word—an opening play that will lead to the smallest average number of guesses, when the average is taken over all the target words. Furthermore, because 2,315 isn’t a terrifyingly large number, a brute-force search for that word might be within the capacity of a less-than-super computer.

Start-word suggestions from players and pundits are a mixed lot. A website called Polygon offers some whimsical ideas, such as FARTS and OUIJA. GameRant also has oddball options, including JUMBO and ZAXES. Tyler Glaiel offers better advice, reaching deep into the arcana to come up with SOARE and ROATE. Grant Sanderson skillfully deduces that CRANE is the best starter, then takes it all back.

For each of the algorithms I have mentioned above (except the random one) the algorithm itself can be pressed into service to suggest a starter. For example, if we apply the max-entropy algorithm to the entire set of potential guess words, it picks out the one whose spectrum has the highest $H$ value. As I’ve already noted, that word is SOARE, with $H = 5.886$. Table 1 gives the 15 highest-scoring choices, based on this kind of analysis, for the max-entropy, min–standard deviation and max-scatter algorithms.
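A ranking of this kind can be sketched in Python directly from the Zip-code matrix. The sketch assumes that $\chi$ counts the occupied Zip codes in a spectrum and that ties go to the lower column index, which corresponds to alphabetical order when the columns are alphabetized. The function names and the toy matrix (invented values, not real Wordle scores) are illustrative only, and $\sigma$ is left out.

```python
from collections import Counter
from math import log2

def column_measures(matrix, rows, col):
    """Entropy H (bits) and scatter chi of one guess column over the given target rows."""
    spectrum = Counter(matrix[r][col] for r in rows)
    n = len(rows)
    h = sum((count / n) * log2(n / count) for count in spectrum.values())
    chi = len(spectrum)
    return h, chi

def rank_by_entropy(matrix, rows):
    """Column indices ordered from highest H to lowest; ties go to the lower index."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols),
                  key=lambda c: (-column_measures(matrix, rows, c)[0], c))

# Toy matrix: four target rows, three guess columns.
toy = [
    [84, 3, 0],
    [84, 29, 0],
    [3, 81, 0],
    [3, 242, 0],
]
rows = [0, 1, 2, 3]

for col in rank_by_entropy(toy, rows):
    print(col, column_measures(toy, rows, col))
```

Column 1 separates all four targets and ranks first; column 2 gives the same code for every target and conveys no information.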

First-Move Starter-Word Rankings

Table 1.

Max Entropy
SOARE	5.886
ROATE	5.883
RAISE	5.878
RAILE	5.866
REAST	5.865
SLATE	5.858
CRATE	5.835
SALET	5.835
IRATE	5.831
TRACE	5.831
ARISE	5.821
ORATE	5.817
STARE	5.807
CARTE	5.795
RAINE	5.787

Min Std Dev
ROATE	22.02
RAISE	22.14
RAILE	22.22
SOARE	22.42
ARISE	22.72
IRATE	22.73
ORATE	22.76
ARIEL	23.05
AROSE	23.20
RAINE	23.41
ARTEL	23.50
TALER	23.55
RATEL	23.97
AESIR	23.98
ARLES	23.98

Max Scatter
TRACE	150
CRATE	148
SALET	148
SLATE	147
REAST	147
PARSE	146
CARTE	146
CARET	145
PEART	145
CARLE	144
CRANE	142
STALE	142
EARST	142
HEART	141
REIST	141

Perusing these lists reveals that they all differ in detail, but many of the same words appear on two or more lists, and certain patterns turn up everywhere. A disproportionate share of the words have the letters A, E, R, S, and T; and half the alphabet never appears in any of the lists.

Figure 17 looks at the distribution of starter quality ratings across the entire range of potential guess words, both common and arcane. The 12,972 starters have been sorted from best to worst, as measured by the max-entropy algorithm. The plot gives the number of bits of information gained by playing each of the words as the initial guess, in each case averaging over all 2,315 target words.

Figure 17.

The peculiar shape of the curve tells us something important about Wordling strategies. A small subset of words in the upper left corner of the graph make exceptionally good starters. In the first round of play they elicit almost six bits of information, which is half of what’s needed to finish the game. At the other end of the curve, a slightly larger cohort of words produce really awful results as starters, plunging down to the 1.8 bits of QAJAQ. In between these extremes, the slope of the curve is gradual, and there are roughly 10,000 words that don’t differ greatly in their performance.

These results might be taken as the last word on first words, but I would be cautious about that. The methodology makes an implicit “greedy” assumption: that the strongest first move will always lead to the best outcome in the full game. It’s rather like assuming that the tennis player with the best serve will always win the match. Experience suggests otherwise. Although a strong start is usually an advantage, it’s no guarantee of victory.

We can test the greedy assumption in a straightforward if somewhat laborious way: For each proposed starter word, we run a complete set of 2,315 full games—one for each of the target words—and we keep track of the average number of guesses needed to complete a game. Playing 2,315 games takes a few minutes for each algorithm; doing that for 12,972 starter words exceeds the limits of my patience. But I have compiled full-game results for 30 starter words, all of them drawn from near the top of the first-round rankings.

Table 2 gives the top-15 results for each of the algorithms. Comparing these lists with those of Table 1 reveals that first-round supremacy is not in fact a good predictor of full-game excellence. None of the words leading the pack in the first-move results remain at the head of the list when we measure the outcomes of complete games.

Full-Game Starter-Word Rankings

Table 2.

Max Entropy
REAST	3.487
SALET	3.492
TRACE	3.494
SLATE	3.495
CRANE	3.495
CRATE	3.499
SLANE	3.499
CARTE	3.502
CARLE	3.503
STARE	3.504
CARET	3.505
EARST	3.511
SNARE	3.511
STALE	3.512
TASER	3.514

Min Std Dev
TRACE	3.498
CRATE	3.503
REAST	3.503
SALET	3.508
SLATE	3.508
SLANE	3.511
CARTE	3.512
CARLE	3.516
CRANE	3.517
CARSE	3.523
STALE	3.524
CARET	3.524
STARE	3.525
EARST	3.527
SOREL	3.528

Max Scatter
SALET	3.428
REAST	3.434
CRANE	3.434
SLATE	3.434
CRATE	3.435
TRACE	3.435
CARLE	3.437
SLANE	3.438
CARTE	3.444
STALE	3.450
TASER	3.450
CARET	3.451
EARST	3.452
CARSE	3.453
STARE	3.454

What’s the lesson here? Do I recommend that when you get up in the morning to face your daily Wordle, you always start the game with REAST or TRACE or SALET or one of the other words near the top of these lists? That’s not bad advice, but I’m not sure it’s the best advice. One problem is that each of these algorithms has its own favored list of starting words. Your own personal Wordling algorithm—whatever it may be—might respond best to some other, idiosyncratic, set of starters.

Moreover, my spouse, who Wordles and Quordles and Octordles and Absurdles, reminds me gently that it’s all a game, meant to be fun, and some people may find that playing the same word day after day gets boring.

Figure 18 presents the various algorithms at their shiny best, each one using the starter word that brings out its best performance. For comparison I’ve also included my own Wordling record, based on 126 games I’ve played since January. I’m proud to say that I Wordle better than a random number generator.

Figure 18.

I turn now from the opening move to the endgame, which I find the most interesting part of Wordle—but also, often, the most frustrating. It seems reasonable to say that the endgame begins when the list of viable targets has been narrowed down to a handful, or when most of the letters in the target are known, and only one or two letters remain to be filled in.

Occasionally you might find yourself entering the endgame on the very first move. There’s the happy day when your first guess comes up all green—an event I have yet to experience. Or you might have a close call, such as playing the starter word EIGHT and getting this feedback: a gray E followed by four green tiles. Having scored four green letters, it looks like you’ve got an easy win. With high hopes, you enter a second guess of FIGHT, but again you get the same four greens and one gray. So you type out LIGHT next, and then MIGHT. After two more attempts, you are left with this disappointing game board:

Figure 19.

What seemed like a sure win has turned into a wretched loss. You have used up your six turns, but you’ve not found the Wordle-of-the-day. Indeed, there are still three candidate targets yet to be tried: SIGHT, TIGHT, and WIGHT.

The _ IGHT words are by no means the only troublemakers. Similar families of words match the patterns _ ATCH, _ OUGH, _ OUND, _ ASTE and _ AUNT. You can get into even deeper endgame woes when an initial guess yields three green tiles. For example, 12 words share the template S_ _ ER, 25 match _ O_ ER, and 29 are consistent with _ A_ ER.
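Finding such families is a simple pattern-matching exercise. The Python sketch below treats each underscore in a template as a wildcard for one letter. The function name `family` is hypothetical, and the word list is a small excerpt chosen for the demonstration, not the full common-word list.

```python
import re

def family(template, words):
    """Words matching a template in which '_' stands for any single letter.
    Spaces in the template are ignored, so '_ IGHT' and '_IGHT' are equivalent."""
    pattern = re.compile(template.replace(" ", "").replace("_", "[A-Z]"))
    return [w for w in words if pattern.fullmatch(w)]

sample = ["EIGHT", "FIGHT", "FOYER", "LIGHT", "MIGHT", "NIGHT", "RIGHT",
          "SIGHT", "TIGHT", "VAUNT", "WASTE", "WIGHT"]

print(family("_ IGHT", sample))
print(family("_ O_ ER", sample))
```

Run against a complete word list, the same function would report the size of each family.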

These sets of words are challenging not only for the human solver but also, under some circumstances, for Wordling algorithms. In Program 1, choose the word list “Candidates only” and then try solving the puzzle for target words such as FOYER, WASTE, or VAUNT. Depending on your starter word, you are likely to see the program struggle through several guesses, and it may fail to find the answer within six tries.

The “Candidates only” setting requires the program to choose each guess from the set of words that have not yet been excluded from consideration as the possible target. For example, if feedback from an earlier guess has revealed that the target ends in T and has an I, then every subsequent guess must end in T and have an I. (Restricting guesses to candidates only is similar to the Wordle app’s “hard mode” setting, but a little stricter.)

Compelling the player to choose guesses that might be winners doesn’t seem like much of a hardship or handicap. However, trying to score a goal with every play is seldom the best policy. Other guesses, although they can’t possibly be winners, may yield more information.

An experienced and wordly-wise human Wordler, on seeing that feedback of one gray tile followed by four greens, would know better than to play for an immediate win. The prudent strategy is to play a word that promises to reveal the identity of the one missing letter. Here’s an example of how that works.

Figure 20.

At left the second-guess FLOWN detects the presence of a W, which means the target can only be WIGHT. In the middle, FLOWN reveals only absences, but the further guess MARSH finds an S, which implies the target must be SIGHT. At right, the second and third guesses have managed to eliminate F, L, W, N, M, R, and S as initial letters, and all that’s left is the T of TIGHT.

This virtuoso performance is not the work of some International Grandmaster of Wordling. It is produced by the max-entropy algorithm, when the program is allowed a wider choice of potential guesses. The standard-deviation and max-scatter algorithms yield identical results. There is no special logic built into any of these programs for deciding when to play to win and when to hunker down and gather more information. It all comes out of the Zip code spectrum: FLOWN and MARSH are the words that maximize $H$ and $\chi$, and that minimize $\sigma$. And yet, when you watch the game unfold, it looks mysterious or magical.

The cautious strategy of accumulating intelligence before committing to a line of play yields better results on average, but it comes with a price. All of the best algorithms achieve their strong scores by reducing the number of games that linger for five or six rounds of guessing. However, those algorithms also reduce the number of two-guess games, an effect that has to be counted as collateral damage. Two-guess triumphs make up less than 3 percent of the games played by the Zip code–based programs. Contrast that with the letter-frequency algorithm: In most respects it is quite mediocre, but it wins more than 6 percent of its games in two guesses. And among my personal games, 7 percent are two-guess victories. (I don’t say this to brag; what it suggests is that my style of play is a tad reckless.)

The Zip code–based programs would have even fewer two-guess wins without a heuristic that improves performance in a small fraction of cases. Whenever the list of candidate target words has dwindled down to a length of two, there’s no point in seeking more information to distinguish between the two remaining words. Suppose that after the first guess you’re left with the words SWEAT and SWEPT as the only candidates. For your second guess you could play a word such as ADEPT, where either the A or the P would light up to indicate which candidate is the target. You would then have a guaranteed win in three rounds. But if you simply played one of the candidate words as your second guess, you would have a 50 percent chance of winning in two rounds, and otherwise would finish in three, for an average score of 2.5.

This heuristic is already implemented in the programs discussed above. It makes a noticeable difference. Removing it from the max-entropy program drops the number of two-guess games from 51 down to 35 (out of 2,315).

Can we go further with this idea? Suppose there are three candidates left as you’re preparing for the second round of guessing. The cautious, information-gathering strategy would bring consistent victory on the third guess. Playing one of the candidates leads to a game that lasts for two, three, or four turns, each with probability one-third, so the average is again three guesses. The choice appears to be neutral. In practice, playing one of the candidates brings a tiny improvement in average score—too tiny to be worth the bother.
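The averages quoted for two and three candidates can be reproduced with a short Python calculation. It assumes the candidates are equally likely and that an information-gathering probe, when played, separates all of them. Both function names are hypothetical.

```python
from fractions import Fraction

def average_if_playing_candidates(k, turn):
    """Mean finishing turn when k equally likely candidates are guessed
    one after another, beginning with guess number `turn`."""
    return Fraction(sum(turn + i for i in range(k)), k)

def finish_if_probing_first(turn):
    """Finishing turn when guess number `turn` is a non-candidate probe that
    is assumed to distinguish every remaining candidate."""
    return turn + 1

for k in (2, 3):
    print(k, average_if_playing_candidates(k, 2), finish_if_probing_first(2))
```

With two candidates the direct approach averages 5/2 against a guaranteed 3; with three candidates both approaches come to 3.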

Another optimization says you should always pick one of the remaining candidates for your sixth guess. Gathering additional information is pointless in this circumstance, because you’ll never have a chance to use it. However, the better algorithms almost never reach the sixth guess, so this measure has no payoff in practice.

Apart from minor tricks and tweaks like these, is there any prospect of building significantly better Wordling programs? I have no doubt that improvement is possible, even though all my own attempts to get better results have failed miserably.

Getting to an average performance of 3.5 guesses per game seems fairly easy; getting much beyond that level may require new ideas. My impression is that existing methods work well for choosing the first guess and perhaps the second, but are less effective in closing out the endgame. When the number of candidates is small, the Zip code–based algorithms cannot identify a single best next guess; they merely divide the possibilities into a few large classes of better and worse guesses. We need finer means of discrimination. We need tiebreakers.

I’ll briefly mention two of my failed experiments. I thought I would try going beyond the Zip code analysis and computing for each combination of a guess word and a potential target word how much the choice would shrink the list of candidates. After all, the point of the game is to shrink that list down to a single word. But the plague of multitudinous ties afflicts this algorithm too. Besides, it’s computationally costly.

Another idea was to bias the ranking of the Zip code spectra, favoring codes that have more gold and green letter tiles, on the hypothesis that we learn more when a letter is present or correct. The hypothesis is disproved! Even tiny amounts of bias are detrimental.

My focus has been on reducing the average number of guesses, but maybe there are other goals worth pursuing. For example, can we devise an algorithm that will solve every Wordle with no more than four guesses? It’s not such a distant prospect. Already there are algorithms that exceed four guesses only in about 2 percent of the cases.

Perhaps progress will come from another quarter. I’ve been expecting someone to put one of the big machine-learning systems to work on Wordle. All I’ve seen so far is a small-scale study, based on the technique of reinforcement learning, done by Benton J. Anderson and Jesse G. Meyer. Their results are not impressive, and I am led to wonder if there’s something about the problem that thwarts learning techniques just as it does other algorithmic approaches.

Wordle falls into the class of combinatorial word games. All you need to know is how letters go together to make a word; meaning is beside the point. Most games of this kind are highly susceptible to computational force majeure. A few lines of code and a big word list will find exhaustive solutions in milliseconds. For example, another New York Times game called Spelling Bee asks you to make words out of seven given letters, with one special letter required to appear in every word. I’m not very good at Spelling Bee, but my computer is an ace. The same code would solve the Jumble puzzles on the back pages of the newspaper. With a little more effort the program could handle Lewis Carroll’s Word Links (better known today as Don Knuth’s Word Ladders). And it’s a spiffy tool for cheating at Scrabble.

In this respect Wordle is different. One can easily write a program that plays a competent game. Even a program that chooses words at random can turn in a respectable score. But this level of proficiency is nothing like the situation with Spelling Bee or Jumble, where the program utterly annihilates all competition, leaving no shred of the game where the human player could cling to claims of supremacy. In Wordle, every now and then I beat my own program. How can that happen, in this day and age?

The answer might be as simple and boring as computational complexity. If I want my program to win, I’ll have to invest more CPU cycles. Or there might be a super-clever Wordle-wrangling algorithm, and I’ve just been too dumb to find it. Then again, there might be something about Wordle that sets it apart from other combinatorial word games. That would be interesting.

Notes

Note 1. History of the game and of the word lists.

The charming story of Wordle’s creation was told last January in the New York Times. “Wordle Is a Love Story” read the headline. Josh Wardle, a software developer formerly at Reddit, created the game as a gift to his partner, Palak Shah, a fan of word games. The Times story, written by Daniel Victor, marvelled at the noncommercial purity of the website: “There are no ads or flashing banners; no windows pop up or ask for money.” Three weeks later the Times bought the game, commenting in its own pages, “The purchase . . . reflects the growing importance of games, like crosswords and Spelling Bee, in the company’s quest to increase digital subscriptions to 10 million by 2025.” So much for love stories.

As far as I can tell, the new owners have not fiddled with the rules of the game, but there have been a few revisions to the word lists. Here’s a summary based on changes observed between February and May.

Six words were removed from the list of common words (a.k.a. target words), later added to the list of arcane words, then later still removed from that list as well, so that they are no longer valid as either a guess or a target:

AGORA, PUPAL, LYNCH, FIBRE, SLAVE, WENCH

Twenty-two words were moved from various positions in the common words list to the end of that list (effectively delaying their appearance as Wordle-of-the-day until sometime in 2027):

BOBBY, ECLAT, FELLA, GAILY, HARRY, HASTY, HYDRO, LIEGE, OCTAL, OMBRE, PAYER, SOOTH, UNSET, UNLIT, VOMIT, FANNY, FETUS, BUTCH, STALK, FLACK, WIDOW, AUGUR

Two words were moved forward from near the end of the common list to a higher position where they replaced FETUS and BUTCH:

SHINE, GECKO

Two words were removed from the arcane list and not replaced:

KORAN, QURAN

Almost all the changes to the common list affect words that would have been played at some point in 2022 if they had been left in place. I expect further purges when the editors get around to vetting the rest of the list.

Note 2. The Umpire’s scoring rule.

When I first started playing Wordle, I had a simple notion of what the tile colors meant. If a tile was colored green, then that letter of the guess word appeared in the same position in the target word. A gold tile meant the letter would be found elsewhere in the target word. A gray tile indicated the letter was entirely absent from the target word. For my first Wordler program I wrote a procedure implementing this scheme, although its output consisted of numbers rather than colors: green = 2, gold = 1, gray = 0.

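function scoreGuess(guess, target)
  # Naive rule: each letter of the guess is judged independently of the others.
  score = [0, 0, 0, 0, 0]              # every tile starts out gray
  for i in 1:5
    if guess[i] == target[i]
      score[i] = 2                     # green: right letter, right position
    elseif occursin(guess[i], target)
      score[i] = 1                     # gold: letter occurs somewhere in the target
    end
  end
  return score
end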

This rule and its computer implementation work correctly as long as we never apply them to words with repeated letters. But suppose the target word is MODEM and you play the guess MUDDY. Following the rule above, the Umpire would offer this feedback: green M, gray U, green D, gold D, gray Y. Note the gold coloring of the second D. It’s the correct marking according to the stated rule, because the target word has a D not in the fourth position. But that D in MODEM is already “spoken for”; it is matched with the green D in the middle of MUDDY. The gold coloring of the second D could be misleading, suggesting that there’s another D in the target word.

The Wordle app would color the second D gray, not gold: green M, gray U, green D, gray D, gray Y. The rule, apparently, is that each letter in the target word can be matched with only one letter in the guess word. Green matches take precedence.

There remains some uncertainty about which letter gets a gold score when there are multiple options. If MUDDY is the target word and ADDED is the guess word, we know that middle D will be colored green, but which of the other two Ds in ADDED gets a gold tile? I have not been able to verify how the Wordle app handles this situation, but my program assigns priority from left to right: gray A, gold D, green D, gray E, gray D.

This minor amendment to the rules brings a considerable cost in complexity to the code. We need an extra data structure (the array tagged) and an extra loop. The first loop (with index i) finds and marks all the green matches; the second loop (indices j and k) identifies gold tiles that have not been preempted by earlier green or gold matches.

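function scoreGuess(guess, target)
  score  = [0, 0, 0, 0, 0]   # feedback for each letter of the guess
  tagged = [0, 0, 0, 0, 0]   # nonzero once a target letter has been matched
  # First pass: find and mark the green (exact-position) matches.
  for i in 1:5
    if guess[i] == target[i]
      score[i] = 2
      tagged[i] = 2
    end
  end
  # Second pass: working left to right, give gold to a guess letter that matches
  # a target letter not already claimed by a green or an earlier gold.
  for j in 1:5
    for k in 1:5
      if guess[j] == target[k] && score[j] == 0 && tagged[k] == 0
        score[j] = 1
        tagged[k] = 1
      end
    end
  end
  return score
end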

It is this element of the scoring rules that forbids Zip codes such as 22221 (four greens and one gold) and its permutations in Figure 6. Under the naive rules, the guess STALL played against the target STALE would have been scored 22221. The refined rule renders it as 22220.

Note 3. The virtues of a uniform distribution.

Let’s think about a game that’s simpler than Wordle, though doubtless less fun to play. You are given a deck of 64 index cards, each with a single word written on it; one of the words, known only to the Umpire, is the target. Your instructions are to “cut” the deck, dividing it into two heaps. Then the all-seeing Umpire will tell you which heap includes the target word. For the next round of play, you set aside the losing heap, and divide the winning heap into two parts; again the Umpire indicates which pile holds the winning word. The process continues until the heap approved by the Umpire consists of a single card, which must necessarily be the target. The question is: How many rounds of play will it take to reach this decisive state?

If you always divide the remaining stack of cards into two equal heaps, this question is easy to answer. On each turn, the number of cards and words remaining in contention is cut in half: from 64 to 32, then on down to 16, 8, 4, 2, 1. It takes six halvings to resolve all uncertainty about the identity of the target. This bisection algorithm is a central element of information theory, where the fundamental unit of measure, the bit, is defined as the amount of information needed to choose between two equally likely alternatives. Here the choices are the two heaps of equal size, which have the same probability of holding the target word. When the Umpire points to one heap or the other, that signal provides exactly one bit of information. To go all the way from 64 equally likely candidates to one identified target takes six rounds of guessing, and six bits of information.

Bisection is known to be an optimal strategy for many problem-solving tasks, but the source of its strength is sometimes misunderstood. What matters most is not that we split the deck into two parts but that we divide it into equal subsets. Suppose we cut a pack of 64 cards into four equal heaps rather than two. When the Umpire points to the pile that includes the target, we get twice as much information. The search space has been reduced by a factor of four, from 64 cards to 16. We still need to acquire a total of six bits to solve the problem, but because we are getting two bits per round, we can do it in three splittings rather than six. In other words, we have a tradeoff between more simple decisions and fewer complex decisions.

Figure 21 illustrates the nature of this tradeoff by viewing a decision process as traversing a tree from the root node (at the top) to one of 16 leaf nodes (at the bottom). For the binary tree on the left, finding your way from the root to a leaf requires making four decisions, in each case choosing one of two paths. In the quaternary tree on the right, only two decisions are needed, but there are four options at each level. Assuming a four-way choice “costs” twice as much as a two-way choice, the information content is the same in both cases: four bits.

Figure 21.

But what if we decide to split the deck unevenly, producing a larger and a smaller heap? For example, a one-quarter/three-quarter division would yield a pile of 16 cards on the left and 48 cards on the right. If the target happens to lie in the smaller heap, we are better off than we would be with an even split: We’ve gained two bits of information instead of one, since the search space has shrunk from 64 cards to 16. However, the probability of this outcome is only 1/4 rather than 1/2, and so our expected gain from this branch is only one-half of a bit. When the target card is in the larger pile, we acquire less than one bit of information, since the search space has fallen from 64 cards only as far as 48, and the probability of this event is 3/4. Averaging across the two possible outcomes, the loss outweighs the gain.
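Written out as arithmetic, the 16/48 cut weights the information from each outcome by its probability (decimal values are rounded):

\[\frac{1}{4} \log_2 \frac{64}{16} + \frac{3}{4} \log_2 \frac{64}{48} \approx \frac{1}{4}(2) + \frac{3}{4}(0.415) \approx 0.811 .\]

That is roughly 0.81 bit per round, compared with exactly 1 bit for the even split.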

Figure 22 shows the information budget for every possible cut point in a deck of 64 cards. The red curves labeled left and right show the number of bits obtained from each of the two piles as their size varies. (The x axis labels the size of the left pile; wherever the left pile has $n$ cards, the right pile has $64 - n$.) The overarching green curve is the sum of the left and right values. Note that the green curve has its peak in the center, where the deck has been split into two equal subsets of 32 cards each. This is the optimum strategy.

Figure 22. The equation for the left curve is $y = (n/m)(\log_2 m - \log_2 n)$, where $m = 64$ and $n$ is the size of the left subset. For the right curve, substitute $m - n$ for $n$.
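The curves are easy to recompute. The Python sketch below is an illustration only, not the program that drew Figure 22; the names `pile_bits` and `cut_bits` are invented here, and an empty pile is assumed to contribute nothing.

```python
from math import log2

def pile_bits(n, m=64):
    """Expected bits contributed by a pile of n cards out of m: (n/m)(log2 m - log2 n)."""
    if n == 0:
        return 0.0  # assumption: an empty pile can never hold the target
    return (n / m) * (log2(m) - log2(n))

def cut_bits(n, m=64):
    """Sum of the left and right curves for a cut into piles of n and m - n cards."""
    return pile_bits(n, m) + pile_bits(m - n, m)

best = max(range(65), key=cut_bits)
print(best, cut_bits(best))      # 32 1.0
print(round(cut_bits(16), 3))    # 0.811
```

The maximum sits at the even split, and the 16/48 cut comes in at about 0.81 bit.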

I have made this game go smoothly by choosing 64 as the number of cards in the deck. Because 64 is a power of 2, you can keep dividing by 2 and never hit an odd number until you come to 1. With decks of other sizes the situation gets messy; you can’t always split the deck into two equal parts. Nevertheless, the same principles apply, with some compromises and approximations. Suppose we have a deck of 2,315 cards, each inscribed with one of the Wordle common words. (The deck would be about two feet high.) We repeatedly split it as close to the middle as possible, allowing an extra card in one pile or the other when necessary. Eleven bisections would be enough to distinguish one card out of $2^{11} = 2{,}048$, which falls a little short of the goal. With 12 bisections we could find the target among $2^{12} = 4{,}096$ cards, which is more than we need. There’s a mathematical function that interpolates between these fenceposts: the base-2 logarithm. Specifically, $\log_2 2{,}315$ is about 11.18, which means the amount of information we need to solve a Wordle puzzle is 11.18 bits. (Note that $2^{11.18} \approx 2315$. Of the 2,315 positions where the target card might lie within the deck, 1,781 are resolved in 11 bisections, and 534 require a 12th cut.)
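The tally of cuts per card can be checked by brute force. This Python sketch assumes the splitting rule described above (halve the pile, giving the extra card to one side when the count is odd); the function name `leaf_depths` is made up for the illustration.

```python
from collections import Counter

def leaf_depths(n, depth=0, tally=None):
    """Tally how many cuts each card needs when a pile of n cards is
    repeatedly split as evenly as possible until single cards remain."""
    if tally is None:
        tally = Counter()
    if n == 1:
        tally[depth] += 1          # a lone card is identified after `depth` cuts
    else:
        leaf_depths(n // 2, depth + 1, tally)        # smaller half
        leaf_depths(n - n // 2, depth + 1, tally)    # larger half (extra card, if any)
    return tally

print(sorted(leaf_depths(2315).items()))   # [(11, 1781), (12, 534)]
```

For a deck of 64 the same function reports that all 64 cards need exactly six cuts.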

Note 4. Understanding the Shannon entropy equation.

The foundational document of information theory, Claude Shannon’s 1948 paper “A Mathematical Theory of Communication,” gives the entropy equation in this form:

\[H = -\sum_{i} p_i \log_2 p_i .\]

The equation’s pedigree goes back further, to Josiah Willard Gibbs and Ludwig Boltzmann, who introduced it in a different context, the branch of physics called statistical mechanics.

Over the years I’ve run into this equation many times, but I do not count it among my close friends. Whenever I come upon it, I have to stop and think carefully about what the equation means and how it works. The variable $p_i$ represents a probability. We are asked to multiply each $p_i$ by the logarithm of $p_i$, then sum up the products and negate the result. Why would one want to do such a thing? What does it accomplish? How do these operations reveal something about entropy and information?

When I look at the righthand side of the equation, I see an expression of the general form $n \log n$, which is a familiar trope in several areas of mathematics and computer science. It generally denotes a function that grows faster than linear but slower than quadratic. For example, sorting $n$ items might require $n \log n$ comparison operations. (For any $n$ greater than 2, $n \log_2 n$ lies between $n$ and $n^2$.)

That’s not what’s going on here. Because $p$ represents a probability, it must lie in the range $0 \le p \le 1$. The logarithm of a number in this range is negative or at most zero. This gives $p \log p$ a different complexion. It also draws attention to that weird minus sign hanging out in front of the summation symbol.

I believe another form of the equation is easier to understand, even though the transformation introduces more bristly mathematical notation. Let’s begin by moving the minus sign from outside to inside the summation, attaching it to the logarithmic factor:

\[H = \sum_{i} p_i (-\log_2 p_i) .\]

The rules for logarithms and exponents tell us that $-\log x = \log x^{-1}$, and that $x^{-1} = 1/x$. We can therefore rewrite the equation again as

\[H = \sum_{i} p_i \log_2 \frac{1}{p_i} .\]

Now the nature of the beast becomes a little clearer. The factors $p$ and $\log (1/p)$ pull in opposite directions. If you make $p$ larger, you make $\log (1/p)$ smaller, and vice versa. At both extremes of the allowed range, near $p = 0$ and $p = 1$, the value of $p \log (1/p)$ approaches 0. In between these limits, $p \log (1/p)$ is always positive, and should have a maximum for some value of $p$. Figure 23 shows that this is true. It’s no coincidence that the curve has the same form as one of those in Figure 22. (If you’re curious about the location of the peak, I’ll give you a hint: It lies where $p$ is equal to the reciprocal of a famous number.)

Figure 23.

We can now bring in some of the particulars of the Wordle problem. The probability $p_i$ is in fact the probability that the target word is found in Zip code $i$. This probability is equal to $n_i / m$, where $n_i$ is the number of words assigned to Zip code $i$, and $m$ is the total number of words in all the Zip codes. (Note that $n_i / m$ has the proper behavior for a probability: It always lies between 0 and 1, and the sum of $n_i / m$ for all $i$ is equal to 1.) Making the substitutions $p_i \rightarrow n_i / m$ and $1 / p_i \rightarrow m / n_i$, we get the new equation

\[H = \sum_{i} \frac{n_i}{m} \log_2 \frac{m}{n_i} .\]

And now for the final transformation. The laws of logarithms state that $\log (x/y)$ is equal to $\log x - \log y$, so we can rewrite the Shannon equation as

\[H = \sum_{i} \frac{n_i}{m} (\log_2 m - \log_2 n_i) .\]

In this form the equation is easy to relate to the problem of solving a Wordle puzzle. The quantity $\log_2 m$ is the total amount of information (measured in bits) that we need to acquire in order to identify the target word; $\log_2 n_i$ is the amount of information we will still need to ferret out if Zip code $i$ turns out to hold the target word. The difference of these two quantities is what we gain if the target is in Zip code $i$. The coefficient $n_i / m$ is the probability of this event.

One further emendation is needed. The logarithm function is undefined at 0, as $\log x$ diverges toward negative infinity as $x$ approaches 0 from above. Thus we have to exclude from the summation all terms with $n_i = 0$. We can also fill in the limits of the index $i$.

\[H = \sum_{\substack{i = 1\\n_i \ne 0}}^{243} \frac{n_i}{m} (\log_2 m - \log_2 n_i) .\]
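As a check that the rewritten equation agrees with Shannon’s original form, here is a minimal Python sketch. The function names and the sample counts are invented for the illustration; the input is assumed to be a list holding the number of words $n_i$ in each Zip code.

```python
from math import log2

def entropy_from_counts(counts):
    """H = sum of (n_i/m)(log2 m - log2 n_i), skipping empty Zip codes."""
    m = sum(counts)
    return sum((n / m) * (log2(m) - log2(n)) for n in counts if n != 0)

def entropy_from_probs(probs):
    """Shannon's original form, H = -sum of p_i log2 p_i, skipping p_i = 0."""
    return -sum(p * log2(p) for p in probs if p != 0)

counts = [16, 48, 0]                       # made-up example: 64 words in three codes
probs = [n / sum(counts) for n in counts]
print(round(entropy_from_counts(counts), 3))   # 0.811
print(round(entropy_from_probs(probs), 3))     # 0.811
```

Eight equally populated Zip codes would yield exactly 3 bits, the uniform case favored in Note 3.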

Shannon’s discussion of the entropy equation is oddly diffident. He introduces it by listing a few assumptions that the $H$ function should satisfy, and then asserting as a theorem that $H = -\Sigma p \log p$ is the only equation meeting these requirements. In an appendix to the 1948 paper he proves the theorem. But he also goes on to write, “This theorem, and the assumptions required for its proof, are in no way necessary for the present theory [i.e., information theory]. It is given chiefly to lend a certain plausibility to some of our later definitions.” I’m not sure what to make of Shannon’s cavalier dismissal of his own theorem, but in the case of Wordle analysis he seems to be right. Other measures of dispersion work just as well as the entropy function.

Does having prime neighbors make you more composite?

Brian Hayes — Thu, 04 Nov 2021 19:35:40 +0000

Lately I’ve been thinking about the number 60.

Babylonian accountants and land surveyors did their arithmetic in base 60, presumably because sexagesimal numbers help with wrangling fractions. When you organize things in groups of 60, you can divide them into halves, thirds, fourths, fifths, sixths, tenths, twelfths, fifteenths, twentieths, thirtieths, and sixtieths. No smaller number has as many divisors, a fact that puts 60 in an elite class of “highly composite numbers.” (The term and the definition were introduced by Srinivasa Ramanujan in 1915.)

There’s something else about 60 that I never noticed until a few weeks ago—although the Babylonians might well have known it, and Ramanujan surely did. The number 60, with its extravagant wealth of divisors, sits wedged between two other numbers that have no divisors at all except for 1 and themselves. Both 59 and 61 are primes. Such pairs of primes, separated by a single intervening integer, are known as twin primes. Other examples are (5, 7), (29, 31), and (1949, 1951). Over the years twin primes have gotten a great deal of attention from number theorists. Less has been said about the number in the middle—the interloper that keeps the twins apart. At the risk of sounding slightly twee, I’m going to call such middle numbers twin tweens, or more briefly just tweens.

Is it just a fluke that a number lying between two primes is outstandingly unprime? Is 60 unusual in this respect, or is there a pattern here, common to all twin primes and their twin tweens? One can imagine some sort of fairness principle at work: If $n$ is flanked by divisor-poor neighbors, it must have lots of divisors to compensate, to balance things out. Perhaps every pair of twin primes forms a chicken salad sandwich, with two solid slabs of prime bread surrounding a squishy filling that gets chopped up into many small pieces.

As a quick check on this hypothesis, let’s plot the number of divisors $d(n)$ for each integer in the range from $n = 1$ to $n = 75$:

Figure 1.

In Figure 1 twin primes are marked by light blue dots, and their associated tweens are dark blue. Highly composite numbers (those $n$ that have more divisors than any smaller $n$) are distinguished by golden haloes. (Note that 1 and 2 are listed as highly composite numbers even though they are not composite at all. Go figure.) The graph reveals that several twin tweens (4, 6, 12, and 60) are indeed record-setting highly composite numbers, but other tweens are not (18, 30, 42, and 72). And some highly composite numbers (24, 36, 48) do not lie between twin primes. Still, all the dark blue dots float somewhere near the upper margin of the plot, leaving the clear impression that twin tweens tend to have a lot of divisors, more than a typical integer of the same size.

The interval 1–75 is a very small sample of the natural numbers, and an unusual one, since twin primes are abundant among small integers but become quite rare farther out along the number line. For a broader view of the matter, consider the number of divisors for all positive integers up to $n = 10^8$. The champion of divisibility among these numbers is $n =$ 73,513,440, which has 768 divisors. It is not a twin tween. The average value of $d(n)$ over this range is 18.5751. But if you look only at the twin tweens (there are 440,312 of them less than $10^8$), the average divisor count is almost three times as large: 51.5889. Numbers that have a single prime neighbor (either $n + 1$ or $n - 1$ but not both) also have a high average $d(n)$: 32.1199.
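A small-scale version of this census is easy to run. The Python sketch below is not the program behind the figures quoted above; it is a plain divisor-counting sieve with invented function names, and it is meant for limits far below $10^8$, so its averages will differ from those in the text.

```python
def divisor_counts(limit):
    """d[n] = number of divisors of n for 1 <= n <= limit (d[0] is unused)."""
    d = [0] * (limit + 1)
    for k in range(1, limit + 1):
        for multiple in range(k, limit + 1, k):
            d[multiple] += 1           # k divides every multiple of k
    return d

def mean_divisors_by_prime_neighbors(limit):
    """Average d(n) over 2 <= n <= limit, grouped by how many of n-1 and n+1
    are prime. Index 0: no prime neighbor, 1: one, 2: twin tween."""
    d = divisor_counts(limit + 1)
    totals, sizes = [0, 0, 0], [0, 0, 0]
    for n in range(2, limit + 1):
        k = (d[n - 1] == 2) + (d[n + 1] == 2)   # a prime has exactly two divisors
        totals[k] += d[n]
        sizes[k] += 1
    return [t / s if s else float("nan") for t, s in zip(totals, sizes)]

print(mean_divisors_by_prime_neighbors(100_000))
```

The three averages come out in the same order as in the larger survey: tweens highest, then numbers with one prime neighbor, then numbers with none.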

Figure 2 plots $d(n)$ over the same range, breaking up the sequence of 100 million numbers into 500 blocks of size 200,000, and taking the average value of $d(n)$ within each block.

Figure 2.

A glance at the graph leaves no doubt that numbers living next door to primes have many more divisors, on average, than numbers without prime neighbors. It’s as if the primes were heaving all their divisors over the fence into the neighbor’s yard. Or maybe it’s the twin tweens who are the offenders here, vampirishly sucking all the divisors out of nearby primes.

Allow me to suggest a less-fanciful explanation. All primes (with one notorious exception) are odd numbers, which means that all nearest neighbors of primes (again with one exception) are even. In other words, the neighbors of primes have 2 as a divisor, which gives them an immediate head start in the race to accumulate divisors. Twin tweens have a further advantage: All of them (with one exception) are divisible by 3 as well as by 2. Why? Among any three consecutive integers, one of them must be a multiple of 3, and it can’t be either of the primes, so it must be the tween.

Being divisible by both 2 and 3, a twin tween is also divisible by 6. Any other prime factors of the tween combine with 2 and 3 to produce still more divisors. For example, a number divisible by 2, 3, and 5 is also divisible by 10, 15, and 30. Given this multiplicative effect, it seems possible that divisibility by 2 and 3 is all that’s needed to explain the tweens’ distinctive abundance of divisors. According to this hypothesis, the proximity of primes has nothing to do with it; tweens are divisor-rich simply because they are multiples of 6. Figure 3 supports this idea. For $n \le 10^8$, integers congruent to zero modulo 6 have, on average, more than twice as many divisors as those in any other residue class. (The primes are in classes 1 and 5.)

Figure 3.

However, a closer look at Figure 3 gives reason for caution. In the graph the mean $d(n)$ for numbers divisible by 6 is about 43, but we already know that for tweens—for the subset of numbers divisible by 6 that happen to live between twin primes—$d(n)$ is greater than 51. That further enhancement argues that nearby primes do, after all, have some influence on divisor abundance.

Further evidence comes from another plot of $d(n)$ for numbers with and without prime neighbors, but this time limited to integers divisible by 6. Thus all members of the sample population have the same “head start.” Figure 4 presents the results of this experiment.

Figure 4.

If prime neighbors had no effect (other than ensuring divisibility by 6), the blue, green, and red curves would all trace the same trajectory, but they do not. Although the tweens’ lead in the divisor race is narrowed somewhat, it is certainly not abolished. Numbers with two prime neighbors have about 20 percent more divisors than the overall average for numbers divisible by 6. Numbers with one prime neighbor are also slightly above average. Thus factors of 2 and 3 can’t be the whole story.

Here’s a hand-wavy attempt to explain what might be going on. In the same way that any three consecutive integers must include one that’s a multiple of 3, any five consecutive integers must include a multiple of 5. If you choose an integer $n$ at random, you can be certain that exactly one member of the set $\{n - 2, n - 1, n, n + 1, n + 2\}$ is divisible by 5. Since $n$ was chosen randomly, all members of the set are equally likely to assume this role, and so you can plausibly say that 5 divides $n$ with probability 1/5.

But suppose $n$ is a twin tween. Then $n - 1$ and $n + 1$ are known to be prime, and neither of them can be a multiple of 5. You must therefore redistribute the probability over the remaining three members of the set. Now it seems that $n$ is divisible by 5 with probability 1/3. You can make a similar argument about divisibility by 7, or 11, or any other prime. In each case the probability is enhanced by the presence of nearby primes.

The same argument works just as well if you turn it upside down. Knowing that $n$ is even tells you that $n - 1$ and $n + 1$ are odd. If $n$ is also divisible by 3, you know that $n - 1$ and $n + 1$ do not have 3 as a factor. Likewise with 5, 7, and so on. Thus finding an abundance of divisors in $n$ raises the probability that $n$’s neighbors are prime.

Does this scheme make sense? There’s room for more than a sliver of doubt. Probability has nothing to do with the distribution of divisors among the integers. The process that determines divisibility is as simple as counting, and there’s nothing random about it. Imagine you are dealing cards to players seated at a very long table, their chairs numbered in sequence from 1 to infinity. First you deal a 1 card to every player. Then, starting with player 2, you deal a 2 card to every second player. Then a 3 card to player 3 and to every third player thereafter, and so on. When you finish (if you finish!), each player holds cards for all the divisors of his or her chair number, and no other cards.

This card-dealing routine sounds to me like a reasonable description of how integers are constructed. Adding a probabilistic element modifies the algorithm in a crucial way. As you are dealing out the divisors, every now and then a player refuses to accept a card, saying “Sorry, I’m prime; please give it to one of my neighbors.” You then select a recipient at random from the set of neighbors who lie within a suitable range.

Building a number system by randomly handing out divisors like raffle tickets might make for an amusing exercise, but it will not produce the numbers we all know and love. Primes do not wave off the dealer of divisors; on the contrary, a prime is prime because none of the cards between 1 and $n$ land at its position. And integers divisible by 5 are not scattered along the number line according to some local probability distribution; they occur with absolute regularity at every fifth position. Introducing probability in this context seems misleading and unhelpful.

And yet… And yet…! It works.

Figure 5.

Figure 5 shows the proportion of all $n \le 10^8$ that are divisible by 5, classified according to the number of primes adjacent to $n$. The overall average is 1/5, as it must be. But among twin tweens, with two prime neighbors, the proportion is close to 1/3, as the probabilistic model predicts. And about 1/4 of the numbers with a single prime neighbor are multiples of 5, which again is in line with predictions of the probabilistic model. And note that values of $n$ with no prime neighbors have a below-average fraction of multiples of 5. In one sense this fact is unsurprising and indeed inescapable: If the average is fixed and one subgroup has an excess, then the complement of that subgroup must have a deficit. Nevertheless, it seems strange. How can an absence of nearby primes depress the density of multiples of 5 below the global average? After all, we know that multiples of 5 invariably come along at every fifth integer.
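The experiment is simple to repeat on a smaller scale. This Python sketch is an assumption-laden stand-in for whatever program produced Figure 5: it uses an ordinary sieve of Eratosthenes, invented function names, and a limit well below $10^8$, so the fractions will only approximate those in the figure.

```python
from math import isqrt

def prime_flags(limit):
    """Sieve of Eratosthenes: flags[n] is True exactly when n is prime."""
    flags = [True] * (limit + 1)
    flags[0] = flags[1] = False
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            for multiple in range(p * p, limit + 1, p):
                flags[multiple] = False
    return flags

def share_divisible_by(q, limit):
    """Fraction of 2 <= n <= limit divisible by q, grouped by the number of
    prime neighbors of n (index 0, 1 or 2)."""
    is_prime = prime_flags(limit + 1)
    hits, sizes = [0, 0, 0], [0, 0, 0]
    for n in range(2, limit + 1):
        k = is_prime[n - 1] + is_prime[n + 1]
        sizes[k] += 1
        hits[k] += (n % q == 0)
    return [h / s if s else float("nan") for h, s in zip(hits, sizes)]

print(share_divisible_by(5, 100_000))
```

Calling `share_divisible_by(7, ...)` or `share_divisible_by(11, ...)` tests the same argument for other primes.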

In presenting these notes on the quirks of tweens, I don’t mean to suggest there is some deep flaw or paradox in the theory of numbers. The foundations of arithmetic will not crumble because I’ve encountered more 5s than I expected in prime-rich segments of the number line. No numbers have wandered away from their proper places in the sequence of integers; we don’t have to track them down and put them back where they belong. What needs adjustment is my understanding of their distribution. In other words, the question is not so much “What’s going on?” but “What’s the right way to think about this?”

I know several wrong ways. The idea that twin primes repel divisors and tweens attract them is a just-so story, like the one about the crocodile tugging on the elephant’s nose. It might be acceptable as a metaphor, but not as a mechanism. There are no force fields acting between integers. Numbers cannot sense the properties of their neighbors. Nor do they have personalities; they are not acquisitive or abstemious, gregarious or reclusive.

The probabilistic formulation seems better than the just-so story in that it avoids explicit mention of causal links between numbers. But that idea still lurks beneath the surface. What does it mean to say “The presence of prime neighbors increases a number’s chance of being divisible by 5”? Surely not that the prime is somehow influencing the outcome of a coin flip or the spin of a roulette wheel. The statement makes sense only as an empirical, statistical observation: In a survey of many integers, more of those near primes are found to have 5 as a divisor than those far from primes. This is a true statement, but it doesn’t tell us why it’s true. (And there’s no assurance the observation holds for all numbers.)

Probabilistic reasoning is not a novelty in number theory. In 1936 Harald Cramér wrote:

With respect to the ordinary prime numbers, it is well known that, roughly speaking, we may say that the chance that a given integer $n$ should be a prime is approximately $1 / \log n$.

Cramér went on to build a whole probabilistic model of the primes, ignoring all questions of divisibility and simply declaring each integer to be prime or composite based on the flip of a coin (biased according to the $1 / \log n$ probability). In some respects the model works amazingly well. As Figure 6 shows, it not only matches the overall trend in the distribution of primes, but it also gives a pretty good estimate of the prevalence of twin primes.

Figure 6.

However, much else is missing from Cramér’s random model. In particular, it totally misses the unusual properties of twin tweens. In the region of the number line shown in Figure 6, from 1 to $10^7$, true tweens have an average of 44 divisors. The Cramér tweens are just a random sampling of ordinary integers, with about 16 divisors on average.
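For readers who want to play with the random model, here is a minimal Python sketch. It is not the code behind Figure 6; the function names, the fixed seed, and the choice to start at $n = 3$ (where $1 / \log n$ first drops below 1) are all assumptions made for the illustration.

```python
import random
from math import log

def cramer_primes(limit, seed=0):
    """Flip a biased coin for each 3 <= n <= limit and declare n 'prime'
    with probability 1/ln(n). Divisibility plays no part."""
    rng = random.Random(seed)
    return [n for n in range(3, limit + 1) if rng.random() < 1 / log(n)]

def count_twins(primes):
    """Count pairs of listed numbers that differ by exactly 2."""
    members = set(primes)
    return sum(1 for p in primes if p + 2 in members)

fake = cramer_primes(100_000)
print(len(fake), count_twins(fake))
```

Because the coin ignores parity, the "tweens" of this model are as likely to be odd as even, which is one way to see why they look like ordinary integers.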

Fundamentally, the distribution of twin tweens is inextricably tangled up with the distribution of twin primes. You can’t have one without the other. And that’s not a helpful development, because the distribution of primes (twin and otherwise) is one of the deepest enigmas in modern mathematics.

Throughout these musings I have characterized the distinctive properties of tweens by counting their divisors. There are other ways of approaching the problem that yield similar results. For example, you can calculate the sum of the divisors, $\sigma(n)$ rather than the count $d(n)$, a technique that leads to the classical notions of abundant, perfect, and deficient numbers. A number $n$ is abundant if $\sigma(n) > 2n$, perfect if $\sigma(n) = 2n$ and deficient if $\sigma(n) \lt 2n$. When I wrote a program to sort tweens into these three categories, I was surprised to discover that apart from 4 and 6, every twin tween is an abundant number. Such a definitive result seemed remarkable and important. But then I learned that every number divisible by 6, other than 6 itself, is abundant.
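The reason is short enough to write down. If $n = 6k$ with $k > 1$, then $n$, $n/2$, $n/3$, $n/6$, and 1 are five distinct divisors of $n$, so

\[\sigma(n) \ge n + \frac{n}{2} + \frac{n}{3} + \frac{n}{6} + 1 = 2n + 1 > 2n .\]

For $n = 6$ itself the divisors $n/6$ and 1 coincide, and the sum is exactly $12 = 2n$, which makes 6 perfect rather than abundant.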

Another approach is to count the prime factors of $n$, rather than the divisors. These two quantities are highly correlated, although the number of divisors is not simply a function of the number of factors; it also depends on the diversity of the factors.

Figure 7.

As Figure 7 shows, counting primes tells a story that’s much the same as counting divisors. A typical integer in the range up to $10^8$ has about four factors, whereas a typical tween in the same range has more than six.

We could also look at the size of $n$’s largest prime factor, $f_{\max}(n)$, which is connected to the concept of a smooth number. A number is smooth if all of its prime factors are smaller than some stated bound, which might be a fixed constant or a function of $n$, such as $\sqrt n$. One measure of smoothness is $\log n\, / \log f_{\max}(n)$. Computations show that by this definition tweens are smoother than average: The ratio of logs is about 2.0 for the tweens and about 1.7 for all numbers.

One more miscellaneous fact: No tween except 4 is a perfect square. Proof: Suppose $n = m^2$ is a tween. Then $n - 1 = m^2 - 1$, which has the factors $m - 1$ and $m + 1$, and so it cannot be prime unless $m - 1 = 1$, which is the exceptional case $n = 4$. An extension of this argument rules out cubes and all higher perfect powers as twin tweens.
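One way to spell out that extension, offered as a sketch rather than as the intended argument: for any exponent $k \ge 2$,

\[m^k - 1 = (m - 1)(m^{k-1} + m^{k-2} + \cdots + m + 1) .\]

When $m > 2$ both factors exceed 1, so $n - 1 = m^k - 1$ is composite. When $m = 2$ the first factor is 1 and a different observation is needed: $2^k$ is not a multiple of 3, so one of its neighbors $2^k - 1$ and $2^k + 1$ must be, and for $k \ge 3$ both neighbors are larger than 3, hence one of them is composite. That leaves $2^2 = 4$ as the only perfect power among the tweens.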

When I first began to ponder the tweens, I went looking to see what other people might have said on the subject. I didn’t find much. Although the literature on twin primes is immense, it focuses on the primes themselves, and especially on the question of whether there are infinitely many twins—a conjecture that’s been pending for 170 years. The numbers sandwiched between the primes are seldom mentioned.

The many varieties of highly composite numbers also have an enthusiastic fan club, but I have found little discussion of their frequent occurrence as neighbors of primes.

Could it be that I’m the first person ever to notice the curious properties of twin tweens? No. I am past the age of entertaining such droll thoughts, even transiently. If I have not found any references, it’s doubtless because I’m not looking in the right places. (Pointers welcome.)

I did eventually find a handful of interesting articles and letters. The key to tracking them down, unsurprisingly, was the On-Line Encyclopedia of Integer Sequences, which more and more seems to function as the Master Index to Mathematics. I had turned there first, but the entry on the tween sequence, titled “average of twin prime pairs,” has only one reference, and it was not exactly a gold mine of enlightenment. It took me to a 1967 volume of Eureka, the journal of the Cambridge Archimedeans. All I found there (on page 16) was a very brief problem, asking for the continuation of the sequence 4, 6, 12, 18, 30, 42,…

There the matter rested for a few weeks, but eventually I came back to the OEIS to follow cross-links to some related sequences. Under “highly composite numbers” I found a reference to “Prime Numbers Generated From Highly Composite Numbers,” by Benny Lim (Parabola, 2018). Lim looked at the neighbors of the first 1,000 highly composite numbers. At the upper end of this range, the numbers are very large $(10^{76})$ and primes are very rare—but they are not so rare among the nearest neighbors of highly composite numbers, Lim found.

Another cross reference took me off to sequence A002822, labeled “Numbers $m$ such that $6m-1$, $6m+1$ are twin primes.” In other words, this is the set of numbers that, when multiplied by 6, yield the twin tweens. The first few terms are: 1, 2, 3, 5, 7, 10, 12, 17, 18, 23, 25, 30, 32, 33, 38. The OEIS entry includes a link to a 2011 paper by Francesca Balestrieri, which introduces an intriguing idea I have not yet fully absorbed. Balestrieri shows that $6m - 1$ is composite if $m$ can be expressed as $6xy + x - y$ for some positive integers $x$ and $y$, and otherwise is prime. There’s a similar but slightly more complicated rule for $6m + 1$. She then proceeds to prove the following theorem:

The Twin Prime Conjecture is true if, and only if, there exist infinitely many $m \in N$ such that $m \ne 6xy + x - y$ and $m \ne 6xy + x + y$ and $m \ne 6xy - x - y$, for all $x, y \in N$.
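The three excluded forms come from multiplying out the possible factorizations: $(6x + 1)(6y + 1) = 6(6xy + x + y) + 1$, $(6x - 1)(6y - 1) = 6(6xy - x - y) + 1$, and $(6x - 1)(6y + 1) = 6(6xy + x - y) - 1$. The Python sketch below sieves out those forms and lists the survivors. It is an illustration written for this passage, not code from Balestrieri’s paper; the function names are invented, and $x$ and $y$ are assumed to range over the positive integers.

```python
def excluded_indices(limit):
    """Every m <= limit of the form 6xy+x+y, 6xy-x-y or 6xy+x-y
    with x and y positive integers."""
    excluded = set()
    x = 1
    while 5 * x - 1 <= limit:                # 6xy-x-y at y = 1 equals 5x - 1
        y = 1
        while 6 * x * y - x - y <= limit:    # 6xy-x-y is the smallest of the three forms
            for m in (6*x*y + x + y, 6*x*y - x - y, 6*x*y + x - y):
                if m <= limit:
                    excluded.add(m)
            y += 1
        x += 1
    return excluded

def tween_indices(limit):
    """The m in 1..limit that escape all three forms, so that 6m is a twin tween."""
    excluded = excluded_indices(limit)
    return [m for m in range(1, limit + 1) if m not in excluded]

print(tween_indices(39))
# [1, 2, 3, 5, 7, 10, 12, 17, 18, 23, 25, 30, 32, 33, 38]
```

The survivors up to 39 reproduce the opening terms of A002822 quoted above.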

Other citations took me to three papers by Antonie Dinculescu, dated 2012 to 2018, which explore related themes. But the most impressive documents were two letters written to Neil J. A. Sloane, the founder and prime mover of the OEIS. In 1984 the redoubtable Solomon W. Golomb wrote to point out several publications from the 1950s and 60s that mention the connection between $6xy \pm x \pm y$ and twin primes. The earliest of these appearances was a problem in the American Mathematical Monthly, proposed and solved by Golomb himself. He was 17 when he made the discovery, and this was his first mathematical publication. To support his claim of priority, he offered a $100 reward to anyone who could find an earlier reference.

The second letter, from Matthew A. Myers of Spruce Pine, North Carolina, supplies two such prior references. One is a well-known history of number theory by L. E. Dickson, published in 1919. The other is an “Essai sur les nombres premiers” by Wolfgang Ludwig Krafft, a colleague of Euler’s at the St. Petersburg Academy of Sciences. The essay was read to the academy in 1798 and published in Volume 12 of the Nova Acta Academiae Scientiarum Imperialis Petropolitanae. It deals at length with the $6xy \pm x \pm y$ matter. This was 50 years before the concept of twin primes, and their conjectured infinite supply, was introduced by Alphonse de Polignac.

Myers reported these archeological findings in 2018. Sadly, Golomb had died two years earlier.

Riding the Covid coaster

Brian Hayes — Wed, 18 Aug 2021 22:00:22 +0000

Figure 1

Peaks and troughs, lumps and slumps, wave after wave of surge and retreat: I have been following the ups and downs of this curve, day by day, for a year and a half. The graph records the number of newly reported cases of Covid-19 in the United States for each day from 21 January 2020 through 20 July 2021. That’s 547 days, and also exactly 18 months. The faint, slender vertical bars in the background give the raw daily numbers; the bold blue line is a seven-day trailing average. (In other words, the case count for each day is averaged with the counts for the six preceding days.)
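For concreteness, a seven-day trailing average can be computed as in the following Python sketch. The function name is invented, and the handling of the first six days (averaging over however many days are available) is an assumption; the Times may treat the start of the series differently.

```python
def trailing_average(daily, window=7):
    """Average each day's count with the counts for the preceding window-1 days.
    Early days, which have fewer predecessors, are averaged over what exists."""
    averages = []
    running = 0
    for i, count in enumerate(daily):
        running += count
        if i >= window:
            running -= daily[i - window]   # drop the day that slid out of the window
        averages.append(running / min(i + 1, window))
    return averages

print(trailing_average([0, 0, 0, 0, 0, 0, 7, 7]))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
```

A trailing window lags behind the raw counts, which is why the bold line in such a graph turns a few days after the bars do.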

I struggle to understand the large-scale undulations of that graph. If you had asked me a few years ago what a major epidemic might look like, I would have mumbled something about exponential growth and decay, and I might have sketched a curve like this one:

Figure 2

My imaginary epidemic is so much simpler than the real thing! The number of daily infections goes up, and then it comes down again. It doesn’t bounce around like a nervous stock market. It doesn’t have seasonal booms and busts.

The graph tracing the actual incidence of the disease makes at least a dozen reversals of direction, along with various small-scale twitches and glitches. The big mountain in the middle has foothills on both sides, as well as some high alpine valleys between craggy peaks. I’m puzzled by all this structural embellishment. Is it mere noise—a product of random fluctuations—or is there some driving mechanism we ought to know about, some switch or dial that’s turning the infection process on and off every few months?

I have a few ideas about possible explanations, but I’m not so keen on any of them that I would try to persuade you they’re correct. However, I do hope to persuade you there’s something here that needs explaining.

Before going further, I want to acknowledge my sources. The data files I’m working with are curated by The New York Times, based on information collected from state and local health departments. Compiling the data is a big job; the Times lists more than 150 workers on the project. They need to reconcile the differing and continually shifting policies of the reporting agencies, and then figure out what to do when the incoming numbers look fishy. (Back in June, Florida had a day with –40,000 new cases.) The entire data archive, now about 2.3 gigabytes, is freely available on GitHub. Figure 1 in this article is modeled on a graph updated daily in the Times.

I must also make a few disclaimers. In noodling around with this data set I am not trying to forecast the course of the epidemic, or even to retrocast it—to develop a model accurate enough to reproduce details of timing and magnitude observed over the past year and a half. I’m certainly not offering medical or public-health advice. I’m just a puzzled person looking for simple mechanisms that might explain the overall shape of the incidence curve, and in particular the roller coaster pattern of recurring hills and valleys.

So far, four main waves of infection have washed over the U.S., with a fifth wave now beginning to look like a tsunami. Although the waves differ greatly in height, they seem to be coming at us with some regularity. Eyeballing Figure 1, I get the impression that the period from peak to peak is pretty consistent, at roughly four months.

Periodic oscillations in epidemic diseases have been noticed many times before. The classical example is measles in Great Britain, for which there are weekly records going back to the early 18th century. In 1917 John Brownlee studied the measles data with a form of Fourier analysis called the periodogram. He found that the strongest peak in the frequency spectrum came at a period of 97 weeks, reasonably consistent with the widespread observation that the disease reappears every second year. But Brownlee’s periodograms bristle with many lesser peaks, indicating that the measles rhythm is not a simple, uniform drumbeat. Of particular note is the work of M. S. Bartlett in the 1950s, which includes an early instance of computer modeling in epidemiology, using the Manchester University Computer. Later work, using different methods, suggested that the dynamics of measles epidemics may actually be chaotic, with no long-term order.

The mechanism behind the oscillatory pattern in measles is easy to understand. The disease strikes children in the early school years, and the virus is so contagious that it can run through an entire community in a few weeks. Afterwards, another outbreak can’t take hold until a new cohort of children has reached the appropriate age. No such age dependence exists in Covid-19, and the much shorter period of recurrence suggests that some other mechanism must be driving the oscillations. Nevertheless, it seems worthwhile to try applying Fourier methods to the data in Figure 1.

The Fourier transform decomposes any curve representing a function of time into a sum of simple sine and cosine waves of various frequencies. In textbook examples, the algorithm works like magic. Take a wiggly curve like this one:

Figure 3

Feed it into the Fourier transformer, turn the crank, and out comes a graph that looks like this, showing the coefficients of various frequency components:

Figure 4

Only two coefficients are substantially different from zero, corresponding to waves that make two or six full cycles over the span of the input curve. Those two coefficients capture all the information needed to reconstruct the input. If you draw the curves specified by the two coefficients, then add them up point by point, you get back a replica of the original.

Technical details: The classical Fourier transform yields complex coefficients, with real and imaginary parts. I am using a variant called the discrete cosine transform, which produces real coefficients. The input curve is generated by the function $\cos(x) + \cos(3x)$ over the interval from $0$ to $4\pi$.
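The textbook example can be reproduced with a short Python sketch. It assumes a plain type-II discrete cosine transform sampled at interval midpoints, with a $2/N$ scaling chosen for convenience; the article does not say which DCT variant or normalization was used, and the names `dct`, `curve`, and `significant` are illustrative.

```python
import math

def dct(samples):
    """Type-II discrete cosine transform, scaled by 2/N (an assumed normalization)."""
    n = len(samples)
    return [
        (2.0 / n) * sum(x * math.cos(math.pi * k * (j + 0.5) / n)
                        for j, x in enumerate(samples))
        for k in range(n)
    ]

n = 64
xs = [4 * math.pi * (j + 0.5) / n for j in range(n)]    # midpoints spanning 0 to 4*pi
curve = [math.cos(x) + math.cos(3 * x) for x in xs]
coeffs = dct(curve)

# Indices of the coefficients that differ from zero by more than rounding error.
significant = [k for k, c in enumerate(coeffs) if abs(c) > 1e-9]
print(significant)   # [4, 12]
```

In this transform coefficient $k$ stands for a cosine that completes $k/2$ cycles over the interval, so indices 4 and 12 are the waves with two and six full cycles.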

It would be illuminating to have such a succinct encoding of the Covid curve—a couple of numbers that explain its comings and goings. Alas, that’s not so easy. When I poured the Covid data into the Fourier machine, this is what came out:

Figure 5

More than a dozen coefficients have significant magnitude; some are positive and some are negative; no obvious pattern leaps off the page. This spectrum, like the simpler one in Figure 4, holds all the information needed to reconstruct its input. I confirmed that fact with a quick computational experiment. But looking at the jumble of coefficients doesn’t help me to understand the structure of the Covid curve. The Fourier-transformed version is even more baffling than the original.

One lesson to be drawn from this exercise is that the Fourier transform is indeed magic: If you want to make it work, you need to master the dark arts. I am no wizard in this department; as a matter of fact, most of my encounters with Fourier analysis have ended in tears and trauma. No doubt someone with higher skills could tease more insight from the numbers than I can. But I doubt that any degree of Fourier finesse will lead to some clear and concise description of the Covid curve. Even with $200$ years of measles records, Brownlee wasn’t able to isolate a clear signal; with just a year and a half of Covid data, success is unlikely.

Yet my foray into the Fourier realm was not a complete waste of time. Applying the inverse Fourier transform to the first 13 coefficients (for wavenumbers 0 through 6) yields this set of curves:

Figure 6

It looks a mess, but the sum of these 13 sinusoidal waves yields quite a handsome, smoothed version of the Covid curve. In Figure 7 below, the pink area in the background shows the Times data, smoothed with the seven-day rolling average. The blue curve, much smoother still, is the waveform reconstructed from the 13 Fourier coefficients.

Figure 7

The reconstruction traces the outlines of all the large-scale features of the Covid curve, with serious errors only at the end points (which are always problematic in Fourier analysis). The Fourier curve also fails to reproduce the spiky triple peak atop the big surge from last winter, but I’m not sure that’s a defect.
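Keeping only the leading coefficients acts as a low-pass filter. The sketch below illustrates the idea on a made-up signal (a slow wave plus a fast ripple) rather than on the Covid data; the DCT variant, its scaling, and all names are assumptions chosen for the example.

```python
import math

def dct(samples):
    """Type-II discrete cosine transform, scaled by 2/N (assumed normalization)."""
    n = len(samples)
    return [
        (2.0 / n) * sum(x * math.cos(math.pi * k * (j + 0.5) / n)
                        for j, x in enumerate(samples))
        for k in range(n)
    ]

def inverse_dct(coeffs, n):
    """Rebuild n samples from the supplied (possibly truncated) coefficient list."""
    return [
        coeffs[0] / 2.0 + sum(c * math.cos(math.pi * k * (j + 0.5) / n)
                              for k, c in enumerate(coeffs) if k > 0)
        for j in range(n)
    ]

n = 128
slow = [math.cos(math.pi * 4 * (j + 0.5) / n) for j in range(n)]           # two cycles
ripple = [0.3 * math.cos(math.pi * 40 * (j + 0.5) / n) for j in range(n)]  # twenty cycles
signal = [a + b for a, b in zip(slow, ripple)]

coeffs = dct(signal)
rebuilt = inverse_dct(coeffs, n)         # all coefficients: recovers the signal
smoothed = inverse_dct(coeffs[:13], n)   # first 13 only: the ripple is discarded
print(max(abs(a - b) for a, b in zip(smoothed, slow)))
```

With real data the discarded coefficients are not so cleanly separated from the retained ones, which is why a truncated reconstruction only approximates the large-scale shape.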

Let’s take a closer look at that triple peak. The graph below is an expanded view of the two-month interval from 20 November 2020 through 20 January 2021. The light-colored bars in the background are raw data on new cases for each day; the dark blue line is the seven-day rolling average computed by the Times.

Figure 8

The peaks and valleys in this view are just as high and low as those in Figure 1; they look less dramatic only because the horizontal axis has been stretched ninefold. My focus is not on the peaks but on the troughs between them. (After all, there wouldn’t be three peaks if there weren’t two troughs to separate them.) Three data points marked by pink bars have case counts far lower than the surrounding days. Note the dates of those events. November 26 was Thanksgiving Day in the U.S. in 2020; December 25 is Christmas Day, and January 1 is New Year’s Day. It looks like the virus went on holiday, but of course it was actually the medical workers and public health officials who took a day off, so that many cases did not get recorded on those days.

There may be more to this story. Although the holidays show up on the chart as low points in the progress of the epidemic, they were very likely occasions of higher-than-normal contagion, because of family gatherings, religious services, revelry, and so on. (I commented on the Thanksgiving risk last fall.) Those “extra” infections would not show up in the statistics until several days later, along with the cases that went undiagnosed or unreported on the holidays themselves. Thus each dip appears deeper because it is followed by a surge.

All in all, it seems likely that the troughs creating the triple peak are a reporting anomaly, rather than a reflection of genuine changes in the viral transmission rate. Thus a curve that smooths them away may give a better account of what’s really going on in the population.

There’s another transformation—quite different from Fourier analysis—that might tell us something about the data. The time derivative of the Covid curve gives the rate of change in the infection rate—positive when the epidemic is surging, negative when it’s retreating. Because we’re working with a series of discrete values, computing the derivative is trivially easy: It’s just the series of differences between successive values.

Figure 9

The derivative of the raw data (blue) looks like a seismograph recording from a jumpy day along the San Andreas. The three big holiday anomalies, where case counts change by 100,000 per day, produce dramatic excursions. The smaller jagged waves that extend over most of the 18-month interval are probably connected with the seven-day cycle of data collection, which typically shows case counts increasing through the work week and then falling off on the weekend.

The seven-day trailing average is designed to suppress that weekly cycle, and it also smooths over some larger fluctuations. The resulting curve (red) is not only less jittery but also has much lower amplitude. (I have expanded the vertical scale by a factor of two for clarity.)

Finally, the reconstituted curve built by summing 13 Fourier components yields a derivative curve (green) whose oscillations are mere ripples, even when stretched vertically by a factor of four.

The points where the derivative curves cross the zero line—going from positive to negative or vice versa—correspond to peaks or troughs in the underlying case-count curve. Each zero crossing marks a moment when the epidemic’s trend reversed direction, when a growing daily case load began to decline, or a falling one turned around and started gaining again. The blue raw-data curve has 255 zero crossings, and the red averaged curve has 122. Even the lesser figure implies that the infection trend is reversing course every four or five days, which is not plausible; most of those sign changes must result from noise in the data.
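Counting zero crossings takes only a few lines. In this sketch the derivative is the series of successive differences, and a crossing is a change of sign between consecutive nonzero differences; skipping exact zeros is an assumption, and the sample series is invented for illustration.

```python
def differences(values):
    """Discrete derivative: each value minus its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]

def zero_crossings(deltas):
    """Count sign changes, ignoring differences that are exactly zero (an assumption)."""
    signs = [d > 0 for d in deltas if d != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

cases = [1, 3, 6, 4, 2, 5, 9, 8]    # hypothetical daily counts
deltas = differences(cases)          # [2, 3, -2, -2, 3, 4, -1]
print(zero_crossings(deltas))        # 3
```

The three crossings correspond to the peak at 6, the trough at 2, and the peak at 9 in the sample series.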

The silky smooth green curve has nine zero crossings, most of which seem to signal real changes in the course of the epidemic. I would like to understand what’s causing those events.

You catch a virus. (Sorry about that.) Some days later you infect a few others, who after a similar delay pass the gift on to still more people. This is the mechanism of exponential (or geometric) growth. With each link in the chain of transmission the number of new cases is multiplied by a factor of $R$, which is the natural growth ratio of the epidemic—the average number of cases spawned by each infected individual. Starting with a single case at time $t = 0$, the number of new infections at any later time $t$ is $R^t$. If $R$ is greater than 1, even very slightly, the number of cases increases without limit; if $R$ is less than 1, the epidemic fades away.

The average delay between when you become infected and when you infect others is known as the serial passage time, which I am going to abbreviate T_SP and take as the basic unit for measuring the duration of events in the epidemic. For Covid-19, one T_SP is probably about five days.

Exponential growth is famously unmanageable. If $R = 2$, the case count doubles with every iteration: $1, 2, 4, 8, 16, 32\dots$. It increases roughly a thousandfold after 10 T_SP, and a millionfold after 20 T_SP. The rate of increase becomes so steep that I can’t even graph it except on a logarithmic scale, where an exponential trajectory becomes a straight line.

Figure 10

What is the value of $R$ for the SARS-CoV-2 virus? No one knows for sure. The number is difficult to measure, and it varies with time and place. Another number, $R_0$, is often regarded as an intrinsic property of the virus itself, an indicator of how easily it passes from person to person. The Centers for Disease Control and Prevention (CDC) suggests that $R_0$ for SARS-CoV-2 probably lies between 2.0 and 4.0, with a best guess of 2.5. That would make it catchier than influenza but less so than measles. However, the CDC has also published a report arguing that $R_0$ is “easily misrepresented, misinterpreted, and misapplied.” I’ve certainly been confused by much of what I’ve read on the subject.

Whatever numerical value we assign to $R$, if it’s greater than 1, it cannot possibly describe the complete course of an epidemic. As $t$ increases, $R^t$ will grow at an accelerating pace, and before you know it the predicted number of cases will exceed the global human population. For $R = 2$, this absurdity arrives after about 33 T_SP, which is less than six months.
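The arithmetic behind that estimate is a one-line logarithm. The sketch below assumes a world population of 7.9 billion (a round figure not given in the text) and the five-day T_SP mentioned earlier; the function name is illustrative.

```python
import math

WORLD_POPULATION = 7.9e9   # assumption: approximate 2021 figure
DAYS_PER_TSP = 5           # the article's rough serial passage time

def generations_to_exceed(population, r):
    """Smallest whole number t for which r**t reaches or exceeds the population."""
    return math.ceil(math.log(population) / math.log(r))

t = generations_to_exceed(WORLD_POPULATION, 2)
print(t, t * DAYS_PER_TSP)   # 33 generations, 165 days
print(2 ** 10, 2 ** 20)      # 1024 and 1048576: the thousandfold and millionfold marks
```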

What we need is a mathematical model with a built-in limit to growth. As it happens, the best-known model in epidemiology features just such a mechanism. Introduced almost 100 years ago by W. O. Kermack and A. G. McKendrick of the Royal College of Physicians in Edinburgh, it is now called the SIR model because it partitions the human population into three subsets called susceptible $(\mathcal{S})$, infective $(\mathcal{I})$, and recovered $(\mathcal{R})$. Initially (before a pathogen enters the population), everyone is of type $\mathcal{S}$. Susceptibles who contract the virus become infectives, capable of transmitting the disease to other susceptibles. Then, after each infective’s illness has run its course, that person joins the recovered class. Having acquired immunity through infection, the recovereds will never be susceptible again.

(Recent descriptions of the SIR model usually say $\mathcal{R}$ stands for removed, acknowledging that recovery is not the only way an infection can end. But I don’t want to be grim today. Also note that I’m using a calligraphic font for $\mathcal{S},\mathcal{I}$, and $\mathcal{R}$ to avoid confusion between the growth rate $R$ and the recovered group $\mathcal{R}$.)

A SIR epidemic can’t keep growing indefinitely for the same reason that a forest fire can’t keep burning after all the trees are reduced to ashes. At the beginning of an epidemic, when the entire population is susceptible, the case count can grow exponentially. But growth slows later, when each infective has a harder time finding susceptibles to infect. Kermack and McKendrick made the interesting discovery that the epidemic dies out before it has reached the entire population. That is, the last infective recovers before the last susceptible is infected, leaving a residual $\mathcal{S}$ population that has never experienced the disease.

The SIR model itself has gone viral in the past few years. There are tutorials everywhere on the web, as well as scholarly articles and books. (I recommend Epidemic Modelling: An Introduction, by Daryl J. Daley and Joseph Gani. Or try Mathematical Modelling of Zombies if you’re feeling brave.) Most accounts of the SIR model, including the original by Kermack and McKendrick, are presented in terms of differential equations. I’m instead going to give a version with discrete time steps—$\Delta t$ rather than $dt$—because I find it easier to explain and because it translates line for line into computer code. In the equations that follow, $\mathcal{S}$, $\mathcal{I}$, and $\mathcal{R}$ are real numbers in the range $[0, 1]$, representing proportions of some fixed-size population.

\[\begin{align}
\Delta\mathcal{I} & = \beta \mathcal{I}\mathcal{S}\\[0.8ex]
\Delta\mathcal{R} & = \gamma \mathcal{I}\\[1.5ex]
\mathcal{S}_{t+\Delta t} & = \mathcal{S}_{t} - \Delta\mathcal{I}\\[0.8ex]
\mathcal{I}_{t+\Delta t} & = \mathcal{I}_{t} + \Delta\mathcal{I} - \Delta\mathcal{R}\\[0.8ex]
\mathcal{R}_{t+\Delta t} & = \mathcal{R}_{t} + \Delta\mathcal{R}\\[0.8ex]
\end{align}\]

The first equation, with $\Delta\mathcal{I}$ on the left hand side, describes the actual contagion process—the recruitment of new infectives from the susceptible population. The number of new cases is proportional to the product of $\mathcal{I}$ and $\mathcal{S}$, since the only way to propagate the disease is to bring together someone who already has it with someone who can catch it. The constant of proportionality, $\beta$, is a basic parameter of the model. It measures how often (per T_SP) an infective person encounters others closely enough to communicate the virus.

The second equation, for $\Delta\mathcal{R}$, similarly describes recovery. For epidemiological purposes, you don’t have to be feeling tiptop again to be done with the disease; recovery is defined as the moment when you are no longer capable of infecting other people. The model takes a simple approach to this idea, withdrawing a fixed fraction of the infectives in every time step. The fraction is given by the parameter $\gamma$.

After the first two equations calculate the number of people who are changing their status in a given time step, the last three equations update the population segments accordingly. The susceptibles lose $\Delta\mathcal{I}$ members; the infectives gain $\Delta\mathcal{I}$ and lose $\Delta\mathcal{R}$; the recovereds gain $\Delta\mathcal{R}$. The total population $\mathcal{S} + \mathcal{I} + \mathcal{R}$ remains constant throughout.
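Here is a minimal Python sketch of the five update equations. One detail is an assumption on my part: each increment is multiplied by the step size `dt`, so that $\beta$ and $\gamma$ remain rates per T_SP when the step is a fraction of a T_SP. The equations as written omit that factor. The default parameter values are the ones quoted for the run described in the text that follows; the function name and return format are illustrative.

```python
def run_sir(beta=0.6, gamma=0.2, i0=1e-6, dt=0.1, t_max=100.0):
    """Iterate the discrete SIR equations; returns a list of (S, I, R) tuples."""
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(round(t_max / dt)):
        delta_i = beta * i * s * dt    # new infections (dt scaling is an assumption)
        delta_r = gamma * i * dt       # new recoveries
        s = s - delta_i
        i = i + delta_i - delta_r
        r = r + delta_r
        history.append((s, i, r))
    return history

history = run_sir()
final_s, final_i, final_r = history[-1]
peak_i = max(i for _, i, _ in history)
print(f"peak infectives {peak_i:.3f}, susceptibles remaining {final_s:.3f}")
```

Because every increment removed from one class is added to another, the three proportions always sum to 1, apart from floating-point rounding.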

In this version of the SIR model, the ratio $\rho = \beta / \gamma$ determines a natural growth rate, closely analogous to $R_0$. Higher $\beta$ means faster recruitment of infectives; lower $\gamma$ means they remain infective longer. Either of those adjustments increases the growth rate $\rho$, although the rate also depends on $\mathcal{S}$ and $\mathcal{I}$.

(In the recent literature, the ratio $\beta / \gamma$ is commonly presented not just as analogous to $R_0$ but as a definition of $R_0$. I resist this practice because $R_0$ has too many definitions already. In adopting the symbol $\rho$ I am following the precedent of David G. Kendall in a 1956 paper.)

Here’s what happens when you put the model in motion. For this run I set $\beta = 0.6$ and $\gamma = 0.2$, which implies that $\rho = 3.0$. Another number that needs to be specified is the initial proportion of infectives; I chose $10^{-6}$, or in other words one in a million. The model ran for $100$ T_SP, with a time step of $\Delta t = 0.1$ T_SP; thus there were $1{,}000$ iterations overall.

Figure 11

Let me call your attention to a few features of this graph. At the outset, nothing seems to happen for weeks and weeks, and then all of a sudden a huge blue wave rises up out of the calm waters. Starting from one case in a population of a million, it takes $18$ T_SP to reach one case in a thousand, but just $12$ more T_SP to reach one in $10$.

Note that the population of infectives reaches a peak near where the susceptible and recovered curves cross—that is, where $\mathcal{S} = \mathcal{R}$. This relationship holds true over a wide range of parameter values. That’s not surprising, because the whole epidemic process acts as a mechanism for converting susceptibles into recovereds, via a brief transit through the infective stage. But, as Kermack and McKendrick predicted, the conversion doesn’t quite go to completion. At the end, about $6$ percent of the population remains in the susceptible category, and there are no infectives left to convert them. This is the condition called herd immunity, where the population of susceptibles is so diluted that most infectives recover before they can find someone to infect. It’s the end of the epidemic, though it comes only after $90+$ percent of the people have gotten sick. (That’s not what I would call a victory over the virus.)
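The 6 percent residue can be cross-checked without running the simulation. For the continuous-time form of the SIR model, a standard result (the final-size relation) says that when an epidemic starts with almost everyone susceptible, the fraction $s_\infty$ left untouched satisfies $s_\infty = e^{-\rho(1 - s_\infty)}$. The sketch below solves that equation by simple fixed-point iteration; the function name, starting guess, and iteration count are arbitrary choices, and the discrete model in the text need not agree with the formula exactly.

```python
import math

def final_susceptible_fraction(rho, iterations=500):
    """Solve s = exp(-rho * (1 - s)) for the root below 1 by repeated substitution."""
    s = 0.5                      # arbitrary starting guess between 0 and 1
    for _ in range(iterations):
        s = math.exp(-rho * (1.0 - s))
    return s

print(f"{final_susceptible_fraction(3.0):.4f}")   # about 0.06 for rho = 3
```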

The $\mathcal{I}$ class in the SIR model can be taken as a proxy for the tally of new cases tracked in the Times data. The two variables are not quite the same—infectives remain in the class $\mathcal{I}$ until they recover, whereas new cases are counted only on the day they are reported—but they are very similar and roughly proportional to one another. And that brings me to the main point I want to make about the SIR model: In Figure 11 the blue curve for infectives looks nothing like the corresponding new-case tally in Figure 1. In the SIR model, the number of infectives starts near zero, rises steeply to a peak, and thereafter tapers gradually back to zero, never to rise again. It’s a one-hump camel. The roller coaster Covid curve is utterly different.

The detailed geometry of the $\mathcal{I}$ curve depends on the values assigned to the parameters $\beta$ and $\gamma$. Changing those variables can make the curve longer or shorter, taller or flatter. But no choice of parameters will give the curve multiple ups and downs. There are no oscillatory solutions to these equations.

The SIR model strikes me as so plausible that it—or some variant of it—really must be a correct description of the natural course of an epidemic. But that doesn’t mean it can explain what’s going on right now with Covid-19. A key element of the model is saturation: the spread of the disease stalls when there are too few susceptibles left to catch it. That can’t be what caused the steep downturn in Covid incidence that began in January of this year, or the earlier slumps that began in April and July of 2020. We were nowhere near saturation during any of those events, and we still aren’t now. (For the moment I’m ignoring the effects of vaccination. I’ll take up that subject below.)

In Figure 11 there comes a dramatic triple point where each of the three categories constitutes about a third of the total population. If we projected that situation onto the U.S., we would have (in very round numbers) $100$ million active infections, another $100$ million people who have recovered from an earlier bout with the virus, and a third $100$ million who have so far escaped (but most of whom will catch it in the coming weeks). That’s orders of magnitude beyond anything seen so far. The cumulative case count, which combines the $\mathcal{I}$ and $\mathcal{R}$ categories, is approaching $37$ million, or $11$ percent of the U.S. population. Even if the true case count is double the official tally, we are still far short of the model’s crucial triple point. Judged from the perspective of the SIR model, we are still in the early stages of the epidemic, where case counts are too low to see in the graph. (If you rescale Figure 1 so that the y axis spans the full U.S. population of 330 million, you get the flatline graph of Figure 12.)

Figure 12

If we are still on the early part of the curve, in the regime of rampant exponential growth, it’s easy to understand the surging accelerations we’ve seen in the worst moments of the epidemic. The hard part is explaining the repeated slowdowns in viral transmission that punctuate the Covid curve. In the SIR model, the turnaround comes when the virus begins to run out of victims, but that’s a one-time phenomenon, and we haven’t gotten there yet. What can account for the deep valleys in the Times Covid curve?

Among the ideas that immediately come to mind, one strong contender is feedback. We all have access to real-time information on the status of the epidemic. It comes from governmental agencies, from the news media, from idiots on Facebook, and from the personal communications of family, friends, and neighbors. Most of us, I think, respond appropriately to those messages, modulating our anti-contagion precautions according to the perceived severity of the threat. When it’s scary outside, we hunker down and mask up. When the risk abates, it’s party time again! I can easily imagine a scenario where such on-again, off-again measures would trigger oscillations in the incidence of the disease.

If this hypothesis turns out to be true, it is cause for both hope and frustration. Hope because the interludes of viral retreat suggest that our tools for fighting the epidemic must be reasonably effective. Frustration because the rebounds indicate we’re not deploying those tools as well as we could. Look again at the Covid curve of Figure 1, specifically at the steep downturn following the winter peak. In early February, the new case rate was dropping by 30,000 per week. Over the first three weeks of that month, the rate was cut in half. Whatever we were doing then, it was working brilliantly. If we had just continued on the same trajectory, the case count would have hit zero in early March. Instead, the downward slope flattened, and then turned upward again.

We had another chance in June. All through April and May, new cases had been falling steadily, from 65,000 to 17,000, a pace of about –800 cases a day. If we’d been able to sustain that rate for just three more weeks, we’d have crossed the finish line in late June. But again the trend reversed course, and by now we’re back up well above 100,000 cases a day.

Are these pointless ups and downs truly caused by feedback effects? I don’t know. I am particularly unsure about the “if only” part of the story—the idea that if only we’d kept the clamps on for just a few more weeks, the virus would have been eradicated, or nearly so. But it’s an idea to keep in mind.

Perhaps we could learn more by creating a feedback loop in the SIR model, and looking for oscillatory dynamics. Negative feedback is anything that acts to slow the infection rate when that rate is high, and to boost it when it’s low. Such a contrarian mechanism could be added to the model in several ways. Perhaps the simplest is a lockdown threshold: Whenever the number of infectives rises above some fixed limit, everyone goes into isolation; when the $\mathcal{I}$ level falls below the threshold again, all cautions and restrictions are lifted. It’s an all-or-nothing rule, which makes it simple to implement. We need a constant to represent the threshold level, and a new factor (which I am naming $\varphi$, for fear) in the equation for $\Delta \mathcal{I}$:

\[\Delta\mathcal{I} = \beta \varphi \mathcal{I} \mathcal{S}\]

The $\varphi$ factor is 1 whenever $\mathcal{I}$ is below the threshold, and $0$ when it rises above. The effect is to shut down all new infections as soon as the threshold is reached, and start them up again when the rate falls.

Does this scheme produce oscillations in the $\mathcal{I}$ curve? Strictly speaking, the answer is yes, but you’d never guess it by looking at the graph.

Figure 13

The feedback loop serves as a control system, like a thermostat that switches the furnace off and on to maintain a set temperature. In this case, the feedback loop holds the infective population steady at the threshold level, which is set at 0.05. On close examination, it turns out that $\mathcal{I}$ is oscillating around the threshold level, but with such a short period and tiny amplitude that the waves are invisible. The value bounces back and forth between 0.049 and 0.051.

To get macroscopic oscillations, we need more than feedback. The SIR output shown below comes from a model that combines feedback with a delay between measuring the state of the epidemic and acting on that information. Introducing such a delay is not the only way to make the model swing, but it’s certainly a plausible one. As a matter of fact, a model without any delay, in which a society responds instantly to every tiny twitch in the case count, seems wholly unrealistic.

Figure 14

The model of Figure 14 adopts the same parameters, $\beta = 0.6$ and $\gamma = 0.2$, as the version of Figure 13, as well as the same lockdown threshold $(0.05)$. It differs only in the timing of events. If the infective count climbs above the threshold at time $t$, control measures do not take effect until $t + 3$; in the meantime, infections continue to spread through the population. The delay and overshoot on the way up are matched by a delay and undershoot at the other end of the cycle, when lockdown continues for three T_SP after the threshold is crossed on the way down.

Given these specific parameters and settings, the model produces four cycles of diminishing amplitude and increasing wavelength. (No further cycles are possible because $\mathcal{I}$ remains below the threshold.) Admittedly, those four spiky sawtooth peaks don’t look much like the humps in the Covid curve. If we’re going to seriously consider the feedback hypothesis, we’ll need stronger evidence than this. But the model is very crude; it could be refined and improved.
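The threshold rule and the delayed version can share one Python sketch, with the delay as a parameter (zero for the thermostat-like behavior, three T_SP for the overshooting version). As assumptions of mine, increments are scaled by `dt`, the lockdown decision looks at the infective level recorded `delay` T_SP earlier, and the remaining settings ($10^{-6}$ initial infectives, 100 T_SP, $\Delta t = 0.1$) are carried over from the earlier run. All names are illustrative.

```python
def run_feedback_sir(beta=0.6, gamma=0.2, threshold=0.05, delay=0.0,
                     i0=1e-6, dt=0.1, t_max=100.0):
    """SIR with an all-or-nothing lockdown driven by a delayed reading of I.

    Returns the series of I values and the number of times a lockdown began.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    lag = round(delay / dt)            # delay expressed in time steps
    history = [i]
    onsets, locked = 0, False
    for _ in range(round(t_max / dt)):
        observed = history[max(0, len(history) - 1 - lag)]   # stale measurement
        now_locked = observed > threshold
        if now_locked and not locked:
            onsets += 1
        locked = now_locked
        phi = 0.0 if locked else 1.0   # the fear factor
        delta_i = beta * phi * i * s * dt
        delta_r = gamma * i * dt
        s -= delta_i
        i += delta_i - delta_r
        r += delta_r
        history.append(i)
    return history, onsets

no_delay, _ = run_feedback_sir(delay=0.0)
delayed, onsets = run_feedback_sir(delay=3.0)
print(f"max I without delay {max(no_delay):.3f}, with delay {max(delayed):.3f}")
print("lockdowns begun with delay:", onsets)
```

With no delay the infective level should hug the threshold; with the delay it overshoots on the way up and undershoots on the way down, which is what produces visible cycles.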

The fact is, I really want to believe that feedback could be a major component in the oscillatory dynamics of Covid-19. It would be comforting to know that our measures to combat the epidemic have had a powerful effect, and that we therefore have some degree of control over our fate. But I’m having a hard time keeping the faith. For one thing, I would note that our countermeasures have not always been on target. In the epidemic’s first wave, when the characteristics of the virus were largely unknown, the use of facemasks was discouraged (except by medical personnel), and there was a lot of emphasis on hand washing, gloves, and sanitizing surfaces. Not to mention drinking bleach. Those measures were probably not very effective in stopping the virus, but the wave receded anyway.

Another source of doubt is that wavelike fluctuations are not unique to Covid-19. On the contrary, they seem to be a common characteristic of epidemics across many times and places. The 1918–1919 influenza epidemic had at least three waves. Figure 15, which Wikipedia attributes to the CDC, shows influenza deaths per $1{,}000$ people in the United Kingdom. Those humps look awfully familiar. They are similar enough to the Covid waves that it seems natural to look for a common cause. But if both patterns are a product of feedback effects, we have to suppose that public health measures undertaken a century ago, in the middle of a world war, worked about as well as those today. (I’d like to think there’s been some progress.)

Figure 15

Update: Alina Glaubitz and Feng Fu of Dartmouth have applied a game-theoretical approach to generating oscillatory dynamics in a SIR model. Their work was published last November but I have only just learned about it from an article by Lina Sorg in SIAM News.

One detail of the SIR model troubles me. As formulated by Kermack and McKendrick, the model treats infection and recovery as symmetrical, mirror-image processes, both of them described by exponential functions. The exponential rule for infections makes biological sense. You can only get the virus via transmission from someone who already has it, so the number of new infections is proportional to the number of existing infections. But recovery is different; it’s not contagious. Although the duration of the illness may vary to some extent, there’s no reason to suppose it would depend on the number of other people who are sick at the same time.

In the model, a fixed fraction of the infectives, $\gamma \mathcal{I}$, recover at every time step. For $\gamma = 0.2$, this rule generates the exponential distribution seen in Figure 16. Imagine that some large group of people have all been infected at the same time, $t = 0$. At $t = 1$, a fifth of the infectives recover, leaving $80$ percent of the cohort still infective. At $t = 2$, a fifth of the remaining $80$ percent are removed from the infective class, leaving $64$ percent. And so it goes. Even after $10$ T_SP, more than $10$ percent of the original group remain infectious.

Figure 16
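The cohort arithmetic is easy to check. This sketch computes the fraction still infective after a given number of whole T_SP steps, and also the point at which half the cohort has recovered; the function names are illustrative, and treating the median as a continuous quantity is a simplification.

```python
import math

def fraction_still_infective(gamma, steps):
    """Fraction of a cohort remaining infective after the given number of steps."""
    return (1.0 - gamma) ** steps

def median_duration(gamma):
    """Time at which half the cohort has recovered, treating time as continuous."""
    return math.log(0.5) / math.log(1.0 - gamma)

for t in (1, 2, 10):
    print(t, round(fraction_still_infective(0.2, t), 3))   # 0.8, 0.64, 0.107
print(round(median_duration(0.2), 2))                       # roughly 3.1 T_SP
```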

The long tail of this distribution corresponds to illnesses that persist for many weeks. Such cases exist, but they are rare. According to the CDC, most Covid patients have no detectable “replication-competent virus” $10$ days after the onset of symptoms. Even in the most severe cases, with immunocompromised patients, $20$ days of infectivity seems to be the outer limit. These observations suggest a different strategy for modeling recovery. Rather than assuming that a fixed fraction of patients recover at every time step, we might get a better approximation to the truth by assuming that all patients recover (or at least become noninfective) after a fixed duration of illness.

(I don’t know who was the first to notice that the exponential distribution of recovery times is too broad, but I know it wasn’t me. In 2001 Alun L. Lloyd wrote on this theme in the context of measles epidemics: Theoretical Population Biology Vol. 60, No. 1, pp. 59–71.)

Modifying the model for a fixed period of infectivity is not difficult. We can keep track of the infectives with a data structure called a queue. Each batch of newly recruited infectives goes into the tail of the queue, then advances one place with each time step. After $m$ steps (where $m$ is the duration of the illness), the batch reaches the head of the queue and joins the company of the recovered. Here is what happens when $m = 3$ T_SP:

Figure 17
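For readers who want to experiment, the queue mechanism might be sketched as follows. This is an illustrative Python version, not the Julia code used for the figures. The update rule (new infections equal to $\beta$ times the product of the susceptible and infective fractions), the initial infective level, and all the names are assumptions.

```python
from collections import deque

def sir_fixed_duration(beta=0.6, m=3, i0=0.001, steps=300):
    """Discrete-time SIR model in which every batch of infectives
    recovers exactly m time steps after it was infected."""
    # The queue holds one batch per time step of illness.
    # Index 0 is the head (about to recover); the right end is the tail.
    queue = deque([0.0] * (m - 1) + [i0])
    s, r = 1.0 - i0, 0.0
    history = []
    for _ in range(steps):
        i = sum(queue)                      # everyone in the queue is infective
        history.append((s, i, r))
        new_infections = min(beta * s * i, s)
        r += queue.popleft()                # batch at the head recovers
        queue.append(new_infections)        # fresh batch joins at the tail
        s -= new_infections
    return history

for m in (3, 2):
    s_final, _, _ = sir_fixed_duration(m=m)[-1]
    print(f"m = {m}: final susceptible fraction {s_final:.2f}")
```

A `deque` is convenient here because removing the head and appending to the tail are both constant-time operations, and the queue length stays fixed at $m$.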

I chose $3$ T_SP for this example because it is close to the median duration in the exponential distribution in Figure 11, and therefore ought to resemble the earlier result. And so it does, approximately. As in Figure 11, the peak in the infectives curve lies near where the susceptible and recovered curves cross. But the peak never grows quite as tall; and, for obvious reasons, it decays much faster. As a result, the epidemic ends with many more susceptibles untouched by the disease—more than 25 percent.

A disease duration of $3$ T_SP, or about $15$ days, is still well over the CDC estimates of the typical length. Shortening the queue to $2$ T_SP, or about $10$ days, transforms the outcome even more dramatically. Now the susceptible and recovered curves never cross, and almost $70$ percent of the susceptible population remains uninfected when the epidemic peters out.

Figure 18

Figure 18 comes a little closer to describing the current Covid situation in the U.S. than the other models considered above. It’s not that the curves’ shape resembles that of the data, but the overall magnitude or intensity of the epidemic is closer to observed levels. Of the models presented so far, this is the first that reaches a natural limit without burning through most of the population. Maybe we’re on to something.

On the other hand, there are a couple of reasons for caution. First, with these parameters, the initial growth of the epidemic is extremely slow; it takes $40$ or $50$ T_SP before infections have a noticeable effect on the population. That’s well over six months. Second, we’re still dealing with a one-hump camel. Even though most of the population is untouched, the epidemic has run its course, and there will not be a second wave. Something important is still missing.

Before leaving this topic behind, I want to point out that the finite time span of a viral infection gives us a special point of leverage for controlling the spread of the disease. The viruses that proliferate in your body must find a new host within a week or two, or else they face extinction. Therefore, if we could completely isolate every individual in the country for just two or three weeks, the epidemic would be over. Admittedly, putting each and every one of us into solitary confinement is not feasible (or morally acceptable), but we could strive to come as close as possible, strongly discouraging all forms of person-to-person contact. Testing, tracing, and quarantines would deal with straggler cases. My point is that a very strict but brief lockdown could be both more effective and less disruptive than a loose one that goes on for months. Where other strategies aim to flatten the curve, this one attempts to break the chain.

When Covid emerged late in 2019, it was soon labeled a pandemic, signifying that it’s bigger than a mere epidemic, that it’s everywhere. But it’s not everywhere at once. Flareups have skipped around from region to region and country to country. Perhaps we should view the pandemic not as a single global event but as an ensemble of more localized outbreaks.

Suppose small clusters of infections erupt at random times, then run their course and subside. By chance, several geographically isolated clusters might be active over the same range of dates and add up to a big bump in the national case tally. Random fluctuations could also produce interludes of widespread calm, which would cause a dip in the national curve.

We can test this notion with a simple computational experiment, modeling a population divided into $N$ clusters or communities. For each cluster a SIR model generates a curve giving the proportion of infectives as a function of time. The initiation time for each of these mini-epidemics is chosen randomly and independently. Summing the $N$ curves gives the total case count for the country as a whole, again as a function of time.

Before scrolling down to look at the graphs generated by this process, you might make a guess about how the experiment will turn out. In particular, how will the shape of the national curve change as the number of local clusters increases?

If there’s just one cluster, then the national curve is obviously identical to the trajectory of the disease in that one place. With two clusters, there’s a good chance they will not overlap much, and so the national curve will probably have two humps, with a deep valley between them. With $N = 3$ or $4$, overlap becomes more of an issue, but the sum curve still seems likely to have $N$ humps, perhaps with shallower depressions separating them. Before I saw the results, I made the following guess about the behavior of the sum as $N$ continues increasing: The sum curve will always have approximately $N$ peaks, I thought, but the height difference between peaks and troughs should get steadily smaller. Thus at large $N$ the sum curve would have many tiny ripples, small enough that the overall curve would appear to be one broad, flat-topped hummock.

So much for my intuition. Here are two examples of sum curves generated by clusters of $N$ mini-epidemics, one curve for $N = 6$ and one for $N = 50$. The histories for individual clusters are traced by fine red lines; the sums are blue. All the curves have been scaled so that the highest peak of the sum curve touches $1.0$.

Figure 19

My guess about the “broad, flat-topped hummock” with many shallow ripples was altogether wrong. The number of peaks does not increase in proportion to $N$. As a matter of fact, both of the sum curves in Figure 19 have four distinct peaks (possibly five in the example at right), even though the number of component curves contributing to the sum is only six in one case and is $50$ in the other.

Technical details: Cluster initiation time is chosen uniformly at random between $0$ and $80$. $\beta$ is a random normal variable with mean $0.6$ and standard deviation $0.1$, allowing some variation in the intensity and duration of the individual sub-epidemics. The initial infective level is $0.001$.
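The experiment can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the code behind Figure 19: start times are drawn as whole time steps, $\gamma$ is held at $0.2$, the sum is not rescaled, and the peak counter is a naive local-maximum test that makes no attempt to ignore small ripples. All function names are hypothetical.

```python
import random

def sir_infectives(beta, gamma=0.2, i0=0.001, steps=100):
    """Infective fraction over time for one cluster (simple discrete SIR)."""
    s, i = 1.0 - i0, i0
    curve = []
    for _ in range(steps):
        curve.append(i)
        new_infections = min(beta * s * i, s)
        s -= new_infections
        i += new_infections - gamma * i
    return curve

def national_curve(n_clusters, steps=100, rng=None):
    """Sum of n_clusters mini-epidemics with random start times."""
    rng = rng or random.Random()
    total = [0.0] * steps
    for _ in range(n_clusters):
        start = rng.randrange(0, 81)             # assumed: integer start in 0..80
        beta = max(rng.gauss(0.6, 0.1), 0.0)     # mean 0.6, standard deviation 0.1
        local = sir_infectives(beta, steps=steps)
        for t in range(start, steps):
            total[t] += local[t - start]
    return total

def count_peaks(curve):
    """Count interior points higher than the left neighbor and
    at least as high as the right neighbor."""
    return sum(1 for t in range(1, len(curve) - 1)
               if curve[t - 1] < curve[t] >= curve[t + 1])

rng = random.Random(2021)
for n in (6, 50):
    print(n, count_peaks(national_curve(n, rng=rng)))
```

Because the peak test is so simple, the counts it reports may differ from what the eye picks out in a plot; smoothing the sum or requiring a minimum prominence would be reasonable refinements.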

I have to confess that the two examples in Figure 19 were not chosen at random. I picked them because they looked good, and because they illustrated a point I wanted to make—namely that the number of peaks in the sum curve remains nearly constant, regardless of the value of $N$. Figure 20 assembles a more representative sample, selected without deliberate bias but again showing that the number of peaks is not sensitive to $N$, although the valleys separating those peaks get shallower as $N$ grows.

Figure 20

The takeaway message from these simulations seems to be that almost any collection of randomly timed mini-epidemics will combine to form a macro-epidemic with just a few waves. The number of peaks is not always four, but it’s seldom very far from that number. The bar graphs in Figure 21 offer some quantitative evidence on this point. They record the distribution of the number of peaks in the sum curve for values of $N$ between $4$ and $100$. Each set of bars represents $1{,}000$ repetitions of the process. In all cases the most common number of peaks is three, four, or five.

The question is: Why $4 \pm 1$? Why do we keep seeing those particular numbers? And if $N$, the number of components being summed, has little influence on this property of the sum curve, then what does govern it? I puzzled over these questions for some time before a helpful analogy occurred to me.

Suppose you have a bunch of sine waves, all at the same frequency $f$ but with randomly assigned phases; that is, the waves all have the same shape, but they are shifted left or right along the $x$ axis by random amounts. What would the sum of those waves look like? The answer is: another sine wave of frequency $f$. This is a little fact that’s been known for ages (at least since Euler) and is not hard to prove, but it still comes as a bit of a shock every time I run into it. I believe the same kind of argument can explain the behavior of a sum of SIR curves, even though those curves are not sinusoidal. The component SIR curves have a period of $20$ to $30$ T_SP. In a model run that spans $100$ T_SP, these curves can be considered to have a frequency of between three and five cycles per epidemic period. Their sum should be a wave with the same frequency—something like the Covid curve, with its four (or four and a half) prominent humps. In support of this thesis, when I let the model run to $200$ T_SP, I get a sum curve with seven or eight peaks.
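The sine-wave fact can be written out in one line. The notation here is mine: $\phi_k$ is the phase of the $k$th wave, and $A$ and $\Phi$ are the amplitude and phase of the sum.

\[\sum_{k=1}^{N} \sin(2\pi f x + \phi_k) = \operatorname{Im}\left(e^{2\pi i f x} \sum_{k=1}^{N} e^{i\phi_k}\right) = A \sin(2\pi f x + \Phi), \qquad \textrm{where} \quad A e^{i\Phi} = \sum_{k=1}^{N} e^{i\phi_k}.\]

Each wave is the imaginary part of a rotating complex number, and the phases contribute only a fixed complex factor, so the frequency survives the summation unchanged. The amplitude $A$ can lie anywhere between $0$ and $N$; for independent, uniformly random phases its root-mean-square value is $\sqrt{N}$.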

I am intrigued by the idea that an epidemic might arrive in cyclic waves not because of anything special about viral or human behavior but because of a mathematical process akin to wave interference. It’s such a cute idea, dressing up an obscure bit of counterintuitive mathematics and bringing it to bear on a matter of great importance to all of us. And yet, alas, a closer look at the Covid data suggests that nature doesn’t share my fondness for summing waves with random phases.

Figure 22, again based on data extracted from the Times archive, plots $49$ curves, representing the time course of case counts in the Lower $48$ states and the District of Columbia. I have separated them by region, and in each group I’ve labeled the trace with the highest peak. We already know that these curves yield a sum with four tall peaks; that’s where this whole investigation began. But the $49$ curves do not support the notion that those peaks might be produced by summing randomly timed mini-epidemics. The oscillations in the $49$ curves are not randomly timed; there are strong correlations between them. And many of the curves have multiple humps, which isn’t possible if each mini-epidemic is supposed to act like a SIR model that runs to completion.

Figure 22

Although these curves spoil a hypothesis I had found alluring, they also reveal some interesting facts about the Covid epidemic. I knew that the first wave was concentrated in New York City and surrounding areas, but I had not realized how much the second wave, in the summer of 2020, was confined to the country’s southern flank, from Florida all the way to California. The summer wave this year is also most intense in Florida and along the Gulf Coast. Coincidence? When I showed the graphs to a friend, she responded: “Air conditioning.”

Searching for the key to Covid, I’ve tried out three slightly whimsical notions: the possibility of a periodic signal, like the sunspot cycle, bringing us waves of infection on a regular schedule; feedback loops producing yo-yo dynamics in the case count; and randomly timed mini-epidemics that add up to a predictable, slow variation in the infection rate. In retrospect they still seem like ideas worth looking into, but none of them does a convincing job of explaining the data.

In my mind the big questions remain unanswered. In November of 2020 the daily tally of new Covid cases was above $100{,}000$ and rising at a fearful rate. Three months later the infection rate was falling just as steeply. What changed between those dates? What action or circumstance or accident of fate blunted the momentum of the onrushing epidemic and forced it into retreat? And now, just a few months after the case count bottomed out, we are again above $100{,}000$ cases per day and still climbing. What has changed again to bring the epidemic roaring back?

There are a couple of obvious answers to these questions. As a matter of fact, those answers are sitting in the back of the room, frantically waving their hands, begging me to call on them. First is the vaccination campaign, which has now reached half the U.S. population. The incredibly swift development, manufacture, and distribution of those vaccines is a wonder. In the coming months and years they are what will save us, if anything can. But it’s not so clear that vaccination is what stopped the big wave last winter. The sharp downturn in infection rates began in the first week of January, when vaccination was just getting under way in the U.S. On January 9 (the date when the decline began) only about $2$ percent of the population had received even one dose. The vaccination effort reached a peak in April, when more than three million doses a day were being administered. By then, however, the dropoff in case numbers had stopped and reversed. If you want to argue that the vaccine ended the big winter surge, it’s hard to align causation with chronology.

On the other hand, the level of vaccination that has now been achieved should exert a powerful damping effect on any future waves. Removing half the people from the susceptible list may not be enough to reach herd immunity and eliminate the virus from the population, but it ought to be enough to turn a growing epidemic into a wilting one.

Figure 23

The SIR model of Figure 23 has the same parameters as the model of Figure 3 ($\beta = 0.6$, $\gamma = 0.2$, implying $\rho = 3.0$), but $50$ percent of the people are vaccinated at the start of the simulation. With this diluted population of susceptibles, essentially nothing happens for almost a year. The epidemic is dormant, if not quite defunct.

That’s the world we should be living in right now, according to the SIR model. Instead, today’s new case count is $141{,}365$; $81{,}556$ people are hospitalized with Covid infections; and 704 people have died. What gives? How can this be happening?

At this point I must acknowledge the other hand waving in the back of the room: the Delta variant, otherwise known as B.1.617.2. Half a dozen mutations in the viral spike protein, which binds to a human cell-surface receptor, have apparently made this new strain at least twice as contagious as the original one.

Figure 24

In Figure 24 contagiousness is doubled by increasing $\rho$ from $3.0$ to $6.0$. That boost brings the epidemic back to life, although there is still quite a long delay before the virus becomes widespread in the unvaccinated half of the population.
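The two vaccinated scenarios can be compared with a short sketch. It is not the code behind Figures 23 and 24. I assume that vaccinated people are simply removed from the susceptible pool at the start, that $\rho$ is doubled by doubling $\beta$ while $\gamma$ stays at $0.2$, and that the initial infective level is $0.001$. The timing of the outbreak depends strongly on that last value, so the step counts printed here should not be expected to match the figures.

```python
def sir_vaccinated(rho, gamma=0.2, vaccinated=0.5, i0=0.001, steps=400):
    """Infective fraction over time when a share of the population
    starts out immune (treated here as already removed)."""
    beta = rho * gamma
    s, i = 1.0 - vaccinated - i0, i0
    infectives = []
    for _ in range(steps):
        infectives.append(i)
        new_infections = min(beta * s * i, s)
        s -= new_infections
        i += new_infections - gamma * i
    return infectives

def peak_step(curve):
    """Index of the time step at which the curve is highest."""
    return max(range(len(curve)), key=curve.__getitem__)

for rho in (3.0, 6.0):
    curve = sir_vaccinated(rho)
    print(f"rho = {rho}: peak of {max(curve):.3f} at step {peak_step(curve)}")
```

The early growth factor per time step is roughly $1 + \beta \mathcal{S} - \gamma$, which is about $1.1$ for $\rho = 3.0$ and about $1.4$ for $\rho = 6.0$ when half the population is immune. That difference is why doubling $\rho$ revives the epidemic.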

The situation may well be worse than the model suggests. All the models I have reported on here pretend that the human population is homogeneous, or thoroughly mixed. If an infected person is about to spread the virus, everyone in the country has the same probability of being the recipient. This assumption greatly simplifies the construction of the model, but of course it’s far from the truth. In daily life you most often cross paths with people like yourself: people from your own neighborhood, your own age group, your own workplace or school. Those frequent contacts are also people who share your vaccination status. If you are unvaccinated, you are not only more vulnerable to the virus but also more likely to meet people who carry it. This somewhat subtle birds-of-a-feather effect is what allows us to have “a pandemic of the unvaccinated.”

Recent reports have brought still more unsettling and unwelcome news, with evidence that even fully vaccinated people may sometimes spread the virus. I’m waiting for confirmation of that before I panic. (But I’m waiting with my face mask on.)

Having demonstrated that I understand nothing about the history of the epidemic in the U.S.—why it went up and down and up and down and up and down and up and down—I can hardly expect to understand the present upward trend. About the future I have no clue at all. Will this new wave tower over all the previous ones, or is it Covid’s last gasp? I can believe anything.

But let us not despair. This is not the zombie apocalypse. The survival of humanity is not in question. It’s been a difficult ordeal for the past 18 months, and it’s not over yet, but we can get through this. Perhaps, at some point in the not-too-distant future, we’ll even understand what’s going on.

Update 2021-09-01

Today The New York Times has published two articles raising questions similar to those asked here. David Leonhardt and Ashley Wu write a “morning newsletter” titled “Has Delta Peaked?” Apoorva Mandavilli, Benjamin Mueller, and Shalini Venugopal Bhagat ask “When Will the Delta Surge End?” I think it’s fair to say that the questions in the headlines are not answered in the articles, but that’s not a complaint. I certainly haven’t answered them either.

I’m going to take this opportunity to update two of the figures to include data through the end of August.

Figure 1r

In Figure 1r the surge in case numbers that was just beginning back in late July has become a formidable sugarloaf peak. The open question is what happens next. Leonhardt and Wu make the optimistic observation that “The number of new daily U.S. cases has risen less over the past week than at any point since June.” In other words, we can celebrate a negative second derivative: The number of cases is still high and is still growing, but it’s growing slower than it was. And they cite the periodicity observed in past U.S. peaks and in those elsewhere as a further reason for hope that we may be near the turning point.

Figure 22r

Figure 22r tracks where the new cases are coming from. As in earlier peaks, California, Texas, and Florida stand out.

Data and Source Code

The New York Times data archive for Covid-19 cases and deaths in the United States is available in this GitHub repository. The version I used in preparing this article, cloned on 21 July 2021, is identified as “commit c3ab8c1beba1f4728d284c7b1e58d7074254aff8”. You should be able to access the identical set of files through this link.

Source code for the SIR models and for generating the illustrations in this article is also available on GitHub. The code is written in the Julia programming language and organized in Pluto notebooks.

Three Months in Monte Carlo

Brian Hayes — Thu, 15 Jul 2021 19:53:11 +0000

As a kid I loved magnets. I wanted to know where the push and pull came from. Years later, when I heard about the Ising model of ferromagnetism, I became an instant fan. Here was a simple set of rules, like a game played on graph paper, that offered a glimpse of what goes on inside a magnetic material. Lots of tiny magnetic fields spontaneously line up to make one big field, like a school of fish all swimming in the same direction. I was even more enthusiastic when I learned about the Monte Carlo method, a jauntily named collection of mathematical and computational tricks that can be used to simulate an Ising system on a computer. With a dozen lines of code I could put the model in motion and explore its behavior for myself.

Over the years I’ve had several opportunities to play with Ising models and Monte Carlo methods, and I thought I had a pretty good grasp of the basic principles. But, you know, the more you know, the more you know you don’t know.

In 2019 I wrote a brief article on Glauber dynamics, a technique for analyzing the Ising model introduced by Roy J. Glauber, a Harvard physicist. In my article I presented an Ising simulation written in JavaScript, and I explained the algorithm behind it. Then, this past March, I learned that I had made a serious blunder. The program I’d offered as an illustration of Glauber dynamics actually implemented a different procedure, known as the Metropolis algorithm. Oops. (The mistake was brought to my attention by a comment signed “L. Y.,” with no other identifying information. Whoever you are, L. Y., thank you!)

A few days after L. Y.’s comment appeared, I tracked down the source of my error: I had reused some old code and neglected to modify it for its new setting. I corrected the program—only one line needed changing—and I was about to publish an update when I paused for thought. Maybe I could dismiss my goof as mere carelessness, but I realized there were other aspects of the Ising model and the Monte Carlo method where my understanding was vague or superficial. For example, I was not entirely sure where to draw the line between the Glauber and Metropolis procedures. (I’m even less sure now.) I didn’t know which features of the two algorithms are most essential to their nature, or how those features affect the outcome of a simulation. I had homework to do.

Since then, Monte Carlo Ising models have consumed most of my waking hours (and some of the sleeping ones). Sifting through the literature, I’ve found sources I never looked at before, and I’ve reread some familiar works with new understanding and appreciation. I’ve written a bunch of computer programs to clarify just which details matter most. I’ve dug into the early history of the field, trying to figure out what the inventors of these techniques had in mind when they made their design choices. Three months later, there are still soft spots in my knowledge, but it’s time to tell the story as best I can.

This is a long article—nearly 6,000 words. If you can’t read it all, I recommend playing with the simulation programs. There are five of them: 1, 2, 3, 4, 5. On the other hand, if you just can’t get enough of this stuff, you might want to have a look at the source code for those programs on GitHub. The repo also includes data and scripts for the graphs in this article.

Let’s jump right in with an Ising simulation. Below this paragraph is a grid of randomly colored squares, and beneath that a control panel. Feel free to play. Press the Run button, adjust the temperature slider, and click the radio buttons to switch back and forth between the Metropolis and the Glauber algorithms. The Step button slows down the action, showing one frame at a time. Above the grid are numerical readouts labeled Magnetization and Local Correlation; I’ll explain below what those instruments are monitoring.

The model consists of 10,000 sites, arranged in a $100 \times 100$ square lattice, and colored either dark or light, indigo or mauve. In the initial condition (or after pressing the Reset button) the cells are assigned colors at random. Once the model is running, more organized patterns emerge. Adjacent cells “want” to have the same color, but thermal agitation disrupts their efforts to reach accord.

The lattice is constructed with “wraparound” boundaries: All the cells along the right edge are adjacent to those on the left side, and the top and bottom are joined in the same way. (This arrangement is also known as periodic boundary conditions. Imagine infinitely many copies of the lattice laid down like square tiles on an infinite plane.) Topologically, the structure is a torus, the surface of a doughnut. Although the area of the surface is finite, you can set off in a straight line in any direction and keep going forever, without falling off the edge of the world. Because of the wraparound boundaries, all the cells have exactly four nearest neighbors; those in the corners and along the edges are just like those in the interior and require no special treatment.
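In code, wraparound boundaries usually come down to modular arithmetic on the row and column indices. Here is a minimal Python sketch (the simulations themselves are in JavaScript; the function name and the ordering of the neighbors are my own choices):

```python
def neighbors(row, col, size=100):
    """Coordinates of the four nearest neighbors (north, east, south, west)
    of a cell on a size-by-size lattice with periodic boundaries."""
    return [
        ((row - 1) % size, col),   # north; row 0 wraps to the bottom row
        (row, (col + 1) % size),   # east; the last column wraps to column 0
        ((row + 1) % size, col),   # south
        (row, (col - 1) % size),   # west
    ]

print(neighbors(0, 0))   # a corner cell still has four neighbors
```

Python's `%` operator returns a nonnegative result for a positive modulus, so `-1 % 100` is `99` and no special case is needed at the edges. In languages where the remainder of a negative number is negative, adding `size` before taking the remainder achieves the same effect.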

When the model is running, changing the temperature can have a dramatic effect. At the upper end of the scale, the grid seethes with activity, like a boiling cauldron, and no feature survives for more than a few milliseconds. In the middle of the temperature range, large clusters of like-colored cells begin to appear, and their lifetimes are somewhat longer. When the system is cooled still further, the clusters evolve into blobby islands and isthmuses, coves and straits, all of them bounded by strangely writhing coastlines. Often, the land masses eventually erode away, or else the seas evaporate, leaving a featureless monochromatic expanse. In other cases broad stripes span the width or height of the array.

Whereas nudging the temperature control utterly transforms the appearance of the grid, the effect of switching between the two algorithms is subtler.

At high temperature (5.0, say), both programs exhibit frenetic activity, but the turmoil in Metropolis mode is more intense.
At temperatures near 3.0, I perceive something curious in the Metropolis program: Blobs of color seem to migrate across the grid. If I stare at the screen for a while, I see dense flocks of crows rippling upward or leftward; sometimes there are groups going both ways at once, with wings flapping. In the Glauber algorithm, blobs of color wiggle and jiggle like agitated amoebas, but they don’t go anywhere.
At still lower temperatures (below about 1.5), the Ising world calms down. Both programs converge to the same monochrome or striped patterns, but Metropolis gets there faster.

I have been noticing these visual curiosities—the fluttering wings, the pulsating amoebas—for some time, but I have never seen them mentioned in the literature. Perhaps that’s because graphic approaches to the Ising model are of more interest to amateurs like me than to serious students of the underlying physics and mathematics. Nevertheless, I would like to understand where the patterns come from. (Some partial answers will emerge toward the end of this article.)

For those who want numbers rather than pictures, I offer the magnetization and local-correlation meters at the top of the program display. Magnetization is a global measure of the extent to which one color or the other dominates the lattice. Specifically, it is the number of dark cells minus the number of light cells, divided by the total number of cells:

\[M = \frac{\blacksquare - \square }{\blacksquare + \square}.\]

$M$ ranges from $-1$ (all light cells) through $0$ (equal numbers of light and dark cells) to $+1$ (all dark).

Local correlation examines all pairs of nearest-neighbor cells and tallies the number of like pairs minus the number of unlike pairs, divided by the total number of pairs:

\[R = \frac{(\square\square + \blacksquare\blacksquare) - (\square\blacksquare + \blacksquare\square) }{\square\square + \square\blacksquare + \blacksquare\square + \blacksquare\blacksquare}.\]

Again the range is from $-1$ to $+1$. These two quantities are both measures of order in the Ising system, but they focus on different spatial scales, global vs. local. All three of the patterns in Figure 1 have magnetization $M = 0$, but they have very different values of local correlation $R$.
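Both meters are straightforward to compute. The sketch below assumes a grid stored as a list of rows, with dark cells encoded as $+1$ and light cells as $-1$, and it uses periodic boundaries. The two test patterns are hypothetical examples of my own and are not necessarily the ones shown in Figure 1.

```python
def magnetization(grid):
    """(dark - light) / total, with dark = +1 and light = -1."""
    cells = len(grid) * len(grid[0])
    return sum(sum(row) for row in grid) / cells

def local_correlation(grid):
    """(like pairs - unlike pairs) / total pairs over nearest neighbors."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            # Counting only the east and south neighbor visits each pair once.
            total += grid[r][c] * grid[r][(c + 1) % cols]
            total += grid[r][c] * grid[(r + 1) % rows][c]
    return total / (2 * rows * cols)

checkerboard = [[1 if (r + c) % 2 == 0 else -1 for c in range(4)] for r in range(4)]
two_bands = [[1] * 4, [1] * 4, [-1] * 4, [-1] * 4]

for name, grid in (("checkerboard", checkerboard), ("two bands", two_bands)):
    print(name, magnetization(grid), local_correlation(grid))
```

The product of two neighboring spins is $+1$ for a like pair and $-1$ for an unlike pair, so summing the products gives the numerator of $R$ directly.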

Figure 1

The Ising model was invented 100 years ago by Wilhelm Lenz of the University of Hamburg, who suggested it as a thesis project for his student Ernst Ising. It was introduced as a model of a permanent magnet.

A real ferromagnet is a quantum-mechanical device. Inside, electrons in neighboring atoms come so close together that their wave functions overlap. Under these circumstances the electrons can reduce their energy slightly by aligning their spin vectors. According to the rules of quantum mechanics, an electron’s spin must point in one of two directions; by convention, the directions are labeled up and down. The ferromagnetic interaction favors pairings with both spins up or both down. Each spin generates a small magnetic dipole moment. Zillions of them acting together hold your grocery list to the refrigerator door.

In the Ising version of this structure, the basic elements are still called spins, but there is nothing twirly about them, and nothing quantum mechanical either. They are just abstract variables constrained to take on exactly two values. It really doesn’t matter whether we name the values up and down, mauve and indigo, or plus and minus. (Within the computer programs, the two values are $+1$ and $-1$, which means that flipping a spin is just a matter of multiplying by $-1$.) In this article I’m going to refer to up/down spins and dark/light cells interchangeably, adopting whichever term is more convenient at the moment.

As in a ferromagnet, nearby Ising spins want to line up in parallel; they reduce their energy when they do so. This urge to match spin directions (or cell colors) extends only to nearest neighbors; more distant sites in the lattice have no influence on one another. In the two-dimensional square lattice—the setting for all my simulations—each spin’s four nearest neighbors are the lattice sites to the north, east, south, and west (including “wraparound” neighbors for cells on the boundary lines).

If neighboring spins want to point the same way, why don’t they just go ahead and do so? The whole system could immediately collapse into the lowest-energy configuration, with all spins up or all down. That does happen, but there are complicating factors and countervailing forces. Neighborhood conflicts are the principal complication: Flipping your spin to please one neighbor may alienate another. The countervailing influence is heat. Thermal fluctuations can flip a spin even when the change is energetically unfavorable.

The behavior of the Ising model is easiest to understand at the two extremities of the temperature scale. As the temperature $T$ climbs toward infinity, thermal agitation completely overwhelms the cooperative tendencies of adjacent spins, and all possible states of the system are on an equal footing. The lattice becomes a random array of up and down spins, each of which is rapidly changing its orientation. At the opposite end of the scale, where $T$ approaches zero, the system freezes. As thermal fluctuations subside, the spins sooner or later sink into the orderly, low-energy, fully magnetized state—although “sooner or later” can stretch out toward the age of the universe.

Things get more complicated between these extremes. Experiments with real magnets show that the transition from a hot random state to a cold magnetized state is not gradual. As the material is cooled, spontaneous magnetization appears suddenly at a critical temperature called the Curie point (about 1,043 K in iron). Lenz and Ising wondered whether this abrupt onset of magnetization could be seen in their simple, abstract model. Ising was able to analyze only a one-dimensional version of the system (a line or ring of spins), and he was disappointed to see no sharp phase transition. He thought this result would hold in higher dimensions as well, but on that point he was later proved wrong.

The idealized phase diagram in Figure 2 (borrowed with amendments from my 2019 article) outlines the course of events for a two-dimensional model. To the right, above the critical temperature $T_c$, there is just one phase, in which up and down spins are equally abundant on average, although they may form transient clusters of various sizes. Below the critical point the diagram has two branches, leading to all-up and all-down states at zero temperature. As the system is cooled through $T_c$ it must follow one branch or the other, but which one is chosen is a matter of chance.

Figure 2

In this figure I have corrected another error in my 2019 article. The original graph showed magnetization increasing along what looks like a quadratic curve, with $M$ proportional to the square root of $T_c - T$. In fact magnetization is proportional to the eighth root, which makes the onset more abrupt.

The immediate vicinity of $T_c$ is the most interesting region of the phase diagram. If you scroll back up to Program 1 and set the temperature near 2.27, you’ll see filigreed patterns of all possible sizes, from single pixels up to the diameter of the lattice. The time scale of fluctuations also spans orders of magnitude, with some structures winking in and out of existence in milliseconds and others lasting long enough to test your patience.
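The value 2.27 and the eighth-root behavior both come from the exact solution of the infinite two-dimensional square-lattice model. The sketch below uses the standard textbook formulas (Onsager's critical temperature and Yang's spontaneous magnetization), in units where the coupling strength and Boltzmann's constant are both 1. Neither formula is taken from the simulation programs, and a finite $100 \times 100$ lattice will only approximate them.

```python
import math

# Critical temperature of the infinite square lattice: 2 / ln(1 + sqrt(2)).
T_C = 2.0 / math.log(1.0 + math.sqrt(2.0))

def spontaneous_magnetization(t):
    """Magnitude of the magnetization at temperature t > 0
    (zero at and above the critical temperature)."""
    if t >= T_C:
        return 0.0
    return (1.0 - math.sinh(2.0 / t) ** -4) ** 0.125

print(round(T_C, 4))
for t in (1.0, 2.0, 2.2, 2.26, 2.5):
    print(t, round(spontaneous_magnetization(t), 4))
```

The exponent of one-eighth is what makes the curve rise so steeply just below $T_c$: even a small drop in temperature yields a sizable magnetization.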

All of this complexity comes from a remarkably simple mechanism. The model makes no attempt to capture all the details of ferromagnet physics. But with minimal resources—binary variables on a plain grid with short-range interactions—we see the spontaneous emergence of cooperative, collective phenomena, as self-organizing patterns spread across the lattice. The model is not just a toy. Ten years ago Barry M. McCoy and Jean-Marie Maillard wrote:

It may be rightly said that the two dimensional Ising model… is one of the most important systems studied in theoretical physics. It is the first statistical mechanical system which can be exactly solved which exhibits a phase transition.

As I see it, the main question raised by the Ising model is this: At a specified temperature $T$, what does the lattice of spins look like? Of course “look like” is a vague notion; even if you know the answer, you’ll have a hard time communicating it except by showing pictures. But the question can be reformulated in more concrete ways. We might ask: Which configurations of the spins are most likely to be seen at temperature $T$? Or, conversely: Given a spin configuration $S$, what is the probability that $S$ will turn up when the lattice is at temperature $T$?

Intuition offers some guidance on these points. Low-energy configurations should always be more likely than high-energy ones, at any finite temperature. Differences in energy should have a stronger influence at low temperature; as the system gets warmer, thermal fluctuations can mask the tendency of spins to align. These rules of thumb are embodied in a little fragment of mathematics at the very heart of the Ising model:

\[W_B = \exp\left(\frac{-E}{k_B T}\right).\]

Here $E$ is the energy of a given spin configuration, found by scanning through the entire lattice and tallying the number of nearest-neighbor pairs that have parallel vs. antiparallel spins. In the denominator, $T$ is the absolute temperature and $k_B$ is Boltzmann’s constant, named for Ludwig Boltzmann, the Austrian maestro of statistical mechanics. The entire expression is known as the Boltzmann weight, and it determines the probability of observing any given configuration.

In standard physical units the constant $k_B$ is about $10^{-23}$ joules per kelvin, but the Ising model doesn’t really live in the world of joules and kelvins. It’s a mathematical abstraction, and we can measure its energy and temperature in any units we choose. The convention among theorists is to set $k_B = 1$, and thereby eliminate it from the formula altogether. Then we can treat both energy and temperature as if they were pure numbers, without units.
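As a concrete illustration, here is a minimal JavaScript sketch of the energy tally and the Boltzmann weight with $k_B = 1$. The function names, the $\pm 1$ encoding of spins, the coupling strength of 1, and the wraparound (periodic) boundaries are assumptions made for this example; they are not taken from the source code of Program 1.

```javascript
// Energy of a square lattice of +1/-1 spins, counting each nearest-neighbor
// pair once. Assumptions: coupling strength 1, periodic boundaries.
function latticeEnergy(lattice) {
  const n = lattice.length;
  let energy = 0;
  for (let x = 0; x < n; x++) {
    for (let y = 0; y < n; y++) {
      const s = lattice[x][y];
      const right = lattice[(x + 1) % n][y];
      const down = lattice[x][(y + 1) % n];
      // A parallel pair contributes -1, an antiparallel pair +1.
      energy -= s * right + s * down;
    }
  }
  return energy;
}

// Boltzmann weight with k_B set to 1.
function boltzmannWeight(energy, temperature) {
  return Math.exp(-energy / temperature);
}

const allUp = Array.from({ length: 4 }, () => Array(4).fill(1));
console.log(latticeEnergy(allUp));     // -32: all 32 pairs are parallel
console.log(boltzmannWeight(-32, 2));  // e^16
```

With these conventions the fully aligned lattice has the lowest possible energy and therefore the largest weight at every temperature.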

Figure 3

Figure 3 confirms that the equation for the Boltzmann weight yields curves with an appropriate general shape. Lower energies correspond to higher weights, and lower temperatures yield steeper slopes. These features make the curves plausible candidates for describing a physical system such as a ferromagnet. Proving that they are not only good candidates but the unique, true description of a ferromagnet is a mathematical and philosophical challenge that I decline to take on. Fortunately, I don’t have to. The model, unlike the magnet, is a human invention, and we can make it obey whatever laws we choose. In this case let’s simply decree that the Boltzmann distribution gives the correct relation between energy, temperature, and probability.

Note that the Boltzmann weight is said to determine a probability, not that it is a probability. It can’t be. $W_B$ can range from zero to infinity, but a probability must lie between zero and one. To get the probability of a given configuration, we need to calculate its Boltzmann weight and then divide by $Z$, the sum of the weights of all possible configurations—a process called normalization. For a model with $10{,}000$ spins there are $2^{10{,}000}$ configurations, so normalization is not a task to be attempted by direct, brute-force arithmetic.
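For a lattice small enough to enumerate, normalization can be carried out directly, which makes the role of $Z$ concrete. The sketch below does this for a $3 \times 3$ lattice with only $2^9 = 512$ configurations. The bit-packed encoding, the function names, the coupling of 1, and the periodic boundaries are all assumptions of the example.

```javascript
// Exhaustive normalization for a tiny n-by-n Ising lattice. Each
// configuration is packed into the bits of an integer (bit set = spin up),
// so this only works while n * n stays below 31.
// Assumptions: coupling 1, periodic boundaries, k_B = 1.
function energyOfBits(bits, n) {
  const spin = (x, y) => (((bits >> ((y % n) * n + (x % n))) & 1) === 1 ? 1 : -1);
  let energy = 0;
  for (let y = 0; y < n; y++) {
    for (let x = 0; x < n; x++) {
      energy -= spin(x, y) * (spin(x + 1, y) + spin(x, y + 1));
    }
  }
  return energy;
}

// Z: the sum of Boltzmann weights over every configuration.
function partitionFunction(n, temperature) {
  let z = 0;
  for (let bits = 0; bits < 2 ** (n * n); bits++) {
    z += Math.exp(-energyOfBits(bits, n) / temperature);
  }
  return z;
}

// Probability of one configuration: its weight divided by Z.
function probabilityOf(bits, n, temperature) {
  const weight = Math.exp(-energyOfBits(bits, n) / temperature);
  return weight / partitionFunction(n, temperature);
}

const size = 3;
const allSpinsUp = 2 ** (size * size) - 1;  // every bit set
console.log(partitionFunction(size, 2.0));
console.log(probabilityOf(allSpinsUp, size, 2.0));
```

The loop in `partitionFunction` doubles in length with every added spin, which is why the same approach is hopeless for $10{,}000$ spins.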

It’s a tribute to the ingenuity of mathematicians that the impossible-sounding problem of calculating $Z$ has in fact been conquered. In 1944 Lars Onsager published a complete solution of the two-dimensional Ising model—complete in the sense that it allows you to calculate the magnetization, the energy per spin, and a variety of other properties, all as a function of temperature. I would like to say more about Onsager’s solution, but I can’t. I’ve tried more than once to work my way through his paper, but it defeats me every time. I would understand nothing at all about this result if it weren’t for a little help from my friends. Barry Cipra, in a 1987 article, and Cristopher Moore and Stephan Mertens, in their magisterial tome The Nature of Computation, rederive the solution by other means. They relate the Ising model to more tractable problems in graph theory, where I am able to follow most of the steps in the argument. Even in these lucid expositions, however, I find the ultimate result unilluminating. I’ll cite just one fact emerging from Onsager’s difficult algebraic exercise. The exact location of the critical temperature, separating the magnetic from the nonmagnetic phases, is:

\[\frac{2}{\log{(1 + \sqrt{2})}} \approx 2.269185314.\]
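The critical temperature is easy to evaluate numerically. The sketch below also evaluates the closed-form spontaneous magnetization usually credited to Onsager and C. N. Yang, $M = \left(1 - \sinh^{-4}(2/T)\right)^{1/8}$ for $T \lt T_C$; that formula is not quoted in the text above and is brought in here as an outside fact, with $k_B$ and the coupling strength both set to 1. The function names are illustrative.

```javascript
// Exact critical temperature of the square-lattice Ising model
// (k_B = 1, coupling strength 1).
function criticalTemperature() {
  return 2 / Math.log(1 + Math.SQRT2);
}

// Exact spontaneous magnetization: nonzero below the critical point,
// zero at and above it.
function exactMagnetization(temperature) {
  if (temperature >= criticalTemperature()) return 0;
  return Math.pow(1 - Math.pow(Math.sinh(2 / temperature), -4), 1 / 8);
}

console.log(criticalTemperature().toFixed(9));    // 2.269185314
console.log(exactMagnetization(2.0).toFixed(4));  // 0.9113
console.log(exactMagnetization(1.0).toFixed(4));  // 0.9993
```

The exponent $1/8$ in the formula is the eighth root responsible for the abrupt onset of magnetization just below $T_C$.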

For those intimidated by the icy crags of Mt. Onsager, I can recommend the warm blue waters of Monte Carlo. The math is easier. There’s a clear, mechanistic connection between microscopic events and macroscopic properties. And there are the visualizations (that lively dance of the mauve and the indigo), which offer revealing glimpses of what’s going on behind the mathematical curtains. All that’s missing is exactness. Monte Carlo studies can pin down $T_C$ to several decimal places, but they will never give the algebraic expression found by Onsager.

The Monte Carlo method was devised in the years immediately after World War II by mathematicians and physicists working at the Los Alamos Laboratory in New Mexico. The key innovation was the use of randomness as a tool for estimation or approximation. This idea came from the mathematician Stanislaw Ulam. While recuperating from an illness, he passed the time playing a card game called Canfield solitaire. Curious about what proportion of the games were winnable, he realized he could estimate this number just by playing a great many games and recording the outcomes. That was the first application of the new method.

(This origin story is not without controversy. Statisticians point out that William Gosset (“Student”) and Lord Kelvin both calculated probabilities with random numbers circa 1900. And there’s the even earlier, though likely apocryphal, story about Le Comte de Buffon’s experiments with a randomly tossed needle or stick for estimating the value of $\pi$. Ulam replied: “It seems to me that while it is true that cavemen have already used divination and the Roman priests have tried to prophesy the future from the interiors of birds, there wasn’t anything in literature about solving differential and integral equations by means of suitable stochastic processes.”)

The second application was the design of nuclear weapons. The problem at hand was to understand the diffusion of neutrons through uranium and other materials. When a wandering neutron collided with an atomic nucleus, the neutron might be scattered in a new direction, or it might be absorbed by the nucleus and effectively disappear, or it might induce fission in the nucleus and thereby give rise to several more neutrons. Experiments had provided reasonable estimates of the probability of each of these events, but it was still hard to answer the crucial question: In a lump of fissionable matter with a particular shape, size, and composition, would the nuclear chain reaction fizzle or boom? The Monte Carlo method offered an answer by simulating the paths of thousands of neutrons, using random numbers to generate events with the appropriate probabilities. The first such calculations were done with the ENIAC, the vacuum-tube computer built at the University of Pennsylvania. Later the work shifted to the MANIAC, built at Los Alamos.

This early version of the Monte Carlo method is now sometimes called simple or naive Monte Carlo; I have also seen the term hit-or-miss Monte Carlo. The scheme served well enough for card games and for weapons of mass destruction, but the Los Alamos group never attempted to apply it to a problem anything like the Ising model. It would not have worked if they had tried. I know that because textbooks say so, but I had never seen any discussion of exactly how the model would fail. So I decided to try it for myself.

My plan was indeed simple and naive and hit-or-miss. First I generated a random sample of $10{,}000$ spin configurations, drawn independently with uniform probability from the set of all possible states of the lattice. This was easy to do: I constructed the samples by the computational equivalent of tossing a fair coin to assign a value to each spin. Then I calculated the energy of each configuration and, assuming some definite temperature $T$, assigned a Boltzmann weight. I still couldn’t convert the Boltzmann weights into true probabilities without knowing the sum of all $2^{10{,}000}$ weights, but I could sum up the weights of the $10{,}000$ configurations in the sample. Dividing each weight by this sum yields a relative probability: It estimates how frequently (at temperature $T$) we can expect to see a member of the sample relative to all the other members.

At extremely high temperatures—say $T \gt 1{,}000$—this procedure works pretty well. That’s because all configurations are nearly equivalent at those temperatures; they all have about the same relative probability. On cooling the system, I hoped to see a gradual skewing of the relative probabilities, as configurations with lower energy are given greater weight. What happens, however, is not a gradual skewing but a spectacular collapse. At $T = 2$ the lowest-energy state in my sample had a relative probability of $0.9999999979388337$, leaving just $0.00000000206117$ to be shared among the other $9{,}999$ members of the set.
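The experiment is easy to reproduce in outline. The sketch below is not the program I ran; it is a reconstruction under stated assumptions ($\pm 1$ spins, coupling 1, periodic boundaries, $k_B = 1$, and a sample of 1,000 rather than $10{,}000$ lattices to keep it quick). The minimum energy is subtracted before exponentiating, which guards against overflow without changing any of the ratios.

```javascript
// Naive ("hit-or-miss") Monte Carlo on an n-by-n Ising lattice.
function randomLattice(n) {
  return Array.from({ length: n }, () =>
    Array.from({ length: n }, () => (Math.random() < 0.5 ? 1 : -1)));
}

function latticeEnergy(lattice) {
  const n = lattice.length;
  let energy = 0;
  for (let x = 0; x < n; x++) {
    for (let y = 0; y < n; y++) {
      energy -= lattice[x][y] * (lattice[(x + 1) % n][y] + lattice[x][(y + 1) % n]);
    }
  }
  return energy;
}

// Relative probability of each sampled configuration at one temperature:
// its Boltzmann weight divided by the sum of weights in the sample.
function relativeProbabilities(energies, temperature) {
  const eMin = Math.min(...energies);
  const weights = energies.map(e => Math.exp(-(e - eMin) / temperature));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map(w => w / total);
}

const energies = Array.from({ length: 1000 }, () => latticeEnergy(randomLattice(100)));
for (const T of [1000, 50, 2]) {
  const top = Math.max(...relativeProbabilities(energies, T));
  console.log(`T = ${T}: largest relative probability = ${top}`);
}
```

At $T = 1000$ the largest relative probability stays close to $1/1000$; as $T$ falls, one configuration tends to swallow almost all of the probability.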

Figure 4

The fundamental problem is that a small sample of randomly generated lattice configurations will almost never include any states that are commonly seen at low temperature. The histograms of Figure 4 show Boltzmann distributions at various temperatures (blue) compared with the distribution of randomly generated states (red). The random distribution is a slender peak centered at zero energy. There is slight overlap with the Boltzmann distribution at $T = 50$, but none whatever for lower temperatures.

There’s actually some good news in this fiasco. The failure of random sampling indicates that the interesting states of the Ising system—those which give the model its distinctive behavior—form a tiny subset buried within the enormous space of $2^{10{,}000}$ configurations. If we can find a way to focus on that subset and ignore the rest, the job will be much easier.

The means to focus more narrowly came with a second wave of Monte Carlo methods, also emanating from Los Alamos. The foundational document was a paper titled “Equation of State Calculations by Fast Computing Machines,” published in 1953. Among the five authors, Nicholas Metropolis was listed first (presumably for alphabetical reasons), and his name remains firmly attached to the algorithm presented in the paper.

With admirable clarity, Metropolis et al. explain the distinction between the old and the new Monte Carlo: “[I]nstead of choosing configurations randomly, then weighting them with $\exp(-E/kT)$, we choose configurations with a probability $\exp(-E/kT)$ and weight them evenly.” Starting from an arbitrary initial state, the scheme makes small, random modifications, with a bias favoring configurations with a lower energy (and thus higher Boltzmann weight), but not altogether excluding moves to higher-energy states. After many moves of this kind, the system is almost certain to be meandering through a neighborhood that includes the most probable configurations. Methods based on this principle have come to be known as MCMC, for Markov chain Monte Carlo. The Metropolis algorithm and Glauber dynamics are the best-known exemplars.

Roy Glauber also had Los Alamos connections. He worked there during World War II, in the same theory division that was home to Ulam, John von Neumann, Hans Bethe, Richard Feynman, and many other notables of physics and mathematics. But Glauber was a very junior member of the group; he was 18 when he arrived, and a sophomore at Harvard. His one paper on the Ising model was published two decades later, in 1963, and makes no mention of his former Los Alamos colleagues. It also makes no mention of Monte Carlo methods; nevertheless, Glauber dynamics has been taken up enthusiastically by the Monte Carlo community.

When applied to the Ising model, both the Metropolis algorithm and Glauber dynamics work by focusing attention on a single spin at each step, and either flipping the selected spin or leaving it unchanged. Thus the system passes through a sequence of states that differ by at most one spin flip. Statistically speaking, this procedure sounds a little dodgy. Unlike the naive Monte Carlo approach, where successive states are completely independent, MCMC generates configurations that are closely correlated. It’s a biased sample. To overcome the bias, the MCMC process has to run long enough for the correlations to fade away. With a lattice of $N$ sites, a common protocol retains only every $N$th sample, discarding all those in between.

The mathematical justification for the use of correlated samples is the theory of Markov chains, devised by the Russian mathematician A. A. Markov circa 1900. It is a tool for calculating probabilities when each event depends on the previous event. And, in the Monte Carlo method, it allows one to work with those probabilities without getting bogged down in the morass of normalization.

The Metropolis and the Glauber algorithms are built on the same armature. They both rely on two main components: a visitation sequence and an acceptance function. The visitation sequence determines which lattice site to visit next; in effect, it shines a spotlight on one selected spin, proposing to flip it to the opposite orientation. The acceptance function determines whether to accept this proposal (and flip the spin) or reject it (and leave the existing spin direction unchanged). Each iteration of this two-phase process constitutes one “microstep” of the Monte Carlo procedure. Repeating the procedure $N$ times constitutes a “macrostep.” Thus one macrostep amounts to one microstep per spin.

In the Metropolis algorithm, the visitation order is deterministic. The program sweeps through the lattice methodically, repeating the same sequence of visits in every macrostep. The original 1953 presentation of the algorithm did not prescribe any specific sequence, but the procedure was clearly designed to visit each site exactly once during a sweep. The version of the Metropolis algorithm in Program 1 adopts the most obvious deterministic option: “typewriter order.” The program chugs through the first row of the lattice from left to right, then goes through the second row in the same way, and so on down to the bottom.

Glauber dynamics takes a different approach: At each microstep the algorithm selects a single spin at random, with uniform probability, from the entire set of $N$ spins. In other words, every spin has a $1 / N$ chance of being chosen at each microstep, whether or not it has been chosen before. A macrostep lasts for $N$ microsteps, but the procedure does not guarantee that every spin will get a turn during every sweep. Some sites will be passed over, while others are visited more than once. Still, as the number of steps goes to infinity, all the sites eventually get equal attention.
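A quick calculation puts a number on how many sites are passed over. The chance that a particular site is never selected in $N$ independent uniform draws is $(1 - 1/N)^N$, which approaches $1/e \approx 0.368$ for large $N$. The sketch below compares that formula with a direct simulation of one macrostep; the function names are illustrative.

```javascript
// Probability that one particular site is never chosen during N uniform draws.
function missProbability(N) {
  return Math.pow(1 - 1 / N, N);
}

// Simulate one macrostep of N random visits and report the fraction of
// sites that were never visited.
function simulateMissedFraction(N) {
  const visited = new Uint8Array(N);
  for (let i = 0; i < N; i++) {
    visited[Math.floor(Math.random() * N)] = 1;
  }
  let missed = 0;
  for (let i = 0; i < N; i++) {
    if (visited[i] === 0) missed++;
  }
  return missed / N;
}

console.log(missProbability(10000).toFixed(4));  // 0.3679
console.log(simulateMissedFraction(10000));      // near 0.368, varies from run to run
```

So in a typical macrostep on a $100 \times 100$ lattice, roughly a third of the spins get no turn at all.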

So much for the visitation sequence; now on to the acceptance function. It has three parts:

Calculate $\Delta E$, the change in energy that would result from flipping the selected spin $s$. To determine this value, we need to examine $s$ itself and its four nearest neighbors.

Based on $\Delta E$ and the temperature $T$, calculate the probability $p$ of flipping spin $s$.

Generate a random number $r$ in the interval $[0, 1)$. If $r \lt p$, flip the selected spin; otherwise leave it as is.
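The three parts can be sketched in a few lines of JavaScript. This is an illustration rather than the code behind Program 1: the names `deltaEForFlip` and `microstep` are hypothetical, and the example assumes $\pm 1$ spins, a coupling strength of 1, and periodic boundaries. The function that supplies $p$ in part 2 is passed in as a parameter, since the two algorithms differ on that point.

```javascript
// Part 1: energy change from flipping the spin at (x, y).
// Assumptions: +1/-1 spins, coupling 1, periodic (wraparound) boundaries.
function deltaEForFlip(lattice, x, y) {
  const n = lattice.length;
  const neighborSum =
    lattice[(x + 1) % n][y] + lattice[(x + n - 1) % n][y] +
    lattice[x][(y + 1) % n] + lattice[x][(y + n - 1) % n];
  // All four bonds change sign, so deltaE = 2 * s * (sum of neighbors).
  return 2 * lattice[x][y] * neighborSum;
}

// One microstep at a given site. flipProbability is any function that maps
// (deltaE, temperature) to a probability in [0, 1].
function microstep(lattice, x, y, temperature, flipProbability) {
  const deltaE = deltaEForFlip(lattice, x, y);      // part 1
  const p = flipProbability(deltaE, temperature);   // part 2
  const flipped = Math.random() < p;                // part 3
  if (flipped) lattice[x][y] *= -1;
  return flipped;
}

const grid = Array.from({ length: 4 }, () => Array(4).fill(1));
console.log(deltaEForFlip(grid, 1, 1));  // 8: four aligned neighbors
```

Because `Math.random()` returns values in $[0, 1)$, a probability of exactly 1 always produces a flip and a probability of exactly 0 never does.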

Part 2 of the acceptance rule calls for a mathematical function that maps values of $\Delta E$ and $T$ to a probability $p$. To be a valid probability, $p$ must be confined to the interval $[0, 1]$. To make sense in the context of the Monte Carlo method, the function should assign a higher probability to spin flips that reduce the system’s energy, without totally excluding those that bring an energy increase. And this preference for negative $\Delta E$ should grow sharper as $T$ gets lower. The specific functions chosen by the Metropolis and the Glauber algorithms satisfy both of these criteria.

Let’s begin with the Glauber acceptance function, which I’m going to call the G-rule:

\[p = \frac{e^{-\Delta E/T}}{1 + e^{-\Delta E/T}}.\]

Parts of this equation should look familiar. The expression for the Boltzmann weight, $e^{-\Delta E/T}$, appears twice, except that the configuration energy $E$ is replaced by $\Delta E$, the change in energy when a specific spin is flipped. But where the Boltzmann weight ranges from zero to infinity, the quotient of exponentials in the G-rule stays within the bounds of $0$ to $1$. The curve at right shows the probability distribution for $T = 2.7$, near the critical point for the onset of magnetization. To get a qualitative understanding of the form of this curve, consider what happens when $\Delta E$ grows without bound toward positive infinity: The numerator of the fraction goes to $0$ while the denominator goes to $1$, leaving a quotient that approaches $0.0$. At the other end of the curve, as $\Delta E$ goes to negative infinity, both numerator and denominator increase without limit, and the probability approaches (but never quite reaches) $1.0$. Between these extremes, the curve is symmetrical and smooth. It looks like it would make a pleasant ski run.

Figure 5

The Metropolis acceptance criterion also includes the expression $e^{-\Delta E/T}$, but the function and the curve are quite different. The acceptance probability is defined in a piecewise fashion:

\[p = \left\{\begin{array}{cl}
1 & \text { if } \quad \Delta E \leq 0 \\
e^{-\Delta E/T} & \text { if } \quad \Delta E>0
\end{array}\right.\]

Figure 6

In words, the rule says: If flipping a spin would reduce the energy of the system or leave it unchanged, always do it; otherwise, flip the spin with probability $e^{-\Delta E/T}$. The probability curve (left) has a steep escarpment; if this one is a ski slope, it rates a black diamond. Unlike the smooth and symmetrical Glauber curve, this one has a sharp corner, as well as a strong bias. Consider a spin with $\Delta E = 0$. Glauber flips such a spin with probability $1/2$, but Metropolis always flips it.

The graphs in Figure 7 compare the two acceptance functions over a range of temperatures. The curves differ most at the highest temperatures, and they become almost indistinguishable at the lowest temperatures, where both curves approximate a step function. Although both functions are defined over the entire real number line, the two-dimensional Ising model allows $\Delta E$ to take on only five distinct values: $-8, -4, 0, +4,$ and $+8$. Thus the Ising probability functions are never evaluated anywhere other than the positions marked by colored dots.
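Since only five values of $\Delta E$ ever occur, the two rules can be tabulated side by side for any temperature. The sketch below does so; the names `gRule`, `mRule`, and `acceptanceTable` are illustrative. The G-rule is written as $1/(1 + e^{\Delta E/T})$, which is algebraically the same as the quotient given earlier (multiply numerator and denominator by $e^{\Delta E/T}$) but avoids dividing infinity by infinity at very low temperatures.

```javascript
// Flip probability under the Glauber rule (k_B = 1).
function gRule(deltaE, temperature) {
  return 1 / (1 + Math.exp(deltaE / temperature));
}

// Flip probability under the Metropolis rule (k_B = 1).
function mRule(deltaE, temperature) {
  return deltaE <= 0 ? 1 : Math.exp(-deltaE / temperature);
}

// The five energy changes possible on the square lattice, with the flip
// probability each rule assigns at the given temperature.
function acceptanceTable(temperature) {
  return [-8, -4, 0, 4, 8].map(deltaE => ({
    deltaE,
    glauber: gRule(deltaE, temperature),
    metropolis: mRule(deltaE, temperature),
  }));
}

console.table(acceptanceTable(2.7));
```

A simulation could compute such a table once per temperature change and then look up $p$ instead of calling `Math.exp` at every microstep, though whether that matters for speed would need to be measured.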

Figure 7

Here are JavaScript functions that implement a macrostep in each of the algorithms, with their differences in both visitation sequence and acceptance function:

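function runMetro() {
  // One macrostep: visit every site once, in typewriter order.
  for (let y = 0; y < gridSize; y++) {
    for (let x = 0; x < gridSize; x++) {
      let deltaE = calcDeltaE(x, y);  // energy change if this spin flips
      let boltzmann = Math.exp(-deltaE/temperature);
      // M-rule: always flip if energy does not rise; otherwise flip with probability e^(-deltaE/T).
      if ((deltaE <= 0) || (Math.random() < boltzmann)) {
        lattice[x][y] *= -1;
      }
    }
  }
  drawLattice();
}

function runGlauber() {
  // One macrostep: N visits to sites chosen uniformly at random.
  for (let i = 0; i < N; i++) {
    let x = Math.floor(Math.random() * gridSize);
    let y = Math.floor(Math.random() * gridSize);
    let deltaE = calcDeltaE(x, y);  // energy change if this spin flips
    let boltzmann = Math.exp(-deltaE/temperature);
    // G-rule: flip with probability e^(-deltaE/T) / (1 + e^(-deltaE/T)).
    if (Math.random() < (boltzmann / (1 + boltzmann))) {
      lattice[x][y] *= -1;
    }
  }
  drawLattice();
}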

(As I mentioned above, the rest of the source code for the simulations is available on GitHub.)

We’ve seen that the Metropolis and the Glauber algorithms differ in their choice of both visitation sequence and acceptance function. They also produce different patterns or textures when you watch them in action on the computer screen. But what about the numbers? Do they predict different properties for the Ising ferromagnet?

A theorem mentioned throughout the MCMC literature says that these two algorithms (and others like them) should give identical results when properties of the model are measured at thermal equilibrium. I have encountered this statement many times in my reading, but until a few weeks ago I had never tested it for myself. Here are some magnetization data that look fairly convincing:

Magnitude of Magnetization

	T = 1.0	T = 2.0	T = 2.7	T = 3.0	T = 5.0	T = 10.0
Metropolis	0.9993	0.9114	0.0409	0.0269	0.0134	0.0099
Glauber	0.9993	0.9118	0.0378	0.0274	0.0136	0.0100

The table records the absolute value of the magnetization in Metropolis and Glauber simulations at various temperatures. Five of the six measurements differ by less than $0.001$; the exception comes at $T = 2.7$, near the critical point, where the difference rises to about $0.003$. Note that the results are consistent with the presence of a phase transition: Magnetization remains close to $0$ down to the critical point and then approaches $1$ at lower temperatures. (By reporting the magnitude, or absolute value, of the magnetization, we treat all-up and all-down states as equivalent.)

I made the measurements by first setting the temperature and then letting the simulation run for at least 1,000 macrosteps in order to reach an equilibrium condition. Following this “burn-in” period, the simulation continued for another 100 macrosteps; during this phase I counted up and down spins after each macrostep. The entire procedure, including both burn-in and measurement periods, was repeated 100 times, after which I averaged all the measurements for each temperature.

How do I know that 1,000 macrosteps is enough to reach equilibrium? There is a fascinating body of work on this question, full of ingenious ideas, such as running two simulations that approach equilibrium from opposite directions and waiting until they agree. I took the duller approach of just waiting until the numbers stopped changing. I also started each run from an initial state on the same side of $T_C$ as the target temperature.
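The burn-in and measurement protocol can be sketched as follows. This is a scaled-down illustration, not the console commands used to produce the table: it assumes a small lattice, $\pm 1$ spins, coupling 1, periodic boundaries, and Metropolis sweeps in typewriter order, and the names and step counts are chosen for the example.

```javascript
// One Metropolis macrostep in typewriter order.
function metropolisSweep(lattice, temperature) {
  const n = lattice.length;
  for (let y = 0; y < n; y++) {
    for (let x = 0; x < n; x++) {
      const neighborSum =
        lattice[(x + 1) % n][y] + lattice[(x + n - 1) % n][y] +
        lattice[x][(y + 1) % n] + lattice[x][(y + n - 1) % n];
      const deltaE = 2 * lattice[x][y] * neighborSum;
      if (deltaE <= 0 || Math.random() < Math.exp(-deltaE / temperature)) {
        lattice[x][y] *= -1;
      }
    }
  }
}

// Net magnetization per spin, between -1 and +1.
function magnetization(lattice) {
  let sum = 0;
  for (const column of lattice) for (const s of column) sum += s;
  return sum / (lattice.length * lattice.length);
}

// Start ordered below T_C and random above it, discard the burn-in
// macrosteps, then average |M| over the measurement macrosteps.
function measureMagnetization(n, temperature, burnIn, measured) {
  const tC = 2 / Math.log(1 + Math.SQRT2);
  const lattice = Array.from({ length: n }, () =>
    Array.from({ length: n }, () =>
      (temperature < tC ? 1 : (Math.random() < 0.5 ? 1 : -1))));
  for (let i = 0; i < burnIn; i++) metropolisSweep(lattice, temperature);
  let total = 0;
  for (let i = 0; i < measured; i++) {
    metropolisSweep(lattice, temperature);
    total += Math.abs(magnetization(lattice));
  }
  return total / measured;
}

console.log(measureMagnetization(32, 1.0, 200, 100));   // close to 1
console.log(measureMagnetization(32, 10.0, 200, 100));  // close to 0
```

Repeating `measureMagnetization` many times and averaging the results would correspond to the 100 repetitions described above.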

When I first looked at these results and saw the close match between Metropolis and Glauber, I felt a twinge of paradoxical surprise. I call it paradoxical because I knew before I started what I would see, and that’s exactly what I did see, so obviously I should not have been surprised at all. But some part of my mind didn’t get that memo, and as I watched the two algorithms converge to the same values all across the temperature scale, it seemed remarkable.

The theory behind this convergence was apparently understood by the pioneers of MCMC in the 1950s. The theorem states that any MCMC algorithm will produce the same distribution of states at equilibrium, as long as the algorithm satisfies two conditions, called ergodicity and detailed balance.

The adjective ergodic was coined by Boltzmann, and is usually said to have the Greek roots εργον οδος, meaning something like “energy path.” Giovanni Gallavotti disputes this etymology, suggesting a derivation from εργον ειδος, which he translates as “monode with a given energy.” Take your pick.

Ergodicity requires that the system be able to move from any one configuration to any other configuration in a finite number of steps. In other words, there are no cul-de-sac states you might wander into and never be able to escape, or border walls that divide the space into isolated regions. The Metropolis and Glauber algorithms satisfy this condition because every transition between states has a nonzero probability. (In both algorithms the acceptance probability comes arbitrarily close to zero but never touches it.) In the specific case of the $100 \times 100$ lattice I’ve been playing with, any two states are connected by a path of no more than $10{,}000$ steps.

Both algorithms also exhibit detailed balance, which is essentially a requirement of reversibility. Suppose that while watching a model run, you observe a transition from state $A$ to state $B$. Detailed balance says that if you continue observing long enough, you will see the inverse transition $B \rightarrow A$ with the same frequency as $A \rightarrow B$. Given the shapes of the acceptance curves, this assertion may seem implausible. If $A \rightarrow B$ is energetically favorable, then $B \rightarrow A$ must be unfavorable, and it will have a lower probability. But there’s another factor at work here. Remember we are assuming the system is in equilibrium, which implies that the occupancy of each state—or the amount of time the system spends in that state—is proportional to the state’s Boltzmann weight. Because the system is more often found in state $B$, the transition $B \rightarrow A$ has more chances to be chosen, counterbalancing the lower intrinsic probability.
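The balance condition can be written out and checked against both acceptance rules. What follows is a sketch under stated assumptions: the move is a single spin flip that is proposed with the same probability in both directions, $\pi(S)$ denotes the equilibrium probability of state $S$, $P$ is the transition probability, $p$ is the acceptance probability, and $\Delta E = E_B - E_A$. Detailed balance then requires

\[\pi(A)\,P(A \rightarrow B) = \pi(B)\,P(B \rightarrow A) \quad\Longrightarrow\quad \frac{p(\Delta E)}{p(-\Delta E)} = \frac{\pi(B)}{\pi(A)} = e^{-\Delta E/T}.\]

The normalizing sum $Z$ cancels in the ratio $\pi(B)/\pi(A)$, so it never has to be computed. For the G-rule, multiplying numerator and denominator by $e^{\Delta E/T}$ gives $p(\Delta E) = 1/(1 + e^{\Delta E/T})$, and therefore

\[\frac{p(\Delta E)}{p(-\Delta E)} = \frac{1 + e^{-\Delta E/T}}{1 + e^{\Delta E/T}} = \frac{e^{-\Delta E/T}\left(e^{\Delta E/T} + 1\right)}{1 + e^{\Delta E/T}} = e^{-\Delta E/T}.\]

For the M-rule with $\Delta E \gt 0$, the uphill move is accepted with probability $e^{-\Delta E/T}$ and the reverse, downhill move with probability $1$, so the ratio is again $e^{-\Delta E/T}$; the case $\Delta E \lt 0$ follows by exchanging the roles of $A$ and $B$.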

The claim that Metropolis and Glauber yield identical results applies only when the Ising system is in equilibrium—poised at the eternal noon where the sun stands still and nothing ever changes. For Metropolis and his colleagues at Los Alamos in the early 1950s, understanding the equilibrium behavior of a computational model was challenge enough. They were coaxing answers from a computer with about four kilobytes of memory. Ten years later, however, Glauber wanted to look beyond equilibrium. For example, he wanted to know what happens when the temperature suddenly changes. How do the spins reorganize themselves during the transient period between one equilibrium state and another? He designed his version of the Ising model specifically to deal with such dynamic situations. He wrote in his 1963 paper:

If the mathematical problems of equilibrium statistical mechanics are great, they are at least relatively well-defined. The situation is quite otherwise in dealing with systems which undergo large-scale changes with time…. We have attempted, therefore, to devise a form of the Ising model whose behavior can be followed exactly, in statistical terms, as a function of time.

So how does the Glauber model behave following an abrupt change in temperature? And how does it compare with the Metropolis model? Let’s try the experiment. We’ll simulate an Ising lattice at high temperature ($T = 10.0$), and let the program run long enough to be sure the system is in thermal equilibrium. Then we’ll instantaneously lower the temperature to $T = 2.0$, which is well below the critical point. During this flash-freeze process, we’ll monitor the magnetization of the lattice. The graph below records the magnetization after every tenth Monte Carlo macrostep. The curves are averages computed over 500 repetitions of the experiment.

(The data were gathered with Program 1, but using commands that have to be invoked from the console rather than the web interface. See the source code for details.)

Figure 8

Clearly, in this dynamic situation, the algorithms are not identical or interchangeable. The Metropolis program adapts more quickly to the cooler environment; Glauber produces a slower but steadier rise in magnetization. The curves differ in shape, with Metropolis exhibiting a distinctive “knee” where the slope flattens. I want to know what causes these differences, but before digging into that question it seems important to understand why both algorithms are so agonizingly slow. At the right edge of the graph the blue Metropolis curve is approaching the equilibrium value of magnetization (which is about 0.91), but it has taken 7,500 Monte Carlo macrosteps (or 75 million microsteps) to get there. The red Glauber curve will require many more. What’s the holdup?

To put this sluggishness in perspective, let’s look at the behavior of local spin correlations measured under the same circumstances. Graphing the average nearest-neighbor correlation following a sudden temperature drop produces these hockey-stick curves:

Figure 9

The response is dramatically faster; both algorithms reach quite high levels of local correlation within just a few macrosteps.

For a hint of why local correlations grow so much faster than global magnetization, it’s enough to spend a few minutes watching the Ising simulation evolve on the computer screen. When the temperature plunges from warm $T = 5$ to frigid $T = 2$, nearby spins have a strong incentive to line up in parallel, but magnetization does not spread uniformly across the entire lattice. Small clusters of aligned spins start expanding, and they merge with other clusters of the same polarity, thereby growing larger still. It doesn’t take long, however, before clusters of opposite polarity run into one another, blocking further growth for both. From then on, magnetization is a zero-sum game: The up team can win only if the down team loses.

Figure 10

Figure 10 shows the first few Monte Carlo macrosteps following a flash freeze. The initial configuration at the upper left reflects the high-temperature state, with a nearly random, salt-and-pepper mix of up and down spins. The rest of the snapshots (reading left to right and top to bottom) show the emergence of large-scale order. Prominent clusters appear after the very first macrostep, and by the second or third step some of these blobs have grown to include hundreds of lattice sites. But the rate of change becomes sluggish thereafter. The balance of power may tilt one way and then the other, but it’s hard for either side to gain a permanent advantage. The mottled, camouflage texture will persist for hundreds or thousands of steps.

If you choose a single spin at random from such a mottled lattice, you’ll almost surely find that it lives in an area where most of the neighbors have the same orientation. Hence the high levels of local correlation. But that fact does not imply that the entire array is approaching unanimity. On the contrary, the lattice can be evenly divided between up and down domains, leaving a net magnetization near zero. (Yes, it’s like political polarization, where homogeneous states add up to a deadlocked nation.)
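A toy example makes the distinction concrete. The sketch below computes both quantities for a lattice split into an up half and a down half. The definition of correlation used here (the average product of neighboring spins, each pair counted once, with periodic boundaries) is an assumption of the example and may differ in detail from the measure plotted in Figure 9.

```javascript
// Net magnetization per spin of a +1/-1 lattice.
function magnetization(lattice) {
  let sum = 0;
  for (const column of lattice) for (const s of column) sum += s;
  return sum / (lattice.length * lattice.length);
}

// Mean product of nearest-neighbor spins: +1 if every pair is parallel,
// -1 if every pair is antiparallel. Assumes periodic boundaries.
function neighborCorrelation(lattice) {
  const n = lattice.length;
  let sum = 0;
  for (let x = 0; x < n; x++) {
    for (let y = 0; y < n; y++) {
      sum += lattice[x][y] * (lattice[(x + 1) % n][y] + lattice[x][(y + 1) % n]);
    }
  }
  return sum / (2 * n * n);
}

// An 8-by-8 lattice divided into an up half and a down half.
const split = Array.from({ length: 8 }, (_, x) => Array(8).fill(x < 4 ? 1 : -1));
console.log(magnetization(split));        // 0
console.log(neighborCorrelation(split));  // 0.75
```

Only the 16 neighbor pairs that straddle the two domain walls are antiparallel; the other 112 pairs agree, so the local measure is high even though the global one is exactly zero.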

The images in Figure 11 show three views of the same state of an Ising lattice. At left is the conventional representation, with sinuous, interlaced territories of nearly pure up and down spins. The middle panel shows the same configuration recolored according to the local level of spin correlation. The vast majority of sites (lightest hue) are surrounded by four neighbors of the same orientation; they correspond to both the mauve and the indigo regions of the leftmost image. Only along the boundaries between domains is there any substantial conflict, where darker colors mark cells whose neighbors include spins of the opposite orientation. The panel at right highlights a special category of sites: those with exactly two parallel and two antiparallel neighbors. They are special because they are tiny neutral territories wedged between the contending factions. Flipping such a spin does not alter its correlation status; both before and after it has two like and two unlike neighbors. Flipping a neutral spin also does not alter the total energy of the system. But it can shift the magnetization. Indeed, flipping such “neutral” spins is the main agent of evolution in the Ising system at low temperature.

Figure 11

The struggle to reach full magnetization in an Ising lattice looks like trench warfare. Contending armies, almost evenly matched, face off over the boundary lines between up and down territories. All the action is along the borders; nothing that happens behind the lines makes much difference. Even along the boundaries, some sections of the front are static. If a domain margin is a straight line parallel to the $x$ or $y$ axis, the sites on each side of the border have three friendly neighbors and only one enemy; they are unlikely to flip. The volatile neutral sites that make movement possible appear only at corners and along diagonals, where neighborhoods are evenly split.

Neutral sites become rare as the light and dark regions coalesce into fewer but larger blobs. This scarcity of freely flippable spins leaves the Ising gears grinding without lubricant, and not making much progress. The situation is particularly acute in those cases where a broad stripe extends all the way across the lattice from left to right or from top to bottom. If the stripe’s boundaries are exactly horizontal or vertical, there will be no neutral sites at all. I’ll return to this situation below.

(There are Monte Carlo algorithms that flip only neutral spins. They have the pleasant property of conserving energy, which is not true of the Metropolis and Glauber algorithms.)
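The claim about straight-edged stripes is easy to check by counting. In the sketch below a site is neutral when its four neighbors sum to zero, which is the same as having two parallel and two antiparallel neighbors. The function name and the small test lattices are illustrative, and periodic boundaries are assumed.

```javascript
// Count "neutral" sites: spins with exactly two up and two down neighbors,
// so that flipping them leaves the energy unchanged.
// Assumption: +1/-1 spins on a square lattice with periodic boundaries.
function countNeutralSites(lattice) {
  const n = lattice.length;
  let count = 0;
  for (let x = 0; x < n; x++) {
    for (let y = 0; y < n; y++) {
      const neighborSum =
        lattice[(x + 1) % n][y] + lattice[(x + n - 1) % n][y] +
        lattice[x][(y + 1) % n] + lattice[x][(y + n - 1) % n];
      if (neighborSum === 0) count++;
    }
  }
  return count;
}

// A stripe with perfectly straight edges has no neutral sites.
const stripe = Array.from({ length: 8 }, (_, x) => Array(8).fill(x < 4 ? 1 : -1));
console.log(countNeutralSites(stripe));  // 0

// A 2-by-2 block of down spins in a sea of up spins: all four cells of the
// block sit at corners and are neutral.
const block = Array.from({ length: 8 }, () => Array(8).fill(1));
block[3][3] = block[3][4] = block[4][3] = block[4][4] = -1;
console.log(countNeutralSites(block));   // 4
```

In the stripe every boundary site has three friendly neighbors and one enemy, so nothing can flip without an energy cost of 4.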

From these observations and ruminations I feel I’ve acquired some intuition about why my Monte Carlo simulations bog down during the transition from a chaotic to an ordered state. But why is the Glauber algorithm even slower than the Metropolis?

Since the schemes differ in two features—the visitation sequence and the acceptance function—it makes sense to investigate which of those features has the greater effect on the convergence rate. That calls for another computational experiment.

The tableau below is a mix-and-match version of the MCMC Ising simulation. In the control panel you can choose the visitation order and the acceptance function independently. If you select a deterministic visitation order and the M-rule acceptance function, you have the classical Metropolis algorithm. Likewise random order and the G-rule correspond to Glauber dynamics. But you can also pair deterministic order with the G-rule or random order with the M-rule. (The latter mixed-breed choice is what I unthinkingly implemented in my 2019 program.)
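The mix-and-match idea can be sketched in a few lines of Python. This is only an illustration, not the code behind the tableau: the function names, the list-of-lists lattice, and the `order` and `rule` parameters are all assumptions, and the G-rule is written in its usual textbook form $1 / (1 + e^{\Delta E / T})$.

```python
import math
import random

def accept_probability(delta_e, temperature, rule):
    """Flip probability for an energy change delta_e under rule 'M' or 'G'."""
    if rule == "M":   # Metropolis rule: always accept favorable or neutral flips
        return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)
    if rule == "G":   # Glauber rule (assumed textbook form)
        return 1.0 / (1.0 + math.exp(delta_e / temperature))
    raise ValueError(f"unknown rule: {rule}")

def macrostep(spins, temperature, order="typewriter", rule="M", rng=random):
    """Run n*n microsteps on an n-by-n wraparound lattice, modifying it in place."""
    n = len(spins)
    for k in range(n * n):
        if order == "typewriter":      # deterministic: each site exactly once
            row, col = divmod(k, n)
        elif order == "random":        # random: sites may repeat or be skipped
            row, col = rng.randrange(n), rng.randrange(n)
        else:
            raise ValueError(f"unknown order: {order}")
        s = spins[row][col]
        neighbor_sum = (spins[(row - 1) % n][col] + spins[(row + 1) % n][col]
                        + spins[row][(col - 1) % n] + spins[row][(col + 1) % n])
        delta_e = 2 * s * neighbor_sum  # one of -8, -4, 0, +4, +8
        if rng.random() < accept_probability(delta_e, temperature, rule):
            spins[row][col] = -s
```

Calling `macrostep(lattice, 2.0, order="typewriter", rule="M")` corresponds to the classical Metropolis pairing, and `order="random", rule="G"` to Glauber dynamics; the other two combinations are the mixed breeds. At extremely low temperatures the exponential in the G-rule branch can overflow, which a more careful version would guard against.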

I have also included an acceptance rule labeled M*, which I’ll explain below.

Watching the screen while switching among these alternative components reveals that all the combinations yield different visual textures, at least at some temperatures. Also, it appears there’s something special about the pairing of deterministic visitation order with the M-rule acceptance function (i.e., the standard Metropolis algorithm).

Try setting the temperature to 2.5 or 3.0. I find that the distinctive sensation of fluttery motion—bird flocks migrating across the screen—appears only with the deterministic/M-rule combination. With all other pairings, I see amoeba-like blobs that grow and shrink, fuse and divide, but there’s not much coordinated motion.

Now lower the temperature to about 1.5, and alternately click Run and Reset until you get a persistent bold stripe that crosses the entire grid either horizontally or vertically. (This may take several tries. Diagonal stripes are also possible, but rare.) Again the deterministic/M-rule combination is different from all the others. With this mode, the stripe appears to wiggle across the screen like a millipede, either right to left or bottom to top. Changing either the visitation order or the acceptance function suppresses this peristaltic motion; the stripe may still have pulsating bulges and constrictions, but they’re not going anywhere.

These observations suggest some curious interactions between the visitation order and the acceptance function, but they do not reveal which factor gives the Metropolis algorithm its speed advantage. Using the same program, however, we can gather some statistical data that might help answer the question.

Figure 12

These curves were a surprise to me. From my earlier experiments I already knew that the Metropolis algorithm—the combination of elements in the blue curve—would outperform the Glauber version, corresponding to the red curve. But I expected the acceptance function to account for most of the difference. The data do not support that supposition. On the contrary, they suggest that both elements matter, and the visitation sequence may even be the more important one. A deterministic visitation order beats a random order no matter which acceptance function it is paired with.

My expectations were based mainly on discussions of the “mixing time” for various Monte Carlo algorithms. Mixing time is the number of steps needed for a simulation to reach equilibrium from an arbitrary initial state, or in other words the time needed for the system to lose all memory of how it began. If you care only about equilibrium properties, then an algorithm that offers the shortest mixing time is likely to be preferred, since it also minimizes the number of CPU cycles you have to waste before you can start taking data. Discussions of mixing time tend to focus on the acceptance function, not the visitation sequence. In particular, the M-rule acceptance function of the Metropolis algorithm was explicitly designed to minimize mixing time.

What I am measuring in my experiments is not exactly mixing time, but it’s closely related. Going from an arbitrary initial state to equilibrium at a specified temperature is much like a transition from one temperature to another. What’s going on inside the model is similar. Thus if the acceptance function determines the mixing time, I would expect it also to be the major factor in adapting to a new temperature regime.

On the other hand, I can offer a plausible-sounding theory of why visitation order might matter. The deterministic model scans through all $10{,}000$ lattice sites during every Monte Carlo macrostep; each such sweep is guaranteed to visit every site exactly once. The random order makes no such promise. In that algorithm, each microstep selects a site at random, whether or not it has been visited before. A macrostep concludes after $10{,}000$ such random choices. Under this protocol some sites are passed over without being selected even once, while others are chosen two or more times. How many sites are likely to be missed? During each microstep, every site has the same probability of being chosen, namely $1 / 10{,}000$. Thus the probability of not being selected on any given turn is $9{,}999 / 10{,}000$. For a site to remain unvisited throughout an entire macrostep, it must be passed over $10{,}000$ times in a row. The probability of that event is $(9{,}999 / 10{,}000)^{10{,}000}$, which works out to about $0.368$. (For $N$ sites, the probability is $((N - 1) / N)^N$; as $N$ goes to infinity this expression converges to $1 / e \approx 0.367879$.) Thus in each macrostep roughly $3{,}700$ of the $10{,}000$ spins are simply never called on. They have no chance of being flipped no matter what the acceptance function might say.
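The arithmetic is easy to check numerically. The sketch below computes the exact probability for a given lattice size and also counts the unvisited sites in one simulated macrostep of random selection; the function names are made up for this illustration.

```python
import math
import random

def prob_never_selected(n_sites):
    """Probability that a given site is skipped in n_sites random draws."""
    return (1.0 - 1.0 / n_sites) ** n_sites

def count_unvisited(n_sites, rng):
    """Simulate one macrostep of random selection and count the skipped sites."""
    visited = [False] * n_sites
    for _ in range(n_sites):
        visited[rng.randrange(n_sites)] = True
    return visited.count(False)

if __name__ == "__main__":
    n = 10_000
    print("exact probability:", prob_never_selected(n))
    print("limit 1/e:        ", 1.0 / math.e)
    print("unvisited sites:  ", count_unvisited(n, random.Random(2023)))
```

The simulated count varies from run to run but should land near $3{,}679$, the expected number of skipped sites.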

Excluding more than a third of the sites on every pass through the lattice seems certain to have some effect on the outcome of an experiment. In the long run the random selection process is fair, in the sense that every spin is sampled at the same frequency. But the rate of convergence to the equilibrium state may well be lower.

There are also compelling arguments for the importance of the acceptance function. A key fact mentioned by several authors is that the M acceptance rule leads to more spin flips per Monte Carlo step. If the energy change of a proposed flip is favorable or neutral, the M-rule always approves the flip, whereas the G-rule rejects some proposed flips even when they lower the energy. Indeed, for all values of $T$ and $\Delta E$ the M-rule gives a higher probability of acceptance than the G-rule does. This liberal policy—if in doubt, flip—allows the M-rule to explore the space of all possible spin configurations more rapidly.
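The inequality can be checked directly. Here is a short derivation, assuming the M-rule is $p_M = \min(1, e^{-\Delta E / T})$ and the G-rule has its usual form $p_G = 1 / (1 + e^{\Delta E / T})$; the names $p_M$ and $p_G$ are introduced here only for illustration.

\[\begin{array}{ll}
\Delta E \le 0: & p_M = 1 \gt \dfrac{1}{1 + e^{\Delta E / T}} = p_G \\[1ex]
\Delta E \gt 0: & p_M = e^{-\Delta E / T} = \dfrac{1}{e^{\Delta E / T}} \gt \dfrac{1}{1 + e^{\Delta E / T}} = p_G
\end{array}\]

Both inequalities are strict for any finite $T \gt 0$, although the gap becomes tiny when $|\Delta E| / T$ is large.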

The discrete nature of the Ising model, with just five possible values of $\Delta E$, introduces a further consideration. At $\Delta E = \pm 4$ and at $\Delta E = \pm 8$, the M-rule and the G-rule don’t actually differ very much when the temperature is below the critical point (see Figure 7). The two curves diverge only at $\Delta E = 0$: The M-rule invariably flips a spin in this circumstance, whereas the G-rule does so only half the time, assigning a probability of $0.5$. This difference is important because the lattice sites where $\Delta E = 0$ are the ones that account for almost all of the spin flips at low temperature. These are the neutral sites highlighted in the right panel of Figure 11, the ones with two like and two unlike neighbors.

This line of thought leads to another hypothesis. Maybe the big difference between the Metropolis and the Glauber algorithms has to do with the handling of this single point on the acceptance curve. And there’s an obvious way to test the hypothesis: Simply change the M-rule at this one point, having it toss a coin whenever $\Delta E = 0$. The definition becomes:

\[p = \left\{\begin{array}{cl}
1 & \text { if } \quad \Delta E \lt 0 \\
\frac{1}{2} & \text { if } \quad \Delta E = 0 \\
e^{-\Delta E/T} & \text { if } \quad \Delta E>0
\end{array}\right.\]

This modified acceptance function is the M* rule offered as an option in Program 2. Watching it in action, I find that switching the Metropolis algorithm from M to M* robs it of its most distinctive traits: At high temperature the fluttering birds are banished, and at low temperature the wiggling worms are immobilized. The effects on convergence rates are also intriguing. In the Metropolis algorithm, replacing M with M* greatly diminishes convergence speed, from a standout level to just a little better than average. At the same time, in the Glauber algorithm replacing G with M* brings a considerable performance improvement; when combined with random visitation order, M* is superior not only to G but also to M.
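To see how the three rules compare at the five possible energy changes, one can tabulate them side by side. The sketch below assumes the usual form $1 / (1 + e^{\Delta E / T})$ for the G-rule; the function names are invented for this example and are not taken from Program 2.

```python
import math

def m_rule(delta_e, temperature):
    """Metropolis rule: accept every favorable or neutral flip."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)

def g_rule(delta_e, temperature):
    """Glauber rule (assumed textbook form)."""
    return 1.0 / (1.0 + math.exp(delta_e / temperature))

def m_star_rule(delta_e, temperature):
    """M* rule: like the M-rule, but toss a coin when delta_e is zero."""
    if delta_e < 0:
        return 1.0
    if delta_e == 0:
        return 0.5
    return math.exp(-delta_e / temperature)

def acceptance_table(temperature):
    """Map each possible delta_e to its (M, G, M*) flip probabilities."""
    return {de: (m_rule(de, temperature),
                 g_rule(de, temperature),
                 m_star_rule(de, temperature))
            for de in (-8, -4, 0, 4, 8)}

if __name__ == "__main__":
    print("  dE       M       G      M*")
    for de, (m, g, ms) in acceptance_table(1.5).items():
        print(f"{de:4d}  {m:6.4f}  {g:6.4f}  {ms:6.4f}")
```

At $T = 1.5$ the M and M* columns agree everywhere except the $\Delta E = 0$ row, where M* matches the G-rule value of $0.5$.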

Figure 13

I don’t know how to make sense of all these results except to suggest that both the visitation order and the acceptance function have important roles, and non-additive interactions between them may also be at work. Here’s one further puzzle. In all the experiments described above, the Glauber algorithm and its variations respond to a disturbance more slowly than Metropolis. But before dismissing Glauber as the perennial laggard, take a look at Figure 14.

Figure 14

Here we’re observing a transition from low to high temperature, the opposite of the situation discussed above. When going in this direction—from an orderly phase to a chaotic one, melting rather than freezing—both algorithms are quite zippy, but Glauber is a little faster than Metropolis. Randomness, it appears, is good for randomization. That sounds sensible enough, but I can’t explain in any detail how it comes about.

Up to this point, a deterministic visitation order has always meant the typewriter scan of the lattice—left to right and top to bottom. Of course this is not the only deterministic route through the grid. In Program 3 you can play with a few of the others.

Why should visitation order matter at all? As long as you touch every site exactly once, you might imagine that all sequences would produce the same result at the end of a macrostep. But it’s not so, and it’s not hard to see why. Whenever two sites are neighbors, the outcome of applying the Monte Carlo process can depend on which neighbor you visit first.

Consider the cruciform configuration at right. At first glance, you might assume that the dark central square will be unlikely to change its state. After all, the central square has four like-colored neighbors; if it were to flip, it would have four opposite-colored neighbors, and the energy associated with those spin-spin interactions would rise from $-4$ to $+4$. Any visitation sequence that went first to the central square would almost surely leave it unflipped. However, when the Metropolis algorithm comes tap-tap-tapping along in typewriter mode, the central cell does in fact change color, and so do all four of its neighbors. The entire structure is annihilated in a single sweep of the algorithm. (The erased pattern does leave behind a ghost—one of the diagonal neighbor sites flips from light to dark. But then that solitary witness disappears on the next sweep.)

To understand what’s going on here, just follow along as the algorithm marches from left to right and top to bottom through the lattice. When it reaches the central square of the cross, it has already visited (and flipped) the neighbors to the north and to the west. Hence the central square has two neighbors of each color, so that $\Delta E = 0$. According to the M-rule, that square must be flipped from dark to light. The remaining two dark squares are now isolated, with only light neighbors, so they too flip when their time comes.
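The annihilation of the cross can be replayed on a small lattice. The sketch below is an illustration under stated assumptions: a $7 \times 7$ wraparound grid, dark cells encoded as $-1$ and light cells as $+1$, and a temperature of $0.01$ chosen so that energy-raising flips are effectively impossible. The helper names are invented for this example.

```python
import math
import random

def metropolis_typewriter_sweep(spins, temperature, rng):
    """One left-to-right, top-to-bottom Metropolis sweep, in place."""
    n = len(spins)
    for row in range(n):
        for col in range(n):
            s = spins[row][col]
            neighbor_sum = (spins[(row - 1) % n][col] + spins[(row + 1) % n][col]
                            + spins[row][(col - 1) % n] + spins[row][(col + 1) % n])
            delta_e = 2 * s * neighbor_sum
            # M-rule: flip if favorable or neutral, else with prob exp(-dE/T)
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                spins[row][col] = -s

def cross_lattice(n=7):
    """Light (+1) lattice with a five-cell dark (-1) cross at the center."""
    spins = [[+1] * n for _ in range(n)]
    c = n // 2
    for row, col in [(c, c), (c - 1, c), (c + 1, c), (c, c - 1), (c, c + 1)]:
        spins[row][col] = -1
    return spins

def dark_sites(spins):
    """Sorted list of (row, col) positions holding a dark spin."""
    n = len(spins)
    return sorted((r, c) for r in range(n) for c in range(n) if spins[r][c] == -1)

if __name__ == "__main__":
    lattice = cross_lattice()
    rng = random.Random(0)
    print("start:        ", dark_sites(lattice))
    metropolis_typewriter_sweep(lattice, 0.01, rng)
    print("after sweep 1:", dark_sites(lattice))
    metropolis_typewriter_sweep(lattice, 0.01, rng)
    print("after sweep 2:", dark_sites(lattice))
```

In this setup the first sweep erases all five cells of the cross and leaves a single dark cell at the northwest diagonal position, which is visited before either of its dark neighbors has flipped. The second sweep removes that lone survivor.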

The underlying issue here is one of chronology—of past, present, and future. Each site has its moment in the present, when it surveys its surroundings and decides (based on the results of the survey) whether or not to change its state. But in that present moment, half of the site’s neighbors are living in the past—the typewriter algorithm has already visited them—and the other half are in the future, still waiting their turn.

A well-known alternative to the typewriter sequence might seem at first to avoid this temporal split decision. Superimposing a checkerboard pattern on the lattice creates two sublattices that do not communicate for purposes of the Ising model. Each black square has only white neighbors, and vice versa. Thus you can run through all the black sites (in any order; it really doesn’t matter), flipping spins as needed. Afterwards you turn to the white sites. These two half-scans make up one macrostep. Throughout the process, every site sees all of its neighbors in the same generation. And yet time has not been abolished. The black cells, in the first half of the sweep, see four neighboring sites that have not yet been visited. The white cells see neighbors that have already had their chance to flip. Again half the neighbors are in the past and half in the future, but they are distributed differently.

There are plenty of other deterministic sequences. You can trace successive diagonals; in Program 3 they run from southwest to northeast. There’s the ever-popular boustrophedonic order, following in the footsteps of the ox in the plowed field. More generally, if we number the sites consecutively from $1$ to $10{,}000$, any permutation of this sequence represents a valid visitation order, touching each site exactly once. There are $10{,}000!$ such permutations, a number that dwarfs even the $2^{10{,}000}$ configurations of the binary-valued lattice. The permuted choice in Program 3 selects one of those permutations at random; it is then used repeatedly for every macrostep until the program is reset. The re-permuted option is similar but selects a new permutation for each macrostep. The random selection is here for comparison with all the deterministic variations.
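Each of these orders is just a different listing of the same set of sites. A minimal sketch, with function names invented here and with one particular reading of the diagonal order assumed (anti-diagonals taken in turn, each traversed from its southwest end to its northeast end):

```python
import random

def typewriter(n):
    """Left to right within each row, rows from top to bottom."""
    return [(r, c) for r in range(n) for c in range(n)]

def boustrophedon(n):
    """Like typewriter, but every other row runs right to left."""
    order = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order

def checkerboard(n):
    """All sites of one parity first, then all sites of the other."""
    sites = typewriter(n)
    return ([s for s in sites if (s[0] + s[1]) % 2 == 0]
            + [s for s in sites if (s[0] + s[1]) % 2 == 1])

def diagonal(n):
    """Anti-diagonals in turn, each walked from southwest to northeast."""
    order = []
    for d in range(2 * n - 1):
        for r in range(min(d, n - 1), max(0, d - n + 1) - 1, -1):
            order.append((r, d - r))
    return order

def permuted(n, rng):
    """A random permutation of the sites; reuse it, or redraw each macrostep."""
    order = typewriter(n)
    rng.shuffle(order)
    return order
```

Calling `permuted` once and reusing the result corresponds to the permuted option, while calling it again before every macrostep gives the re-permuted variant. In both cases every site is still visited exactly once per macrostep.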

(There’s one final button labeled simultaneous, which I’ll explain below. If you just can’t wait, go ahead and press it, but I won’t be held responsible for what happens.)

The variations add some further novelties to the collection of curious visual effects seen in earlier simulations. The fluttering wings are back, in the diagonal as well as the typewriter sequences. Checkerboard has a different rhythm; I am reminded of a crowd of frantic commuters in the concourse of Grand Central Terminal. Boustrophedon is bidirectional: The millipede’s legs carry it both up and down or both left and right at the same time. Permuted is similar to checkerboard, but re-permuted is quite different.

The next question is whether these variant algorithms have any quantitative effect on the model’s dynamics. Figure 15 shows the response to a sudden freeze for seven visitation sequences. Five of them follow roughly the same arcing trajectory. Typewriter remains at the top of the heap, but checkerboard, diagonal, boustrophedon, and permuted are all close by, forming something like a comet tail. The random algorithm is much slower, which is to be expected given the results of earlier experiments.

Figure 15

The intriguing case is the re-permuted order, which seems to lie in the no man’s land between the random and the deterministic algorithms. Perhaps it belongs there. In earlier comparisons of the Metropolis and Glauber algorithms, I speculated that random visitation is slower to converge because many sites are passed over in each macrostep, while others are visited more than once. That’s not true of the re-permuted visitation sequence, which calls on every site exactly once, though in random order. The only difference between the permuted algorithm and the re-permuted one is that the former reuses the same permutation over and over, whereas re-permuted creates a new sequence for every macrostep. The faster convergence of the static permuted algorithm suggests there is some advantage to revisiting all the same sites in the same order, no matter what that order may be. Most likely this has something to do with sites that get switched back and forth repeatedly, on every sweep.

Now for the mysterious simultaneous visitation sequence. If you have not played with it yet in Program 3, I suggest running the following experiment. Select the typewriter sequence, press the Run button, reduce the temperature to 1.10 or 1.15, and wait until the lattice is all mauve or all indigo, with just a peppering of opposite-color dots. (If you get a persistent wide stripe instead of a clear field, raise the temperature and try again.) Now select the simultaneous visitation order. (I have deliberately slowed this version of the model by a factor of 10, to make the nature of the action clearer.) Most likely nothing much will happen for a little while, then you’ll notice tiny patches of checkerboard pattern, with all the individual cells in these patches blinking from light to dark and back again on every other cycle. Then notice that the checkerboard patches are growing. When they touch, they merge, either seamlessly if they have the same polarity or with a conspicuous suture where opposite polarities meet. Eventually the checkerboards will cover the whole screen. Furthermore, once the pattern is established, it will persist even if you raise the temperature all the way to the top, where any other algorithm would produce a roiling random stew.

This behavior is truly weird but not inexplicable. The algorithm behind it is one that I have always thought should be the best approach to a Monte Carlo Ising simulation. In fact it seems to be just about the worst.

All of the other visitation sequences are—as the term suggests they should be—sequential. They visit one site at a time, and differ only in how they decide where to go next. If you think about the Ising model as if it were a real physical process, this kind of serialization seems pretty implausible. I can’t bring myself to believe that atoms in a ferromagnet politely take turns in flipping their spins. And surely there’s no central planner of the sequence, no orchestra conductor on a podium, pointing a baton at each site when its turn comes.

Natural systems have an all-at-onceness to them. They are made up of many independent agents that are all carrying out the same kinds of activities at the same time. If we could somehow build an Ising model out of real atoms, then each cell or site would be watching the state of its four neighbors all the time, and also sensing thermal agitation in the lattice; it would decide to flip whenever circumstances favored that choice, although there might be some randomness to the timing. If we imagine a computer model of this process (yes, a model of a model), the most natural implementation would require a highly parallel machine with one processor per site.

Lacking such fancy hardware, I make do with fake parallelism. The simultaneous algorithm makes two passes through the lattice on every macrostep. On the first pass, it looks at the neighborhood of each site and decides whether or not to flip the spin, but it doesn’t actually make any changes to the lattice. Instead, it uses an auxiliary array to keep track of which spins are scheduled to flip. Then, after all sites have been surveyed in the first pass, the second pass goes through the lattice again, flipping all the spins that were designated in the first pass. The great advantage of this scheme is that it avoids the temporal oddities of working within a lattice where some spins have already been updated and others have not. In the simultaneous algorithm, all the spins make the transition from one generation to the next at the same instant.

When I first wrote a program to implement this scheme, almost 40 years ago, I didn’t really know what I was doing, and I was utterly baffled by the outcome. The information mechanics group at MIT (Ed Fredkin, Tommaso Toffoli, Norman Margolus, and Gérard Vichniac) soon came to my rescue and explained what was going on, but all these years later I still haven’t quite made my peace with it.

Once you’ve observed the “blinking checkerboard catastrophe,” it’s not hard to understand the mechanism. For a ferromagnetic Ising model, a checkerboard pattern of alternating up and down spins has the highest possible energy and should therefore be the least likely configuration of the lattice. Every single site is surrounded by four opposite-color neighbors and therefore has a strong incentive to flip. That’s just the problem. With the simultaneous update rule, every spin does flip, with the result that the new configuration is a mirror image of the previous one, with every up spin becoming a down and vice versa. When the next round begins, every site wants to flip again. (Although the pattern looks like what you might see in an antiferromagnet, a material in which spins prefer antiparallel alignment, the resemblance is deceptive. For a true antiferromagnet the checkerboard arrangement is stable; here it is maximally unstable.)
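The two-pass scheme and the blinking it produces can be reproduced in a short sketch. The code assumes the M-rule as the acceptance function and a list-of-lists lattice with wraparound edges; the function names are invented for this illustration.

```python
import math
import random

def simultaneous_macrostep(spins, temperature, rng):
    """Two-pass update: decide every flip first, then apply them all at once."""
    n = len(spins)
    # Pass 1: survey each neighborhood and record the decision, changing nothing.
    flip = [[False] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            s = spins[row][col]
            neighbor_sum = (spins[(row - 1) % n][col] + spins[(row + 1) % n][col]
                            + spins[row][(col - 1) % n] + spins[row][(col + 1) % n])
            delta_e = 2 * s * neighbor_sum
            p = 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)
            flip[row][col] = rng.random() < p
    # Pass 2: flip every spin that was scheduled in pass 1.
    for row in range(n):
        for col in range(n):
            if flip[row][col]:
                spins[row][col] = -spins[row][col]

def checkerboard_lattice(n):
    """Alternating +1 and -1 spins; n should be even for a wraparound lattice."""
    return [[1 if (r + c) % 2 == 0 else -1 for c in range(n)] for r in range(n)]

if __name__ == "__main__":
    lattice = checkerboard_lattice(6)
    before = [row[:] for row in lattice]
    simultaneous_macrostep(lattice, 1.1, random.Random(0))
    inverted = all(lattice[r][c] == -before[r][c]
                   for r in range(6) for c in range(6))
    print("every spin flipped:", inverted)
```

On a full checkerboard every site has $\Delta E = -8$, so the M-rule approves every flip with probability $1$ and the whole pattern inverts on each macrostep, whatever the temperature.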

What continues to disturb me about this phenomenon is that I still think the simultaneous update rule is in some sense more natural or realistic than many of the alternatives. It is closer to how the world works—or how I imagine that it works—than any serial ordering of updates. Yet nature does not create magnets that continually swing between states that have the highest possible energy. (A 2002 paper by Gabriel Pérez, Francisco Sastre, and Rubén Medina attempts to rehabilitate the simultaneous-update scheme, but the blinking catastrophe remains pretty catastrophic.)

This is not the only bizarre behavior to be found in the dark corners of Monte Carlo Ising models. In the Metropolis algorithm, simply setting the temperature to a very high value (say, $T = 1{,}000$) has a similar effect. Again every spin flips on every cycle, producing a display that throbs violently but otherwise remains unchanged. The explanation is laughably simple. At high temperature the Metropolis acceptance function flattens out and yields a spin-flip probability near $1$ for all sites, no matter what their neighborhood looks like. The Glauber acceptance curve also flattens out, but at a value of 0.5, which leads to a totally randomized lattice, a more plausible high-temperature outcome.

Figure 16
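The two limits are easy to write down, assuming the usual forms $p_M = \min(1, e^{-\Delta E / T})$ and $p_G = 1 / (1 + e^{\Delta E / T})$ for the two rules (the names $p_M$ and $p_G$ are used here only for illustration). For the least favorable flip, $\Delta E = +8$, at $T = 1{,}000$:

\[p_M = e^{-8 / 1000} \approx 0.992, \qquad p_G = \frac{1}{1 + e^{8 / 1000}} \approx 0.498\]

As $T \to \infty$ with $\Delta E$ held fixed, $p_M \to 1$ while $p_G \to \frac{1}{2}$.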

I have not seen this high-temperature anomaly mentioned in published works on the Metropolis algorithm, although it must have been noticed many times over the years. Perhaps it’s not mentioned because this kind of failure will never be seen in physical systems. $T = 1{,}000$ in the Ising model is roughly $440$ times the critical temperature; the corresponding temperature in iron would be over $300{,}000$ kelvins. Iron boils at about $3{,}000$ kelvins.

The curves in Figure 15 and most of the other graphs above are averages taken over hundreds of repetitions of the Monte Carlo process. The averaging operation is meant to act like sandpaper, smoothing out noise in the curves, but it can also grind away interesting features, replacing a population of diverse individuals with a single homogenized exemplar. Figure 17 shows six examples of the lumpy and jumpy trajectories recorded during single runs of the program:

Figure 17

In these squiggles, magnetization does not grow smoothly or steadily with time. Instead we see spurts of growth followed by plateaus and even episodes of retreat. One of the Metropolis runs is slower than the three Glauber examples, and indeed makes no progress toward a magnetized state. Looking at these plots, it’s tempting to explain them away by saying that the magnetization measurements exhibit high variance. That’s certainly true, but it’s not the whole story.

Figure 18 shows the distribution of times needed for a Metropolis Ising model to reach a magnetization of $0.85$ in response to a sudden shift from $T = 10$ to $T= 2$. The histogram records data from $10{,}000$ program runs, expressing convergence time in Monte Carlo macrosteps.

Figure 18

The median of this distribution is $451$ macrosteps; in other words, half of the runs concluded in $451$ steps or fewer. But the other half of the population is spread out over quite a wide range. Runs of $10$ times the median length are not great rarities, and the blip at the far right end of the $x$ axis represents the $59$ runs that had still not reached the threshold after $10{,}000$ macrosteps (where I stopped counting). This is a heavy-tailed distribution, which appears to be made up of two subpopulations. In one group, forming the sharp peak at the left, magnetization is quick and easy, but members of the other group are recalcitrant, holding out for thousands of steps. I have a hypothesis about what distinguishes those two sets. The short-lived ones are ponds; the stubborn ones that overstay their welcome are rivers.

When an Ising system cools and becomes fully magnetized, it goes from a salt-and-pepper array of tiny clusters to a monochromatic expanse of one color or the other. At some point during this process, there must be a moment when the lattice is divided into exactly two regions, one light and one dark. Figure 19 shows one possible configuration: A pond of indigo cells lies within a mauve landmass that covers the rest of the lattice. If the system is to make further progress toward full magnetization, either the pond must dry up (leaving a blank expanse of mauve), or the pond must overflow its banks, inundating the remaining land area (leaving a sea of indigo). Experiments reveal that the former outcome is overwhelmingly more likely. Why? One guess is that it’s just a matter of majority rule: Whichever patch controls the greater amount of territory will eventually prevail. But that’s not it. Even when the pond covers most of the lattice, leaving only a thin strip of shoreline at the edges, the surrounding land eventually squeezes the pond out of existence.

Figure 19

I believe the correct answer has to do with the concepts of inside and outside, connected and disconnected, open sets and closed sets, but I can’t articulate these ideas in a way that would pass mathematical muster. I want to say that the indigo pond is a bounded region, entirely enclosed by the unbounded mauve continent. But the wraparound lattice makes it hard to wrap your head around this notion. The two images in Figure 20 show exactly the same object as Figure 19, the only difference being that the origin of the coordinate system has moved, so that the center of the disk seems to lie on an edge or in a corner of the lattice. The indigo pond is still surrounded by the mauve continent, but it sure doesn’t look that way. In any case, why should boundedness determine which area survives the Monte Carlo process?

Figure 20

For me, the distinction between inside and outside began to make sense when I tried taking a more “local” view of the boundaries between regions, and the curvature of those boundaries. As noted in connection with Figure 11, boundaries are places where you can expect to find neutral lattice sites (i.e., $\Delta E = 0$), which are the only sites where a spin is likely to change orientation at low temperature. In honor of their neutrality I’m going to call these sites Swiss cells. In Figure 21 I have marked all the Swiss cells with colored dots (making them dotted Swiss!). Orange-dotted cells lie in the indigo interior of the pond, whereas green dots lie outside on the mauve landmass.

Figure 21

I’ll spare you the trouble of counting the dots in Figure 21: There are 34 orange ones inside the pond but only 30 green ones outside. That margin could be significant. Because the dotted cells are likely to change state, the greater abundance of orange dots means there are more indigo cells ready to turn mauve than mauve cells that might become indigo. If the bias continues as the system evolves, the indigo region will steadily lose area and eventually be swallowed up.

But is there any reason to think the interior of the pond will always have a surplus of neutral sites susceptible to flipping? The simplest geometry for a pond is a square or rectangle, as in Figure 22. It has four interior Swiss cells (in the corners) and no exterior Swiss cells (which would be marked with green dots if they existed). In other words, no mauve cells along the outside boundary of a square or rectangular pond have exactly two mauve and two indigo neighbors.

Figure 22

What if the shape becomes a little more complicated? Perhaps the square pond grows a protuberance on one side, and an invagination on another, as in Figure 23. Each of these modifications generates a pair of orange-dotted neutral sites inside the patch, along with a pair of green-dotted ones outside. Thus the count of inside minus outside remains unchanged at four. On first noticing this invariance I had a delicious Aha! moment. There’s a conservation law, I thought. No matter how you alter the outline of the pond, the neutral sites inside will outnumber those outside by four.

Figure 23
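Counting the dots by machine is straightforward. The sketch below tallies neutral sites inside and outside a pond on a small wraparound lattice. The $12 \times 12$ grid, the spin encoding, the particular two-cell-wide bump and notch, and all function names are assumptions made for this example; they are not the shapes drawn in the figures.

```python
MAUVE, INDIGO = +1, -1   # spin values; the color assignment is arbitrary

def count_swiss(spins, pond=INDIGO):
    """Return (inside, outside) counts of sites with exactly two like neighbors."""
    rows, cols = len(spins), len(spins[0])
    inside = outside = 0
    for r in range(rows):
        for c in range(cols):
            s = spins[r][c]
            neighbors = (spins[(r - 1) % rows][c], spins[(r + 1) % rows][c],
                         spins[r][(c - 1) % cols], spins[r][(c + 1) % cols])
            if sum(1 for x in neighbors if x == s) == 2:
                if s == pond:
                    inside += 1
                else:
                    outside += 1
    return inside, outside

def square_pond(n=12, top=3, left=3, size=6):
    """A size-by-size indigo square on an n-by-n mauve lattice."""
    spins = [[MAUVE] * n for _ in range(n)]
    for r in range(top, top + size):
        for c in range(left, left + size):
            spins[r][c] = INDIGO
    return spins

def modified_pond():
    """The default square with a bump on its top edge and a notch in its bottom edge."""
    spins = square_pond()
    spins[2][5] = spins[2][6] = INDIGO   # two-cell-wide protuberance
    spins[8][5] = spins[8][6] = MAUVE    # two-cell-wide invagination
    return spins

if __name__ == "__main__":
    print("square:  ", count_swiss(square_pond()))
    print("modified:", count_swiss(modified_pond()))
```

For these two shapes the tallies come out to 4 inside and 0 outside for the plain square, and 8 inside and 4 outside for the modified one, so the difference stays at four.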

This notion is not utterly crazy. If you walk clockwise around the boundary of the simple square pond in Figure 22, you will have made four right turns by the time you get back to your starting point. Each of those right turns (we’ll call them innie turns) creates a neutral cell in the interior of the pond, where you can place an orange dot. A clockwise circuit of the modified pond in Figure 23, with its excrescences and increscences, requires some left turns as well as right turns. Each left turn (an outie turn) produces a neutral cell in the mauve exterior region, earning a green dot. But for every outie turn added to the perimeter, you’ll have to make an additional innie turn in order to get back to your starting point. Thus, except for the four original corners, innie and outie turns are like particles and antiparticles, always created and annihilated in pairs. A closed path, no matter how convoluted, always has exactly four more innies than outies. The four-turn differential is again on exhibit in the more elaborate example of Figure 24, where the orange dots prevail 17 to 13. Indeed, I assert that there are always four more innie turns than outie turns on the perimeter of any simple (i.e., non-self-intersecting) closed curve on the square lattice. (I think this is one of those statements that is obviously true but not so simple to prove, like the claim that every simple closed curve on the plane has an inside and an outside.)

Figure 24
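The turn-counting claim is easy to test on particular paths. In the sketch below a closed clockwise walk is written as a string of unit steps, with the letters `E`, `N`, `W`, and `S` standing for compass directions; this encoding and the function name are assumptions made for the example. The code does not check that the path is simple or that it avoids U-turns, so the caller has to supply a well-formed perimeter.

```python
STEP = {"E": (1, 0), "N": (0, 1), "W": (-1, 0), "S": (0, -1)}

def count_turns(path):
    """Return (right_turns, left_turns) for a closed lattice path.

    Turns are classified by the sign of the cross product of consecutive
    steps, with the y axis pointing up, so a negative value is a right turn.
    """
    steps = [STEP[ch] for ch in path]
    if sum(dx for dx, _ in steps) != 0 or sum(dy for _, dy in steps) != 0:
        raise ValueError("path does not return to its starting point")
    right = left = 0
    for i, (dx2, dy2) in enumerate(steps):
        dx1, dy1 = steps[i - 1]          # index -1 wraps around to the last step
        cross = dx1 * dy2 - dy1 * dx2
        if cross < 0:
            right += 1
        elif cross > 0:
            left += 1
    return right, left

if __name__ == "__main__":
    print(count_turns("EESSWWNN"))       # plain 2-by-2 square
    print(count_turns("ENESESSWWWNN"))   # 3-by-2 rectangle with a bump on top
```

The plain square yields four right turns and no left turns; the bumped rectangle yields six right turns and two left turns. In both cases the difference is four.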

Unfortunately, even if the statement about counting right and left turns is true, the corresponding statement about orange and green dots is not. It holds only for a rather special subclass of lattice shapes, namely those in which the perimeter line not only has no self-intersections but never comes within one lattice spacing of itself. In effect, we exclude all closed figures that have hair on their surface or pores in their skin. Figure 25 shows an example of a shape that violates the rule. Narrow, single-lane passages create neutral sites that do not come in matched innie/outie pairs. In this case there are more green dots than orange ones, which might be taken to suggest that the indigo area will grow rather than shrink.

Figure 25

In my effort to explain why ponds always evaporate, I seem to have reached a dead end. I should have known from the outset that the effort was doomed. I can’t prove that ponds always shrink because they don’t. The system is ergodic: Any state can be reached from any other state in a finite number of steps. In particular, a single indigo cell (a very small pond) can grow to cover the whole lattice. The sequence of steps needed to make that happen is improbable, but it certainly exists.

If proof is out of reach, maybe we can at least persuade ourselves that the pond shrinks with high probability. And we have a tool for doing just that: It’s called the Monte Carlo method. Figure 26 follows the fate of a $25 \times 25$ square pond embedded in an otherwise blank lattice of $100 \times 100$ sites, evolving under Glauber dynamics at a very low temperature $(T = 0.1)$. The uppermost curve, in indigo, shows the steady evaporation of the pond, dropping from the original $625$ sites to $0$ after about $320$ macrosteps. The middle traces record the abundance of Swiss sites, orange for those inside the pond and green for those outside. Because of the low temperature, these are the only sites that have any appreciable likelihood of flipping. The black trace at the bottom is the difference between orange and green. For the most part it hovers at $+4$, never exceeding that value but seldom falling much below it, and making only one brief foray into negative territory. Statistically speaking, the system appears to vindicate the innie/outie hypothesis. The pond shrinks because there are almost always more flippable spins inside than outside.
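A scaled-down version of this experiment runs quickly even in plain Python. The sketch below is not the program that produced Figure 26: it uses a $5 \times 5$ pond on a $20 \times 20$ lattice to keep the run short, it assumes the usual form $1 / (1 + e^{\Delta E / T})$ for the Glauber acceptance probability, and its function names and seed are arbitrary choices.

```python
import math
import random

def glauber_macrostep(spins, temperature, rng):
    """n*n random-site Glauber updates on an n-by-n wraparound lattice."""
    n = len(spins)
    for _ in range(n * n):
        row, col = rng.randrange(n), rng.randrange(n)
        s = spins[row][col]
        neighbor_sum = (spins[(row - 1) % n][col] + spins[(row + 1) % n][col]
                        + spins[row][(col - 1) % n] + spins[row][(col + 1) % n])
        delta_e = 2 * s * neighbor_sum
        if rng.random() < 1.0 / (1.0 + math.exp(delta_e / temperature)):
            spins[row][col] = -s

def pond_area(spins):
    """Number of pond cells, encoded here as -1."""
    return sum(row.count(-1) for row in spins)

def evaporate(n, pond_size, temperature, max_macrosteps, seed):
    """Record the pond area after each macrostep until it reaches zero."""
    rng = random.Random(seed)
    spins = [[+1] * n for _ in range(n)]
    start = (n - pond_size) // 2
    for r in range(start, start + pond_size):
        for c in range(start, start + pond_size):
            spins[r][c] = -1
    areas = [pond_area(spins)]
    for _ in range(max_macrosteps):
        if areas[-1] == 0:
            break
        glauber_macrostep(spins, temperature, rng)
        areas.append(pond_area(spins))
    return areas

if __name__ == "__main__":
    areas = evaporate(n=20, pond_size=5, temperature=0.1, max_macrosteps=500, seed=1)
    print("macrosteps until dry:", len(areas) - 1)
    print("area history:", areas)
```

At $T = 0.1$ a flip with $\Delta E = +4$ has a probability of roughly $4 \times 10^{-18}$, so in practice the pond never spreads beyond its original bounding square; it can only erode from the corners inward.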

Figure 26

Figure 26 is based on a single run of the Monte Carlo algorithm. Figure 27 presents essentially the same data averaged over $1{,}000$ Monte Carlo runs under the same conditions—starting with a $25 \times 25$ square pond and applying Glauber dynamics at $T = 0.1$.

Figure 27

The pond’s loss of area follows a remarkably linear path, with a steady rate very close to two lattice sites per Monte Carlo macrostep. And it’s clear that virtually all of these pondlike blocks of indigo cells disappear within a little more than $300$ macrosteps, putting them in the tall peak at the short end of the lifetime distribution in Figure 18. None of them contribute to the long tail that extends out past $10{,}000$ steps.
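For anyone who wants to repeat the pond experiment outside the browser, here is a minimal Python sketch of the procedure. It is not the code behind Figures 26 and 27. The function names (`glauber_prob`, `make_pond`, `macrostep`, `pond_areas`), the encoding of indigo as $+1$ and mauve as $-1$, and the convention $\Delta E = 2 s \sum \text{neighbors}$ are all assumptions made for illustration; that convention gives $\Delta E = +4$ for a site with three friendly neighbors and one enemy.

```python
import math
import random

def glauber_prob(delta_e, temperature):
    """G-rule acceptance probability, e^(-dE/T) / (1 + e^(-dE/T))."""
    # Rewritten as 1 / (1 + e^(dE/T)); the guards avoid overflow in math.exp.
    x = delta_e / temperature
    if x > 700:
        return 0.0
    if x < -700:
        return 1.0
    return 1.0 / (1.0 + math.exp(x))

def make_pond(n, side):
    """An n x n lattice of -1 (mauve) with a centered side x side block of +1 (indigo)."""
    grid = [[-1] * n for _ in range(n)]
    lo = (n - side) // 2
    for r in range(lo, lo + side):
        for c in range(lo, lo + side):
            grid[r][c] = 1
    return grid

def macrostep(grid, temperature, rng):
    """n*n microsteps at randomly chosen sites, with wraparound neighbors."""
    n = len(grid)
    for _ in range(n * n):
        r, c = rng.randrange(n), rng.randrange(n)
        s = grid[r][c]
        neighbors = (grid[(r - 1) % n][c] + grid[(r + 1) % n][c]
                     + grid[r][(c - 1) % n] + grid[r][(c + 1) % n])
        delta_e = 2 * s * neighbors
        if rng.random() < glauber_prob(delta_e, temperature):
            grid[r][c] = -s

def pond_areas(n=100, side=25, temperature=0.1, max_steps=2000, seed=1):
    """Indigo area after each macrostep, stopping early once the pond is gone."""
    rng = random.Random(seed)
    grid = make_pond(n, side)
    areas = [side * side]
    for _ in range(max_steps):
        macrostep(grid, temperature, rng)
        area = sum(row.count(1) for row in grid)
        areas.append(area)
        if area == 0:
            break
    return areas
```

Calling `pond_areas()` with its defaults sets up the conditions described for Figure 26 (a $25 \times 25$ pond on a $100 \times 100$ lattice at $T = 0.1$). Pure Python is slow at this size, so a smaller lattice, for example `pond_areas(n=20, side=6)`, is a reasonable first test.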

So much for the quick-drying ponds. What about the long-flowing rivers?

We can convert a pond into a river. Take a square block of dark pixels and tug on one side, stretching the square into an elongated rectangle. When the moving side of the rectangle reaches the edge of the lattice, just keep on pulling. Because of the model’s wraparound boundary conditions, the lattice actually has no edge; when an object exits to the right, it immediately re-enters on the left. Thus you can keep pulling the (former) right side of the rectangle until it meets up with the (former) left side.

Figure 28

When the two sides join, everything changes. It’s not just a matter of adjusting the shape and size of the rectangle. There is no more rectangle! By definition, a rectangle has four sides and four right-angle corners. The object now occupying the lattice has only two sides and no corners. It may appear to have corners at the far left and right, but that’s an artifact of drawing the figure on a flat plane. It really lives on a torus, and the band of indigo cells is like a ring of chocolate icing that goes all the way around the doughnut. Or it’s a river—an endless river. You can walk along either bank as far as you wish, and you’ll never find a place to cross.

The topological difference between a pond and a river has dire consequences for Monte Carlo simulations of the Ising model. When the rectangle’s four corners disappeared, so did the four orange dots marking interior Swiss cells. Indeed, the river has no neutral cells at all, neither inside nor outside. At temperatures near zero, where neutral cells are the only ones that ever change state, the river becomes an all-but-eternal feature. The Monte Carlo process has no effect on it. The system is stuck in a metastable state, with no practical way to reach the lower-energy state of full magnetization.
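The corner-counting argument is easy to check by brute force. The sketch below counts neutral sites on a wraparound lattice, separating those inside the indigo region (the orange dots) from those outside (the green dots). The helper names `blank`, `fill_rect`, and `count_neutral` and the $\pm 1$ encoding are assumptions made for this illustration; they are not taken from the programs on this page.

```python
def blank(n):
    """An n x n lattice with every site set to -1 (mauve)."""
    return [[-1] * n for _ in range(n)]

def fill_rect(grid, rows, cols, spin=1):
    """Set every site in the given row and column ranges to spin."""
    for r in rows:
        for c in cols:
            grid[r][c] = spin

def count_neutral(grid):
    """Return (inside, outside) counts of neutral sites.

    A site is neutral when its four wraparound neighbors sum to zero,
    meaning two friends and two enemies. 'Inside' sites are +1 (indigo),
    'outside' sites are -1 (mauve).
    """
    n = len(grid)
    inside = outside = 0
    for r in range(n):
        for c in range(n):
            total = (grid[(r - 1) % n][c] + grid[(r + 1) % n][c]
                     + grid[r][(c - 1) % n] + grid[r][(c + 1) % n])
            if total == 0:
                if grid[r][c] == 1:
                    inside += 1
                else:
                    outside += 1
    return inside, outside

# A 25 x 25 pond: four interior corners, nothing outside.
pond = blank(100)
fill_rect(pond, range(40, 65), range(40, 65))
print(count_neutral(pond))    # (4, 0)

# The same block stretched all the way around the torus: a river.
river = blank(100)
fill_rect(river, range(40, 65), range(100))
print(count_neutral(river))   # (0, 0)

# An L-shaped pond: five convex corners inside, one concave corner outside.
ell = blank(30)
fill_rect(ell, range(10, 20), range(10, 20))
fill_rect(ell, range(10, 15), range(15, 20), spin=-1)
print(count_neutral(ell))     # (5, 1)
```

The L-shaped example shows the innie/outie pairing at work: the concave corner adds one green dot and one orange dot, leaving the inside-minus-outside difference at $+4$, while the river has no dots of either color.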

When I first noticed how a river can block magnetization, I went looking to see what others might have written about the phenomenon. I found nothing. There was lots of talk about metastability in general, but none of the sources I consulted mentioned this particular topological impediment. I began to worry that I had made some blunder in programming or had misinterpreted what I was seeing. Finally I stumbled on a 2002 paper by Spirin, Krapivsky, and Redner that reports essentially the same observations and extends the discussion to three dimensions, where the problem is even worse.

A river with perfectly straight banks looks rather unnatural, more like a canal. Perhaps adding some meanders would make a difference in the outcome? The upper part of Figure 29 shows a river with a sinusoidal bend. The curves create 46 interior neutral cells and an exactly equal number of exterior ones. These corner points serve as handholds where the Monte Carlo process can get a grip, so one might hope that by flipping some of these spins the channel will narrow and eventually close.

Figure 29

But that’s not what happens. The lower part of Figure 29 shows the same stretch of river after $1{,}000$ Monte Carlo macrosteps at $T = 0.1$. The algorithm has not amplified the sinuosity; on the contrary, it has shaved off the peaks and filled in the troughs, generally flattening the terrain. After $5{,}000$ steps the river has returned to a perfectly straight course. No neutral cells remain, so no further change can be expected in any human time frame.

The presence or absence of four corners makes all the difference between ponds and rivers. Ponds shrink because the corners create a consistent bias: Sites subject to flipping are more numerous inside than outside, which means, over the long run, that the outside steadily encroaches on the inside’s territory. That bias does not exist for rivers, where the number of interior and exterior neutral sites is equal on average. Figure 30 records the inside-minus-outside difference for the first $1{,}000$ steps of a simulation beginning with a sinusoidal river.

Figure 30

The difference hovers near zero, though with short-lived excursions both above and below; the mean value is $+0.062$.

Even at somewhat higher temperatures, any pattern that crosses from one side of the grid to the other will stubbornly resist extinction. Figure 31 shows snapshots every $1{,}000$ macrosteps in the evolution of a lattice at $T = 1.0$, which is well below the critical temperature but high enough to allow a few energetically unfavorable spin flips. The starting configuration was a sinusoidal river, but by $1{,}000$ steps it has already become a thick, lumpy band. In subsequent snapshots the ribbon grows thicker and thinner, migrates up and down—and then abruptly disappears, sometime between the $8{,}000$th and the $9{,}000$th macrostep.

Figure 31

Swiss cells, with equal numbers of friends and enemies among their neighbors, appear wherever a boundary line takes a turn. All the rest of the sites along a boundary—on both sides—have three friendly neighbors and one enemy neighbor. At a site of this kind, flipping a spin carries an energy penalty of $\Delta E = +4$. At $T = 1.0$ the probability of such an event is roughly $1/50$. In a $10{,}000$-site lattice crossed by a river there can be as many as $200$ of these three-against-one sites, so we can expect to see a few of them flip during every macrostep. Thus at $T = 1.0$ the river is not a completely static formation, as it is at temperatures closer to zero. The channel can shift or twist, grow wider or narrower. But these motions are glacially slow, not only because they depend on somewhat rare events but also because the probabilities are unbiased. At every step the river is equally likely to grow wider or narrower; on average, it goes nowhere.
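As a quick check on that estimate, plugging $\Delta E = 4$ and $T = 1$ into the G-rule, and for comparison into the Metropolis rule $e^{-\Delta E/T}$, gives

\[\frac{e^{-4}}{1 + e^{-4}} = \frac{1}{1 + e^{4}} \approx \frac{1}{55.6} \approx 0.0180, \qquad e^{-4} \approx \frac{1}{54.6} \approx 0.0183.\]

Either way the odds are a little worse than $1/50$. With as many as $200$ such sites, and assuming each is visited about once per macrostep, the expected number of these uphill flips is roughly $200 \times 0.018 \approx 3.6$ per macrostep.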

In one last gesture to support my claim that short-lived patterns are ponds and long-lived patterns are rivers, I offer Figure 32:

Figure 32

Details: The simulations were run with Glauber dynamics on a $50 \times 50$ lattice. In the red upper histogram a thousand Ising systems are launched from a square pond state and allowed to run until the initial pattern is annihilated and the lattice is approaching full magnetization. In the blue lower histogram another thousand systems begin with a sinuous river pattern and also continue until the pattern collapses and the lattice is nearly uniform. The distribution of lifetimes for the two cases is dramatically different. No pond lasts as long as $1{,}000$ macrosteps, and the median lifetime is $454$ steps. The median for rivers is more than $16{,}000$ steps, and some instances go on for more than $100{,}000$ steps, the limit where I stopped counting.

A troubling question is whether these uncrossable rivers that block full magnetization in Ising models also exist in physical ferromagnets. It seems unlikely. The rivers I describe above owe their existence to the models’ wraparound boundary conditions. The crystal lattices of real magnetic materials do not share that topology. Thus it seems that metastability may be an artifact or a mere incidental feature of the model, not something present in nature.

Statistical mechanics is generally formulated in terms of systems without boundaries. You construct a theory of $N$ particles, but it’s truly valid only in the “thermodynamic limit,” where $N \to \infty$. Under this regime the two-dimensional Ising model would be studied on a lattice extending over an infinite plane. Computer models can’t do that, and so we wind up with tricks like wraparound boundary conditions, which can be considered a hack for faking infinity.

It’s a pretty good hack. As in an infinite lattice, every site has the same local environment, with exactly four neighbors, who also have four neighbors, and so on. There are no edges or corners that require special treatment. For these reasons wraparound or periodic boundary conditions have always been the most popular choice for computational models in the sciences, going back at least as far as 1953. Still, there are glitches. If you were standing on a wraparound lattice, you could walk due north forever, but you’d keep passing your starting point again and again. If you looked into the distance, you’d see the back of your own head. For the Ising model, perhaps the most salient fact is this: On a genuinely infinite plane, every simple, finite, closed curve is a pond; no finite structure can behave like a river, transecting the entire surface so that you can’t get around it. Thus the wraparound model differs from the infinite one in ways that may well alter important conclusions.

These defects are a little worrisome. On the other hand, physical ferromagnets are also mere finite approximations to the unbounded spaces of thermodynamic theory. A single magnetic domain might have $10^{20}$ atoms, which is large compared with the $10^4$ sites in the models presented here, but it’s a long ways short of infinity. The domains have boundaries, which can have a major influence on their properties. All in all, it seems like a good idea to explore the space of possible boundary conditions, including some alternatives to the wraparound convention. Hence Program 4:

An extra row of cells around the perimeter of the lattice serves to make the boundary conditions visible in this simulation. The cells in this halo layer are not active participants in the Ising process; they serve as neighbors to the cells on the periphery of the lattice, but their own states are not updated by the Monte Carlo algorithm. To mark their special role, their up and down states are indicated by red and pink instead of indigo and mauve.

The behavior of wraparound boundaries is already familiar. If you examine the red/pink stripe along the right edge of the grid, you will see that it matches the leftmost indigo/mauve column. Similar relations determine the patterns along the other edges.

The two simplest alternatives to the wraparound scheme are static borders made up of cells that are always up or always down. You can probably guess how they will affect the outcome of the simulation. Try setting the temperature around 1.5 or 2.0, then click back and forth between all up and all down as the program runs. The border color quickly invades the interior space, encircling a pond of the opposite color and eventually squeezing it down to nothing. Switching to the opposite border color brings an immediate re-enactment of the same scene with all colors reversed. The biases are blatant.

Another idea is to assign the border cells random values, chosen independently and with equal probability. A new assignment is made after every macrostep. Randomness is akin to high temperature, so this choice of boundary condition amounts to an Ising lattice surrounded by a ring of fire. There is no bias in favor of up or down, but the stimulation from the sizzling periphery creates recurrent disturbances even at temperatures near zero, so the system never attains a stable state of full magnetization.

Before I launched this project, my leading candidate for a better boundary condition was a zero border. This choice is equivalent to an “open” or “free” boundary, or to no boundary at all—a universe that just ends in blankness. Implementing open boundaries is slightly irksome because cells on the verge of nothingness require special handling: Those along the edges have only three neighbors, and those in the corners only two. A zero boundary produces the same effect as a free boundary without altering the neighbor-counting rules. The cells of the outer ring all have a numerical value of $0$, indicated by gray. For the interior cells with numerical values of $+1$ and $-1$, the zero cells act as neighbors without actually contributing to the $\Delta E$ calculations that determine whether or not a spin flips.

The zero boundary introduces no bias favoring up or down, it doesn’t heat or cool the system, and it doesn’t tamper with the topology, which remains a simple square embedded in a flat plane. Sounds ideal, no? However, it turns out the zero boundary has a lot in common with wraparound borders. In particular, it allows persistent rivers to form, or maybe I should call them lakes. I didn’t see this coming before I tried the experiment, but it’s not hard to understand what’s happening. On the wraparound lattice, elongating a rectangle until two opposite edges meet eliminates the Swiss cells at the four corners. The same thing happens when a rectangle extends all the way across a lattice with zero borders. The corner cells, now up against the border, no longer have two friendly and two enemy neighbors; instead they have two friends, one enemy, and one cell of spin zero, for a net $\Delta E$ of $+2$.

A pleasant surprise of these experiments was the boundary type I have labeled sampled. The idea is to make the boundary match the statistics of the interior of the lattice, but without regard to the geometry of any patterns there. For each border cell $b$ we select an interior cell $s$ at random, and assign the color of $s$ to $b$. The procedure is repeated after each macrostep. The border therefore maintains the same up/down proportion as the interior lattice, and always favors the majority. If the spins are evenly split between mauve and indigo, the border region shows no bias; as soon as the balance begins to tip, however, the border shifts in the same direction, supporting and hastening the trend.

If you tend to root for the underdog, this rule is not for you—but we can turn it upside down, assigning a color opposite that of a randomly chosen interior cell. The result is interesting. Magnetization is held near $0$, but at low temperature the local correlation coefficient approaches $1$. The lattice devolves into two large blobs of no particular shape that circle the ring like wary wrestlers, then eventually reach a stable truce in which they split the territory either vertically or horizontally. This behavior has no obvious bearing on ferromagnetism, but maybe there’s an apt analogy somewhere in the social or political sciences.
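The halo idea is simple enough to sketch in a few lines of Python. What follows is an illustration under stated assumptions, not the code of Program 4: the function names `add_halo` and `delta_e`, the string labels for the boundary types (including `"inverted"` for the upside-down sampling rule), and the convention $\Delta E = 2 s \sum \text{neighbors}$ are all hypothetical. The sketch builds a static snapshot of the halo; a working simulation would refresh the random and sampled halos after every macrostep, and would keep a wraparound halo in sync with the edge cells it mirrors.

```python
import random

def add_halo(grid, kind, rng):
    """Return an (n+2) x (n+2) copy of grid surrounded by a halo ring.

    kind: 'wraparound', 'all up', 'all down', 'random', 'zero',
    'sampled', or 'inverted'. Halo corners are never consulted by a
    four-neighbor rule, so they are left at 0.
    """
    n = len(grid)
    padded = [[0] * (n + 2) for _ in range(n + 2)]
    for r in range(n):
        padded[r + 1][1:n + 1] = grid[r]

    def interior_sample():
        # Spin of one interior cell chosen uniformly at random.
        return grid[rng.randrange(n)][rng.randrange(n)]

    for i in range(n):
        if kind == "wraparound":
            top, bottom = grid[n - 1][i], grid[0][i]
            left, right = grid[i][n - 1], grid[i][0]
        elif kind == "all up":
            top = bottom = left = right = 1
        elif kind == "all down":
            top = bottom = left = right = -1
        elif kind == "random":
            top, bottom, left, right = (rng.choice((-1, 1)) for _ in range(4))
        elif kind == "zero":
            top = bottom = left = right = 0
        elif kind == "sampled":
            top, bottom, left, right = (interior_sample() for _ in range(4))
        elif kind == "inverted":
            top, bottom, left, right = (-interior_sample() for _ in range(4))
        else:
            raise ValueError(f"unknown boundary kind: {kind}")
        padded[0][i + 1], padded[n + 1][i + 1] = top, bottom
        padded[i + 1][0], padded[i + 1][n + 1] = left, right
    return padded

def delta_e(padded, r, c):
    """Energy cost of flipping the site at padded[r][c], with 1 <= r, c <= n."""
    s = padded[r][c]
    return 2 * s * (padded[r - 1][c] + padded[r + 1][c]
                    + padded[r][c - 1] + padded[r][c + 1])

# A two-row band of +1 spanning a 6 x 6 lattice from edge to edge.
band = [[-1] * 6 for _ in range(6)]
band[2] = [1] * 6
band[3] = [1] * 6
rng = random.Random(0)
# Cost of flipping the band cell at the left edge (padded row 3, column 1):
print(delta_e(add_halo(band, "wraparound", rng), 3, 1))  # 4
print(delta_e(add_halo(band, "zero", rng), 3, 1))        # 2
print(delta_e(add_halo(band, "all down", rng), 3, 1))    # 0
```

With the neighbor-counting rule unchanged, a zero in the halo simply drops out of the sum. Under this convention the end cell of a band that reaches a zero border has a positive flip cost, so it is no longer a neutral site, whereas an all-down border makes the same cell neutral and gives the Monte Carlo process a place to start eroding the band.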

The curves in Figure 33 record the response to a sudden temperature step in systems using each of six boundary conditions. The all-up and all-down boundaries converge the fastest—which is no surprise, since they put a thumb on the scale. The response of the sampled boundary is also quick, reflecting its weathervane policy of supporting the majority. The random and zero boundaries are the slowest; they follow identical trajectories, and I don’t know why. Wraparound is right in the middle of the pack. All of these results are for Glauber dynamics, but the curves for the Metropolis algorithm are very similar.

Figure 33

The menu in Program 4 has one more choice, labeled twisted. I wrote the code for this one in response to the question, “I wonder what would happen if…?” Twisted is the same as wraparound, except that one side is given a half-twist before it is mated with the opposite edge. Thus if you stand on the right edge near the top of the lattice and walk off to the right, you will re-enter on the left near the bottom. The object formed in this way is not a torus but a Klein bottle—a “nonorientable surface without boundary.” All I’m going to say about running an Ising model on this surface is that the results are not nearly as weird as I expected. See for yourself.
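For the curious, the twist amounts to a small change in the neighbor lookup. The sketch below assumes the half-twist is applied to the left/right seam while top and bottom wrap in the ordinary way; which seam Program 4 actually twists is not stated here, and the function name `twisted_neighbors` is hypothetical. Rows are numbered from $0$ at the top.

```python
def twisted_neighbors(r, c, n):
    """Coordinates of the (up, down, left, right) neighbors of site (r, c)
    on an n x n lattice glued into a Klein bottle.

    Top and bottom wrap normally. Crossing the left or right edge lands
    on the opposite edge with the row index reflected (r -> n - 1 - r).
    """
    up = ((r - 1) % n, c)
    down = ((r + 1) % n, c)
    left = (r, c - 1) if c > 0 else (n - 1 - r, n - 1)
    right = (r, c + 1) if c < n - 1 else (n - 1 - r, 0)
    return up, down, left, right

# Walking off the right edge from the top row re-enters at the bottom left.
print(twisted_neighbors(0, 9, 10)[3])   # (9, 0)
```

One property worth checking in any such scheme is that neighborhood stays mutual: if $a$ lists $b$ as a neighbor, then $b$ must list $a$. The reflection rule above satisfies that, because reflecting a row index twice returns the original row.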

I have one more toy to present for your amusement: the MCMC microscope. It was the last program I wrote, but it should have been the first.

All of the programs above produce movies with one frame per macrostep. In that high-speed, high-altitude view it can be hard to see how individual lattice sites are treated by the algorithm. The MCMC microscope provides a slo-mo close-up, showing the evolution of a Monte Carlo Ising system one microstep at a time. The algorithm proceeds from site to site (in an order determined by the visitation sequence) and either flips the spin or not (according to the acceptance function).

As the algorithm proceeds, the site currently under examination is marked by a hot-pink outline. Sites that have yet to be visited are rendered in the usual indigo or mauve; those that have already had their turn are shown in shades of gray. The Microstep button advances the algorithm to the next site (determined by the visitation sequence) and either flips the spin or leaves it as-is (according to the acceptance function). The Macrostep button performs a full sweep of the lattice and then pauses; the Run button invokes a continuing series of microsteps at a somewhat faster pace. Some adjustments to this protocol are needed for the simultaneous update option. In this mode no spins are changed during the scan of the lattice, but those that will change are marked with a small square of contrasting gray. At the end of the macrostep, all the changes are made at once.

The Dotted Swiss checkbox paints orange and green dots on neutral cells (those with equal numbers of friendly and enemy neighbors). Doodle mode allows you to draw on the lattice via mouse clicks and thereby set up a specific initial pattern.

I’ve found it illuminating to draw simple geometric figures in doodle mode, then watch as they are transformed and ultimately destroyed by the various algorithms. These experiments are particularly interesting with the Metropolis algorithm at very low temperature. Under these conditions the Monte Carlo process—despite its roots in randomness—becomes very nearly deterministic. Cells with $\Delta E \le 0$ always flip; other cells never do. (What, never? Well, hardly ever.) Thus we can speak of what happens when a program is run, rather than just describing the probability distribution of possible outcomes.

Here’s a recipe to try: Set the temperature to its lower limit, choose doodle mode, down initialization, typewriter visitation, and the M-rule acceptance function. Now draw some straight lines on the grid in four orientations: vertical, horizontal, and along both diagonals. Each line can be six or seven cells long, but don’t let them touch. Lines in three of the four orientations are immediately erased when the program runs; they disappear after the first macrostep. The one survivor is the diagonal line oriented from lower left to upper right, or southwest to northeast. With each macrostep the line migrates one cell to the left, and also loses one site at the bottom. This combination of changes gives the subjective impression that the pattern is moving not only left but also upward. I’m pretty sure that this phenomenon is responsible for the fluttering wings illusion seen at much higher temperatures (and higher animation speeds).
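The recipe can also be replayed without the microscope. The sketch below idealizes "temperature at its lower limit" as exactly zero temperature, so that the M-rule becomes fully deterministic: a spin flips if and only if $\Delta E \le 0$. That idealization, the wraparound $12 \times 12$ grid, the $\pm 1$ encoding, and the function names are assumptions made for illustration. Row $0$ is at the top, and the typewriter order scans each row left to right, updating spins in place.

```python
def draw(n, cells):
    """An n x n lattice of -1 (mauve) with the listed (row, col) cells set to +1."""
    grid = [[-1] * n for _ in range(n)]
    for r, c in cells:
        grid[r][c] = 1
    return grid

def sweep_zero_temperature(grid):
    """One typewriter-order macrostep of the M-rule in the zero-temperature limit."""
    n = len(grid)
    for r in range(n):              # top row first
        for c in range(n):          # left to right within a row
            s = grid[r][c]
            neighbors = (grid[(r - 1) % n][c] + grid[(r + 1) % n][c]
                         + grid[r][(c - 1) % n] + grid[r][(c + 1) % n])
            if 2 * s * neighbors <= 0:   # delta E <= 0: always flip
                grid[r][c] = -s

def indigo_cells(grid):
    """The set of (row, col) positions currently holding +1."""
    return {(r, c) for r, row in enumerate(grid)
            for c, s in enumerate(row) if s == 1}

# A six-cell diagonal running from lower left (8, 3) to upper right (3, 8).
grid = draw(12, [(3 + k, 8 - k) for k in range(6)])
sweep_zero_temperature(grid)
print(sorted(indigo_cells(grid)))
# [(3, 7), (4, 6), (5, 5), (6, 4), (7, 3)]
```

After one macrostep the line has moved one column to the left and lost its bottom cell, as described. Each cell of the line survives, in effect, by handing its spin to the mauve neighbor on its left, which is neutral at the moment the scan reaches it; the bottom cell has no such successor. Horizontal, vertical, and upper-left-to-lower-right lines drawn the same way are gone after a single call to `sweep_zero_temperature`.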

If you perform the same experiment with the diagonal visitation order, you’ll see exactly the same outcomes. A question I can’t answer is whether there is any pattern that serves to discriminate between the typewriter and diagonal orders. What I’m seeking is some arrangement of indigo cells on a mauve background that I could draw on the grid and then look away while you ran one algorithm or the other for some fixed number of macrosteps (which I get to specify). Afterwards, I win if I can tell which visitation sequence you chose.

The checkerboard algorithm is also worth trying with the four line orientations. The eventual outcome is the same, but the intermediate stages are quite different.

Finally I offer a few historical questions that seem hard to settle, and some philosophical musings on what it all means.

How did the method get the name “Monte Carlo”?

The name, of course, is an allusion to the famous casino, a prodigious producer and consumer of randomness. Nicholas Metropolis claimed credit for coming up with the term. In a 1987 retrospective he wrote:

It was at that time [spring of 1947] that I suggested an obvious name for the statistical method—a suggestion not unrelated to the fact that Stan [Ulam] had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo.”

An oddity of this story is that Metropolis was not at Los Alamos in 1947. He left after the war and didn’t return until 1948.

Ulam’s account of the matter does not contradict the Metropolis version, but it’s less colorful. (Marshall Rosenbluth does contradict Metropolis, writing: “The basic idea, as well as the name was due to Stan Ulam originally.” But Rosenbluth wasn’t at Los Alamos in 1947 either.) In his autobiography, written in the 1970s, Ulam mentions an uncle who is buried in Monte Carlo, but he says nothing about the uncle gambling or borrowing from relatives. (In fact it seems the uncle was the wealthiest member of the family.) Ulam’s only comment on the name reads as follows:

It seems to me that the name Monte Carlo contributed very much to the popularization of this procedure. It was named Monte Carlo because of the element of chance, the production of random numbers with which to play the suitable games.

Note the anonymous passive voice: “It was named…,” with no hint of by whom. If Ulam was so carefully noncommittal, who am I to insist on a definite answer?

As far as I know, the phrase “Monte Carlo method” first appeared in public print in 1949, in an article co-authored by Metropolis and Ulam. Presumably the term was in use earlier among the denizens of the Los Alamos laboratory. Daniel McCracken, in a 1955 Scientific American article, said it was a code word invented for security reasons. This is not implausible. Code words were definitely a thing at Los Alamos (the place itself was designated “Project Y”), but I’ve never seen the code word status of “Monte Carlo” corroborated by anyone with first-hand knowledge.

Who invented the Metropolis algorithm?

To raise the question, of course, is to hint that it was not Metropolis.

The 1953 paper that introduced Markov chain Monte Carlo, “Equation of State Calculations by Fast Computing Machines,” had five authors, who were listed in alphabetical order: Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. The two Rosenbluths were wife and husband, as were the two Tellers. Who did what in this complicated collaboration? Apparently no one thought to ask that question until 2003, when J. E. Gubernatis of Los Alamos was planning a symposium to mark the 50th anniversary of MCMC. He got in touch with Marshall Rosenbluth, who was then in poor health. Nevertheless, Rosenbluth attended the gathering, gave a talk, and sat for an interview. (He died a few months later.)

According to Rosenbluth, the basic idea behind MCMC—sampling the states of a system according to their Boltzmann weight, while following a Markov chain from one state to the next—came from Edward Teller. Augusta Teller wrote a first draft of a computer program to implement the idea. Then the Rosenbluths took over. In particular, it was Arianna Rosenbluth who wrote the program that produced all the results reported in the 1953 paper. Gubernatis adds:

Marshall’s recounting of the development of the Metropolis algorithm first of all made it very clear that Metropolis played no role in its development other than providing computer time.

In his interview, Rosenbluth was even blunter: “Metropolis was boss of the computer laboratory. We never had a single scientific discussion with him.”

These comments paint a rather unsavory portrait of Metropolis as a credit mooch. I don’t know to what extent that harsh verdict might be justified. In his own writings, Metropolis makes no overt claims about his contributions to the work. On the other hand, he also makes no disclaimers; he never suggests that someone else’s name might be more appropriately attached to the algorithm.

An interesting further question is who actually wrote the 1953 paper—who put the words together on the page. Internal textual evidence suggests there were at least two writers. Halfway through the article there’s a sudden change of tone, from gentle exposition to merciless technicality.

In recent years the algorithm has acquired the hyphenated moniker Metropolis-Hastings, acknowledging the contributions of W. Keith Hastings, a Canadian mathematician and statistician. Hastings wrote a 1970 paper that generalized the method, showing it could be applied to a wider class of problems, with probability distributions other than Boltzmann’s. Hastings is also given credit for rescuing the technique from captivity among the physicists and bringing it home to statistics, although it was another 20 years before the statistics community took much notice.

I don’t know who started the movement to name the generalized algorithm “Metropolis-Hastings.” The hyphenated term was already fairly well established by 1995, when Siddhartha Chib and Edward Greenberg put it in the title of a review article.

Who invented Glauber dynamics?

In this case there is no doubt or controversy about authorship. Glauber wrote the 1963 paper, and he did the work reported in it. On the other hand, Glauber did not invent the Monte Carlo algorithm that now goes by the name “Glauber dynamics.” His aim in tackling the Ising model was to find exact, mathematical solutions, in the tradition of Ising and Onsager. (Those two authors are the only ones cited in Glauber’s paper.) He never mentions Monte Carlo methods or any other computational schemes.

So who did devise the algorithm? The two main ingredients—the G-rule and the random visitation sequence—were already on the table in the 1950s. A form of the G-rule acceptance function $e^{-\Delta E/T} / (1 + e^{-\Delta E/T})$ was proposed in 1954 by John G. Kirkwood of Yale University, a major figure in statistical mechanics at midcentury. He suggested it to the Los Alamos group as an alternative to the M-rule. Although the suggestion was not taken, the group did acknowledge that it would produce valid simulations. The random visitation sequence was used in a followup study by the Los Alamos group in 1957. (By then the group was led by William W. Wood, who had been a student of Kirkwood.)

Those two ingredients first came together a few years later in work by P. A. Flinn and G. M. McManus, who were then at Westinghouse Research in Pittsburgh. Their 1961 paper describes a computer simulation of an Ising model with both random visitation order and the $e^{-\Delta E/T} / (1 + e^{-\Delta E/T})$ acceptance function, two years before Glauber’s article appeared. On grounds of publication priority, shouldn’t the Monte Carlo variation be named for Flinn and McManus rather than Glauber?

For a while, it was. There were dozens of references to Flinn and McManus throughout the 1960s and 70s. For example, an article by G. W. Cunningham and P. H. E. Meijer compared and evaluated the two main MCMC methods, identifying them as algorithms introduced by “Metropolis et al.” and by “Flinn and McManus.” A year later another compare-and-contrast article by John P. Valleau and Stuart G. Whittington adopted the same terminology. Neither of these articles mentions Glauber.

According to Semantic Scholar, the phrase “Glauber dynamics” first appeared in the physics literature in 1977, in an article by Ph. A. Martin. But this paper is a theoretical work, with no computational component, along the same lines as Glauber’s own investigation. Among the Semantic Scholar listings, “Glauber dynamics” was first mentioned in the context of Monte Carlo studies by A. Sadiq and Kurt Binder, in 1984. After that, the balance shifted strongly toward Glauber.

In bringing up the disappearance of Flinn and McManus from the Ising and Monte Carlo literature, I don’t mean to suggest that Glauber doesn’t deserve his recognition. His main contribution to studies of the Ising model—showing that it could give useful results away from equilibrium—is of the first importance. On the other hand, attaching his name to a Monte Carlo algorithm is unhelpful. If you turn to his 1963 paper to learn about the origin of the algorithm, you’ll be disappointed.

One more oddity. I have been writing the G-rule as

\[\frac{e^{-\Delta E/T}}{1 + e^{-\Delta E/T}},\]

which is the way it appeared in Flinn and McManus, as well as in many recent accounts of the algorithm. However, nothing resembling this expression is to be found in Glauber’s paper. Instead he defined the rule in terms of the hyperbolic tangent. Reconstructing various bits of his mathematics in a form that could serve as a Monte Carlo acceptance function, I come up with:

\[\frac{1}{2}\left(1 -\tanh \frac{\Delta E}{2 T}\right).\]

The two expressions are mathematically synonymous, but the prevalence of the first form suggests that some authors who cite Glauber rather than Flinn and McManus are not getting their notation from the paper they cite.
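The equivalence takes only a couple of lines to verify. Writing $x = \Delta E / 2T$ and using the definition $\tanh x = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$:

\[\frac{1}{2}\left(1 - \tanh x\right) = \frac{1}{2} \cdot \frac{(e^{x} + e^{-x}) - (e^{x} - e^{-x})}{e^{x} + e^{-x}} = \frac{e^{-x}}{e^{x} + e^{-x}} = \frac{e^{-2x}}{1 + e^{-2x}} = \frac{e^{-\Delta E/T}}{1 + e^{-\Delta E/T}}.\]

The next-to-last step multiplies numerator and denominator by $e^{-x}$; the last substitutes $2x = \Delta E / T$.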

Who made the first pictures of an Ising lattice?

When I first heard of the Ising model, sometime in the 1970s, I would read statements along the lines of “as the system cools to the critical temperature, fluctuations grow in scale until they span the entire lattice.” I wanted to see what that looked like. What kinds of patterns or textures would appear, and how would they evolve over time? In those days, live motion graphics were too much to ask for, but it seemed reasonable to expect at least a still image, or perhaps a series of them covering a range of temperatures.

In my reading, however, I found no pictures. Part of the reason was surely technological. Turning computations into graphics wasn’t so easy in those days. But I suspect another motive as well. A computational scientist who wanted to be taken seriously was well advised to focus on quantitative results. A graph of magnetization as a function of temperature was worth publishing; a snapshot of a single lattice configuration might seem frivolous—not real physics but a plaything like the Game of Life. Nevertheless, I still yearned to see what it would look like.

In 1979 I had an opportunity to force the issue. I was working with Kenneth G. Wilson, a physicist then at Cornell University, on a Scientific American article about “Problems in Physics with Many Scales of Length.” The problems in question included the Ising model, and I asked Wilson if he could produce pictures showing spin configurations at various temperatures. He resisted; I persisted; a few weeks later I received a fat envelope of fanfold paper, covered in arrays of $1$s and $0$s. With help from the Scientific American art department the numbers were transformed into black and white squares:

Figure 34

This particular image, one of three we published, is at the critical temperature. Wilson credited his colleagues Stephen Shenker and Jan Tobochnik for writing the program that produced it.

The lattice pictures made by Wilson, Shenker, and Tobochnik were the first I had ever seen of an Ising model at work, but they were not the first to be published. In recent weeks I’ve discovered a 1974 paper by P. A. Flinn in which black-and-white spin tableaux form the very centerpiece of the presentation. Flinn discusses aspects of the appearance of these grids that would be very hard to reduce to simple numerical facts:

Phase separation may be seen to occur by the formation and growth of clusters, but they look rather more like “seaweed” than like the roughly round clusters of traditional theory. The structures look somewhat like those observed in phase-separated glass.

I also found one even earlier instance of lattice diagrams, in a 1963 paper by J. R. Beeler, Jr., and J. A. Delaney. Are they the first?

What does the Ising model model?

Modeling calls for a curious mix of verisimilitude and fakery. A miniature steam locomotive chugging along the tracks of a model railroad reproduces in meticulous detail the pistons and linkage rods that drive the wheels of the real locomotive. But in the model it’s the wheels that impart motion to the links and pistons, not the other way around. The model’s true power source is hidden—an electric motor tucked away inside, where the boiler ought to be.

Scientific models also rely on shortcuts and simplifications. In a physics textbook you will meet the ideal gas, the frictionless pendulum, the perfectly elastic spring, the falling body that encounters no air resistance, the planet whose entire mass is concentrated at a dimensionless point. Such idealizations are not necessarily defects. By brushing aside irrelevant details, a good model allows a deeper truth to shine through. The problem, of course, is that some details are not irrelevant.

The Ising model is a fascinating case study in this process. Lenz and Ising set out to explain ferromagnetism, and almost all later discussions of the model (including the one you are reading right now) put some emphasis on that connection. The original aim was to find the simplest framework that would exhibit important properties of real ferromagnets, most notably the sudden onset of magnetization at the Curie temperature. As far as I can tell, the Ising model has failed in this respect. Some of the omitted details were of the essence; quantum mechanics just won’t go away, no matter how much we might like it to. These days, serious students of magnetism seem to have little interest in simple grids of flipping spins. A 2006 review of “modeling, analysis, and numerics of ferromagnetism,” by Martin Kružík and Andreas Prohl, doesn’t even mention the Ising model.

Yet the model remains wildly popular, the subject of hundreds of papers every year. Way back in 1967, Stephen G. Brush wrote that the Ising model had become “the preferred basic theory of all cooperative phenomena.” I’d go even further. I think it’s fair to say the Ising model has become an object of study for its own sake. The quest is to understand the phase diagram of the Ising system itself, whether or not it tells us anything about magnets or other physical phenomena.

Uprooting the Ising system from its ancestral home in physics leaves us with a model that is not a model of anything. It’s like a map of an imaginary territory; there is no ground truth. You can’t check the model’s accuracy by comparing its predictions with the results of experiments.

Seeing the Ising model as a free-floating abstraction, untethered from the material world, is a prospect I find exhilarating. We get to make our own universe—and we’ll do it right this time, won’t we! However, losing touch with physics is also unsettling. On what basis are we to choose between versions of the model, if not through fidelity to nature? Are we to be guided only by taste or convenience? A frequent argument in support of Glauber dynamics is that it seems more “natural” than the Metropolis algorithm. I would go along with that judgment: The random visitation sequence and the smooth, symmetrical curve of the G-rule both seem more like something found in nature than the corresponding Metropolis apparatus. But does naturalness matter if the model is solely a product of human imagination?

Foldable Words

Brian Hayes — Tue, 09 Feb 2021 20:15:47 +0000

Packing up the household for a recent move, I was delving into shoeboxes, photo albums, and file folders that had not been opened in decades. One of my discoveries, found in an envelope at the back of a file drawer, was the paper sleeve from a drinking straw, imprinted with a saccharine message:

This flimsy slip of paper seems like an odd scrap to preserve for the ages, but when I pulled it out of the envelope, I knew instantly where it came from and why I had saved it.

The year was 1967. I was 17 then; I’m 71 now. Transposing those two digits takes just a flick of the fingertips. I can blithely skip back and forth from one prime number to the other. But the span of lived time between 1967 and 2021 is a chasm I cannot so easily leap across. At 17 I was in a great hurry to grow up, but I couldn’t see as far as 71; I didn’t even try. Going the other way—revisiting the mental and emotional life of an adolescent boy—is also a journey deep into alien territory. But the straw wrapper helps—it’s a Proustian aide memoire.

In the spring of 1967 I had a girlfriend, Lynn. After school we would meet at the Maple Diner, where the booths had red leatherette upholstery and formica tabletops with a boomerang motif. We’d order two Cokes and a plate of french fries to share. The waitress liked us; she’d make sure we had a full bottle of ketchup. I mention the ketchup because it was a token of our progress toward intimacy. On our first dates Lynn had put only a dainty dab on her fries, but by April we were comfortable enough to reveal our true appetites.

One afternoon I noticed she was fiddling intently with the wrapper from her straw, folding and refolding. I had no idea what she was up to. A teeny paper airplane she would sail over my head? When she finished, she pushed her creation across the table:

What a wallop there was in that little wad of paper. At that point in our romance, the words had not yet been spoken aloud.

How did I respond to Lynn’s folded declaration? I can’t remember; the words are lost. But evidently I got through that awkward moment without doing any permanent damage. A year later Lynn and I were married.

Today, at 71, with the preserved artifact in front of me, my chief regret is that I failed to take up the challenge implicit in the word game Lynn had invented. Why didn’t I craft a reply by folding my own straw wrapper? There are quite a few messages I could have extracted by strategic deletions from “It’s a pleasure to serve you.”

          itsapleasuretoserveyou   ==>   I love you.

          itsapleasuretoserveyou   ==>   I please you.

          itsapleasuretoserveyou   ==>   I tease you.

          itsapleasuretoserveyou   ==>   I pleasure you.

          itsapleasuretoserveyou   ==>   I pester you.

          itsapleasuretoserveyou   ==>   I peeve you.

          itsapleasuretoserveyou   ==>   I salute you.

          itsapleasuretoserveyou   ==>   I leave you.

Not all of those statements would have been suited to the occasion of our rendezvous at the Maple Diner, but over the course of our years together—17 years, as it turned out—there came a moment for each of them.

How many words can we form by making folds in the straw-paper slogan? I could not have answered that question in 1967. I couldn’t have even asked it. But times change. Enumerating all the foldable messages now strikes me as an obvious thing to do when presented with the straw wrapper. Furthermore, I have the computational means to do it—although the project was not quite as easy as I expected.

A first step is to be explicit about the rules of the game. We are given a source text, in this case “It’s a pleasure to serve you.” Let us ignore the spaces between words as well as all punctuation and capitalization; in this way we arrive at the normalized text “itsapleasuretoserveyou”. A word is foldable if all of its letters appear in the normalized text in the correct order (though not necessarily consecutively). The folding operation amounts to an editing process in which our only permitted act is deletion of letters; we are not allowed to insert, substitute, or permute. If two or more foldable words are to be combined to make a phrase or sentence, they must follow one another in the correct order without overlaps.
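The procedures shown later call a helper named normalize, which the text does not list. A minimal sketch of what it might look like, assuming only the rules just stated (drop spaces, punctuation, and capitalization); the body is an assumption, not the author's code:

```python
def normalize(text):
    """Lowercase the text and keep only the letters a through z.

    Hypothetical helper: the article uses normalize() but does not show it.
    """
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


print(normalize("It's a pleasure to serve you."))
```

Filtering on the range "a" to "z" also discards apostrophes, whether straight or curly, so the slogan reduces to 22 letters.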

So much for foldability. Next comes the fraught question: What is a word? Linguists and lexicographers offer many subtly divergent opinions on this point, but for present purposes a very simple definition will suffice: A finite sequence of characters drawn from the 26-letter English alphabet is a word if it can legally be played in a game of Scrabble. I have been working with a word list from the 2015 edition of Collins Scrabble Words, which has about 270,000 entries. (There are a number of alternative lists, which I discuss in an appendix at the end of this article.)

Scrabble words range in length from 2 to 15 letters. The upper limit—determined by the size of the game board—is not much of a concern. You’re unlikely to meet a straw-paper text that folds to yield words longer than sesquipedalian. The absence of 1-letter words is more troubling, but the remedy is easy: I simply added the words a, I, and O to my copy of the Scrabble list.

My first computational experiments with foldable words searched for examples at random. Writing a program for random sampling is often easier than taking an exact census of a population, and the sample offers a quick glimpse of typical results. The following Python procedure generates random foldable sequences of letters drawn from a given source text, then returns those sequences that are found in the Scrabble word list. (The parameter k is the length of the words to be generated, and reps specifies the number of random trials.)

import random

def randomFoldableWords(text, lexicon, k, reps):
    normtext = normalize(text)
    n = len(normtext)
    findings = []
    for i in range(reps):
        # choose k distinct positions, then restore left-to-right order
        indices = random.sample(range(n), k)
        indices.sort()
        letters = ""
        for idx in indices:
            letters += normtext[idx]
        if letters in lexicon:       # keep only genuine words
            findings.append(letters)
    return findings

Here are the six-letter foldable words found by invoking the program as follows: randomFoldableWords(text, scrabblewords, 6, 10000), where text is the straw-wrapper slogan.

please, plater, searer, saeter, parter, sleety, sleeve, parser, purvey, laster, islets, taster, tester, slarts, paseos, tapers, saeter, eatery, salute, tsetse, setose, salues, sparer

Note that the word saeter (you could look it up—I had to) appears twice in this list. The frequency of such repetitions can yield an estimate of the total population size. A variant of the mark-and-recapture method, well-known in wildlife ecology, led me to an estimate of 92 six-letter foldable Scrabble words in the straw-wrapper slogan. The actual number turns out to be 106.
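The text does not say which variant of mark-and-recapture produced the estimate, so the following is only the textbook two-sample form (the Lincoln-Petersen estimator), offered as a sketch of the general idea. The function name and the toy samples are illustrative assumptions.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two independent samples.

    Textbook estimator: N is roughly n1 * n2 / m, where n1 and n2 are the
    numbers of distinct items in each sample and m is the number seen in both.
    Returns None when the samples share nothing (no estimate is possible).
    """
    marked = set(first_sample)
    caught = set(second_sample)
    overlap = len(marked & caught)
    if overlap == 0:
        return None
    return len(marked) * len(caught) / overlap


# Toy data, not drawn from the Scrabble list.
first = ["please", "sleeve", "tester", "salute"]
second = ["tester", "salute", "eatery", "purvey", "tsetse", "parser"]
print(lincoln_petersen(first, second))
```

The estimator assumes every member of the population is equally likely to be caught. That assumption is shaky here, because a word that can be folded in several ways turns up more often in random sampling, which would tend to bias the estimate low.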

Samples and estimates are helpful, but they leave me wondering, What am I missing? What strange and beautiful word has failed to turn up in any of the samples, like the big fish that never takes the bait? I had to have an exhaustive list.

In many word games, the tool of choice for computer-aided playing (or cheating) is the regular expression, or regex. A regex is a pattern defining a set of strings, or character sequences; from a collection of strings, a regex search will pick out those that match the pattern. For example, the regular expression ^.*love.*$ selects from the Scrabble word list all words that have the letter sequence love somewhere within them. There are 137 such words, including some that I would not have thought of, such as rollover and slovenly. The regex ^.*l.*o.*v.*e.*$ finds all words in which l, o, v, and e appear in sequence, whether or not they are adjacent. The set has 267 members, including such secret-lover gems as bloviate, electropositive, and leftovers.
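For readers who want to try the second pattern, here is a small sketch using Python's standard re module. The helper name subsequence_pattern and the six-word sample list are assumptions made for illustration; the real search would run over the full Scrabble list.

```python
import re


def subsequence_pattern(letters):
    """Build a regex like ^.*l.*o.*v.*e.*$ from the string 'love'."""
    return re.compile("^.*" + ".*".join(re.escape(ch) for ch in letters) + ".*$")


# A tiny stand-in for the word list.
sample_words = ["bloviate", "leftovers", "rollover", "slovenly", "violet", "evolve"]
pattern = subsequence_pattern("love")
matches = [w for w in sample_words if pattern.match(w)]
print(matches)

# Run the other way round, the same pattern tests whether a word can be
# folded out of a text: build the pattern from the word, match it on the text.
normtext = "itsapleasuretoserveyou"
print(bool(subsequence_pattern("taste").match(normtext)))
print(bool(subsequence_pattern("tastes").match(normtext)))
```

In this sample, violet and evolve are rejected because no o follows the l in either word.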

A solution to the foldable words problem could surely be crafted with regular expressions, but I am not a regex wizard. In search of a more muggles-friendly strategy, my first thought was to extend the idea behind the random-sampling procedure. Instead of selecting foldable sequences at random, I’d generate all of them, and check each one against the word list.

The procedure below generates all three-letter strings that can be folded from the given text, and returns the subset of those strings that appear in the Scrabble word list:

def foldableStrings3(lexicon, text):
    normtext = normalize(text)
    n = len(normtext)
    words = []
    for i in range(0, n-2):
        for j in range(i+1, n-1):
            for k in range(j+1, n):
                s = normtext[i] + normtext[j] + normtext[k]
                if s in lexicon:
                    words.append(s)
    return(words)

At the heart of the procedure are three nested loops that methodically step through all the foldable combinations: For any initial letter text[i] we can choose any following letter text[j] with j > i; likewise text[j] can be followed by any text[k] with k > j. This scheme works perfectly well, finding 348 instances of three-letter words. I speak of “instances” because some words appear in the list more than once; for example, pee can be formed in three ways. If we count only unique words, there are 137.

Following this model, we could write a separate routine for each word length from 1 to 15 letters, but that looks like a dreary and repetitious task. Nobody wants to write a procedure with loops nested 15 deep. An alternative is to write a meta-procedure, which would generate the appropriate procedure for each word length. I made a start on that exercise in advanced loopology, but before I got very far I realized there’s an easier way. I was wondering: In a text of n letters, how many foldable strings exist, whether or not they are recognizable words? There are several ways of answering this question, but to me the most illuminating argument is a simple include-or-exclude count. Consider the first letter of the text, which in our case is the letter I. In the set of all foldable strings, half include this letter and half exclude it. The same is true of the second letter, and the third, and so on. Thus each letter added to the text doubles the number of foldable strings, which means the total number of strings is simply $2^n$. (Included in this count is the empty string, made up of no letters. The count treats each choice of letter positions separately, so a string that can be formed in more than one way is counted once for each way.)
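As an aside, the nested loops can be generalized to any depth without a meta-procedure by leaning on itertools.combinations from Python's standard library. The sketch below is not the route the article takes; it is offered only as a way to check the $2^n$ count on a short text. The function name foldable_strings and the seven-letter test text are assumptions.

```python
from itertools import combinations


def foldable_strings(normtext, k):
    """All k-letter strings obtainable by deleting letters from normtext.

    combinations() yields letters in their original left-to-right order,
    one tuple per choice of k positions, so a string that can be formed
    in several ways appears several times.
    """
    return ["".join(combo) for combo in combinations(normtext, k)]


text = "itsaple"  # a short prefix of the normalized slogan
total = sum(len(foldable_strings(text, k)) for k in range(len(text) + 1))
print(total, 2 ** len(text))
```

Summing over every length from 0 (the empty string) to n gives the same total as the doubling argument.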

This observation suggests a simple algorithm for generating all the foldable strings in any n-letter text. Just count from $0$ to $2^{n} - 1$, and for each value along the way line up the binary representation of the number with the letters of the text. Then select those letters that correspond to a 1 bit, like so:

                    itsapleasuretoserveyou
                    0000100000110011111000

And so we see that the word preserve corresponds to the binary representation of the number 134392.

Counting is something that computers are good at, so a word-search procedure based on this principle is straightforward:

def foldablesByCounting(lexicon, text):
    normtext = normalize(text)
    n = len(normtext)
    words = []
    for i in range(2**n):               # i runs from 0 through 2**n - 1
        charSeq = ''
        positions = positionsOf1Bits(i, n)   # helper (not shown): indices of the 1 bits
        for p in positions:
            charSeq += normtext[p]
        if charSeq in lexicon:
            words.append(charSeq)
    return(words)

The outer loop (variable i) counts from $0$ to $2^{n} - 1$; for each of these numbers the inner loop (variable p) picks out the letters corresponding to 1 bits. The program produces the output expected. Unfortunately, it does so very slowly. For every character added to the text, running time roughly doubles. I haven’t the patience to plod through the $2^{22}$ patterns in “itsapleasuretoserveyou”; estimates based on shorter phrases suggest the running time would be more than three hours.
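The helper positionsOf1Bits is called in the procedure but never listed. One plausible version, assuming that position 0 lines up with the leftmost bit as in the diagram of preserve, might look like this (the body is a guess at the intent, not the author's code):

```python
def positionsOf1Bits(i, n):
    """Return the positions (0 = leftmost) of the 1 bits in the n-bit form of i."""
    bits = format(i, "0{}b".format(n))  # zero-padded binary string of width n
    return [p for p, bit in enumerate(bits) if bit == "1"]


normtext = "itsapleasuretoserveyou"
positions = positionsOf1Bits(134392, len(normtext))
print(positions)
print("".join(normtext[p] for p in positions))
```

Padding the binary string to n digits is what keeps the bits aligned with the letters when the number is small.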

In the middle of the night I realized my approach to this problem was totally backwards. Instead of blindly generating all possible character strings and filtering out the few genuine words, I could march through the list of Scrabble words and test each of them to see if it’s foldable. At worst I would have to try some 270,000 words. I could speed things up even more by making a preliminary pass through the Scrabble list, discarding all words that include characters not present in the normalized text. For the text “It’s a pleasure to serve you,” the character set has just 12 members: aeiloprstuvy. Allowing only words formed from these letters slashes the Scrabble list down to a length of 12,816.
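The preliminary pass can be sketched with Python sets. The function name restrict_lexicon and the five-word sample are assumptions for illustration; the real input would be the full Scrabble list.

```python
def restrict_lexicon(lexicon, normtext):
    """Keep only the words whose letters all occur somewhere in normtext."""
    allowed = set(normtext)
    return [word for word in lexicon if set(word) <= allowed]


normtext = "itsapleasuretoserveyou"
print("".join(sorted(set(normtext))))

sample = ["taste", "tastes", "love", "hello", "zebra"]
print(restrict_lexicon(sample, normtext))
```

The filter looks only at which letters are present, not at their order or count, so a word such as tastes survives this pass even though it is not foldable. The letter-by-letter test is still needed afterward.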

To make this algorithm work, we need a procedure to report whether or not a word can be formed by folding the given text. The simplest approach is to slide the candidate word along the text, looking for a match for each character in turn:

                    taste
                    itsapleasuretoserveyou

                     taste
                    itsapleasuretoserveyou

                     t aste
                    itsapleasuretoserveyou

                     t a    ste
                    itsapleasuretoserveyou

                     t a    s   te
                    itsapleasuretoserveyou

                     t a    s   t  e
                    itsapleasuretoserveyou

If every letter of the word finds a mate in the text, the word is foldable, as in the case of taste, shown above. But an attempt to match tastes would fall off the end of the text looking for a second s, which does not exist.

The following code implements this idea:

def wordIsFoldable(word, text):
    normtext = normalize(text)
    t = 0                      # pointer to positions in normtext
    w = 0                      # pointer to positions in word
    while t < len(normtext):
        if word[w] == normtext[t]:  # matching chars in word and text
            w += 1                  # move to next char in word
        if w == len(word):          # matched all chars in word
            return(True)            # so: thumbs up
        t += 1                 # move to next char in text
    return(False)              # fell off the end: thumbs down

All we need to do now is embed this procedure in a loop that steps through all the candidate Scrabble words, collecting those for which wordIsFoldable returns True.
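A sketch of that loop follows. To keep the example self-contained it uses is_foldable, a compact stand-in for wordIsFoldable built on a Python iterator rather than explicit pointers; it takes already-normalized text. The names and the sample lexicon are assumptions.

```python
def is_foldable(word, normtext):
    """True if the letters of word occur in normtext in the same order."""
    remaining = iter(normtext)
    # `ch in remaining` consumes the iterator up to the first match, so each
    # later letter is sought only to the right of the previous match.
    return all(ch in remaining for ch in word)


def foldable_words(lexicon, normtext):
    """Collect the words in lexicon that can be folded from normtext."""
    return [word for word in lexicon if is_foldable(word, normtext)]


sample_lexicon = ["taste", "tastes", "preserve", "love", "tapestry", "pleasures"]
print(foldable_words(sample_lexicon, "itsapleasuretoserveyou"))
```

Normalizing the text once, outside the loop, avoids repeating that work for every candidate word.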

There’s still some waste motion here, since we are searching letter-by-letter through the same text, and repeating the same searches thousands of times. The source code (available on GitHub as a Jupyter notebook) explains some further speedups. But even the simple version shown here runs in less than two tenths of a second, so there’s not much point in optimizing.

I can now report that there are 778 unique foldable Scrabble words in “It’s a pleasure to serve you” (including the three one-letter words I added to the list). Words that can be formed in multiple ways bring the total count to 899.

And so we come to the tah-dah! moment—the unveiling of the complete list. I have organized the words into groups based on each word’s starting position within the text. (By Python convention, the positions are numbered from 0 through $n-1$.) Within each group, the words are sorted according to the position of their last character; that position is given in the subscript following the word. For example, tapestry is in Group 1 because it begins at position 1 in the text (the t in It’s), and it carries the subscript 19 because it ends at position 19 (the y in you).

This arrangement of the words is meant to aid in constructing multiword phrases. If a word ends at position $m$, the next word in the phrase must come from a group numbered $m+1$ or greater.
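The group and subscript for a word can be computed along the following lines. This is a sketch under an assumption inferred from the listing rather than stated in the text: for each position where the word's first letter occurs, the remaining letters are matched as early as possible, and the pair (start, end) is recorded if the match succeeds. The function name placements is hypothetical.

```python
def placements(word, normtext):
    """List (start, end) pairs: each feasible start with its earliest end."""
    found = []
    for start, ch in enumerate(normtext):
        if ch != word[0]:
            continue
        pos = start
        complete = True
        for letter in word[1:]:
            pos = normtext.find(letter, pos + 1)  # next occurrence to the right
            if pos == -1:
                complete = False
                break
        if complete:
            found.append((start, pos))
    return found


normtext = "itsapleasuretoserveyou"
for w in ["tapestry", "see", "i", "love", "you"]:
    print(w, placements(w, normtext))
```

Under this reading, a phrase such as "I love you" is valid because each word's start position exceeds the end position of the word before it.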

Group 0: i₀ it₁ is₂ its₂ ita₃ isle₆ ilea₇ isles₈ itas₈ ire₁₁ issue₁₁ iure₁₁ islet₁₂ io₁₃ iso₁₃ ileus₁₄ ios₁₄ ires₁₄ islets₁₄ isos₁₄ issues₁₄ issuer₁₆ ivy₁₉

Group 1: ta₃ tap₄ tae₆ tale₆ tape₆ te₆ tala₇ talea₇ tapa₇ tea₇ taes₈ talas₈ tales₈ tapas₈ tapes₈ taps₈ tas₈ teas₈ tes₈ tapu₉ tau₉ talar₁₀ taler₁₀ taper₁₀ tar₁₀ tear₁₀ tsar₁₀ taleae₁₁ tare₁₁ tease₁₁ tee₁₁ tapet₁₂ tart₁₂ tat₁₂ taut₁₂ teat₁₂ test₁₂ tet₁₂ tret₁₂ tut₁₂ tao₁₃ taro₁₃ to₁₃ talars₁₄ talers₁₄ talus₁₄ taos₁₄ tapers₁₄ tapets₁₄ tapus₁₄ tares₁₄ taros₁₄ tars₁₄ tarts₁₄ tass₁₄ tats₁₄ taus₁₄ tauts₁₄ tears₁₄ teases₁₄ teats₁₄ tees₁₄ teres₁₄ terts₁₄ tests₁₄ tets₁₄ tres₁₄ trets₁₄ tsars₁₄ tuts₁₄ tasse₁₅ taste₁₅ tate₁₅ terete₁₅ terse₁₅ teste₁₅ tete₁₅ toe₁₅ tose₁₅ tree₁₅ tsetse₁₅ taperer₁₆ tapster₁₆ tarter₁₆ taser₁₆ taster₁₆ tater₁₆ tauter₁₆ tearer₁₆ teaser₁₆ teer₁₆ teeter₁₆ terser₁₆ tester₁₆ tor₁₆ tutor₁₆ tav₁₇ tarre₁₈ testee₁₈ tore₁₈ trove₁₈ tutee₁₈ tapestry₁₉ tapstry₁₉ tarry₁₉ tarty₁₉ tasty₁₉ tay₁₉ teary₁₉ terry₁₉ testy₁₉ toey₁₉ tory₁₉ toy₁₉ trey₁₉ troy₁₉ try₁₉ too₂₀ toro₂₀ toyo₂₀ tatou₂₁ tatu₂₁ tutu₂₁

Group 2: sap₄ sal₅ sae₆ sale₆ sea₇ spa₇ sales₈ sals₈ saps₈ seas₈ spas₈ sau₉ sar₁₀ sear₁₀ ser₁₀ slur₁₀ spar₁₀ spear₁₀ spur₁₀ sur₁₀ salse₁₁ salue₁₁ seare₁₁ sease₁₁ seasure₁₁ see₁₁ sere₁₁ sese₁₁ slae₁₁ slee₁₁ slue₁₁ spae₁₁ spare₁₁ spue₁₁ sue₁₁ sure₁₁ salet₁₂ salt₁₂ sat₁₂ saut₁₂ seat₁₂ set₁₂ slart₁₂ slat₁₂ sleet₁₂ slut₁₂ spart₁₂ spat₁₂ speat₁₂ spet₁₂ splat₁₂ spurt₁₂ st₁₂ suet₁₂ salto₁₃ so₁₃ salets₁₄ salses₁₄ saltos₁₄ salts₁₄ salues₁₄ sapless₁₄ saros₁₄ sars₁₄ sass₁₄ sauts₁₄ sears₁₄ seases₁₄ seasures₁₄ seats₁₄ sees₁₄ seres₁₄ sers₁₄ sess₁₄ sets₁₄ slaes₁₄ slarts₁₄ slats₁₄ sleets₁₄ slues₁₄ slurs₁₄ sluts₁₄ sos₁₄ spaes₁₄ spares₁₄ spars₁₄ sparts₁₄ spats₁₄ spears₁₄ speats₁₄ speos₁₄ spets₁₄ splats₁₄ spues₁₄ spurs₁₄ spurts₁₄ sues₁₄ suets₁₄ sures₁₄ sus₁₄ salute₁₅ saree₁₅ sasse₁₅ sate₁₅ saute₁₅ setose₁₅ slate₁₅ sloe₁₅ sluse₁₅ sparse₁₅ spate₁₅ sperse₁₅ spree₁₅ saeter₁₆ salter₁₆ saluter₁₆ sapor₁₆ sartor₁₆ saser₁₆ searer₁₆ seater₁₆ seer₁₆ serer₁₆ serr₁₆ slater₁₆ sleer₁₆ spaer₁₆ sparer₁₆ sparser₁₆ spearer₁₆ speer₁₆ spuer₁₆ spurter₁₆ suer₁₆ surer₁₆ sutor₁₆ sav₁₇ sov₁₇ salve₁₈ save₁₈ serre₁₈ serve₁₈ slave₁₈ sleave₁₈ sleeve₁₈ slove₁₈ sore₁₈ sparre₁₈ sperre₁₈ splore₁₈ spore₁₈ stere₁₈ sterve₁₈ store₁₈ stove₁₈ salary₁₉ salty₁₉ sassy₁₉ saury₁₉ savey₁₉ say₁₉ serry₁₉ sesey₁₉ sey₁₉ slatey₁₉ slaty₁₉ slavey₁₉ slay₁₉ sleety₁₉ sley₁₉ slurry₁₉ sly₁₉ soy₁₉ sparry₁₉ spay₁₉ speary₁₉ splay₁₉ spry₁₉ spurrey₁₉ spurry₁₉ spy₁₉ stey₁₉ storey₁₉ story₁₉ sty₁₉ suety₁₉ surety₁₉ surrey₁₉ survey₁₉ salvo₂₀ servo₂₀ stereo₂₀ sou₂₁ susu₂₁

Group 3: a₃ al₅ ae₆ ale₆ ape₆ aa₇ ala₇ aas₈ alas₈ ales₈ als₈ apes₈ as₈ alu₉ alar₁₀ aper₁₀ ar₁₀ alae₁₁ alee₁₁ alure₁₁ apse₁₁ are₁₁ aue₁₁ alert₁₂ alt₁₂ apart₁₂ apert₁₂ apt₁₂ aret₁₂ art₁₂ at₁₂ aero₁₃ also₁₃ alto₁₃ apo₁₃ apso₁₃ auto₁₃ aeros₁₄ alerts₁₄ altos₁₄ alts₁₄ alures₁₄ alus₁₄ apers₁₄ apos₁₄ apres₁₄ apses₁₄ apsos₁₄ apts₁₄ ares₁₄ arets₁₄ ars₁₄ arts₁₄ ass₁₄ ats₁₄ aures₁₄ autos₁₄ alate₁₅ aloe₁₅ arete₁₅ arose₁₅ arse₁₅ ate₁₅ alastor₁₆ alerter₁₆ alter₁₆ apter₁₆ aster₁₆ arere₁₈ ave₁₈ aery₁₉ alary₁₉ alay₁₉ aleatory₁₉ apay₁₉ apery₁₉ arsey₁₉ arsy₁₉ artery₁₉ artsy₁₉ arty₁₉ ary₁₉ ay₁₉ aloo₂₀ arvo₂₀ avo₂₀ ayu₂₁

Group 4: pe₆ pa₇ pea₇ plea₇ pas₈ peas₈ pes₈ pleas₈ plu₉ par₁₀ pear₁₀ per₁₀ pur₁₀ pare₁₁ pase₁₁ peare₁₁ pease₁₁ pee₁₁ pere₁₁ please₁₁ pleasure₁₁ plue₁₁ pre₁₁ pure₁₁ part₁₂ past₁₂ pat₁₂ peart₁₂ peat₁₂ pert₁₂ pest₁₂ pet₁₂ plast₁₂ plat₁₂ pleat₁₂ pst₁₂ put₁₂ pareo₁₃ paseo₁₃ peso₁₃ pesto₁₃ po₁₃ pro₁₃ pareos₁₄ pares₁₄ pars₁₄ parts₁₄ paseos₁₄ pases₁₄ pass₁₄ pasts₁₄ pats₁₄ peares₁₄ pears₁₄ peases₁₄ peats₁₄ pees₁₄ peres₁₄ perts₁₄ pesos₁₄ pestos₁₄ pests₁₄ pets₁₄ plats₁₄ pleases₁₄ pleasures₁₄ pleats₁₄ plues₁₄ plus₁₄ pos₁₄ pros₁₄ pures₁₄ purs₁₄ pus₁₄ puts₁₄ parse₁₅ passe₁₅ paste₁₅ pate₁₅ pause₁₅ perse₁₅ plaste₁₅ plate₁₅ pose₁₅ pree₁₅ prese₁₅ prose₁₅ puree₁₅ purse₁₅ parer₁₆ parr₁₆ parser₁₆ parter₁₆ passer₁₆ paster₁₆ pastor₁₆ pater₁₆ pauser₁₆ pearter₁₆ peer₁₆ perter₁₆ pester₁₆ peter₁₆ plaster₁₆ plater₁₆ pleaser₁₆ pleasurer₁₆ pleater₁₆ poser₁₆ pretor₁₆ proser₁₆ puer₁₆ purer₁₆ purr₁₆ purser₁₆ parev₁₇ pav₁₇ perv₁₇ pareve₁₈ parore₁₈ parve₁₈ passee₁₈ pave₁₈ peeve₁₈ perve₁₈ petre₁₈ pore₁₈ preeve₁₈ preserve₁₈ preve₁₈ prore₁₈ prove₁₈ parry₁₉ party₁₉ pastry₁₉ pasty₁₉ patsy₁₉ paty₁₉ pay₁₉ peatery₁₉ peaty₁₉ peavey₁₉ peavy₁₉ peeoy₁₉ peery₁₉ perry₁₉ pervy₁₉ pesty₁₉ plastery₁₉ platy₁₉ play₁₉ ploy₁₉ plurry₁₉ ply₁₉ pory₁₉ posey₁₉ posy₁₉ prey₁₉ prosy₁₉ pry₁₉ pursy₁₉ purty₁₉ purvey₁₉ puy₁₉ parvo₂₀ poo₂₀ proo₂₀ proso₂₀ pareu₂₁ patu₂₁ poyou₂₁

Group 5: la₇ lea₇ las₈ leas₈ les₈ leu₉ lar₁₀ lear₁₀ lur₁₀ lare₁₁ lase₁₁ leare₁₁ lease₁₁ leasure₁₁ lee₁₁ lere₁₁ lure₁₁ last₁₂ lat₁₂ least₁₂ leat₁₂ leet₁₂ lest₁₂ let₁₂ lo₁₃ lares₁₄ lars₁₄ lases₁₄ lass₁₄ lasts₁₄ lats₁₄ leares₁₄ lears₁₄ leases₁₄ leasts₁₄ leasures₁₄ leats₁₄ lees₁₄ leets₁₄ leres₁₄ leses₁₄ less₁₄ lests₁₄ lets₁₄ los₁₄ lues₁₄ lures₁₄ lurs₁₄ laree₁₅ late₁₅ leese₁₅ lose₁₅ lute₁₅ laer₁₆ laser₁₆ laster₁₆ later₁₆ leaser₁₆ leer₁₆ lesser₁₆ lor₁₆ loser₁₆ lurer₁₆ luser₁₆ luter₁₆ lav₁₇ lev₁₇ luv₁₇ lave₁₈ leave₁₈ lessee₁₈ leve₁₈ lore₁₈ love₁₈ lurve₁₈ lay₁₉ leary₁₉ leavy₁₉ leery₁₉ levy₁₉ ley₁₉ lory₁₉ lovey₁₉ loy₁₉ lurry₁₉ laevo₂₀ lasso₂₀ levo₂₀ loo₂₀ lassu₂₁ latu₂₁ lou₂₁

Group 6: ea₇ eas₈ es₈ eau₉ ear₁₀ er₁₀ ease₁₁ ee₁₁ ere₁₁ east₁₂ eat₁₂ est₁₂ et₁₂ euro₁₃ ears₁₄ eases₁₄ easts₁₄ eats₁₄ eaus₁₄ eres₁₄ eros₁₄ ers₁₄ eses₁₄ ess₁₄ ests₁₄ euros₁₄ erose₁₅ esse₁₅ easer₁₆ easter₁₆ eater₁₆ err₁₆ ester₁₆ erev₁₇ eave₁₈ eve₁₈ easy₁₉ eatery₁₉ eery₁₉ estro₂₀ evo₂₀

Group 7: a₇ as₈ ar₁₀ ae₁₁ are₁₁ aue₁₁ aret₁₂ art₁₂ at₁₂ auto₁₃ ares₁₄ arets₁₄ ars₁₄ arts₁₄ ass₁₄ ats₁₄ aures₁₄ autos₁₄ arete₁₅ arose₁₅ arse₁₅ ate₁₅ aster₁₆ arere₁₈ ave₁₈ aery₁₉ arsey₁₉ arsy₁₉ artery₁₉ artsy₁₉ arty₁₉ ary₁₉ ay₁₉ aero₂₀ arvo₂₀ avo₂₀ ayu₂₁

Group 8: sur₁₀ sue₁₁ sure₁₁ set₁₂ st₁₂ suet₁₂ so₁₃ sets₁₄ sos₁₄ sues₁₄ suets₁₄ sures₁₄ sus₁₄ see₁₅ sese₁₅ setose₁₅ seer₁₆ ser₁₆ suer₁₆ surer₁₆ sutor₁₆ sov₁₇ sere₁₈ serve₁₈ sore₁₈ stere₁₈ sterve₁₈ store₁₈ stove₁₈ sesey₁₉ sey₁₉ soy₁₉ stey₁₉ storey₁₉ story₁₉ sty₁₉ suety₁₉ surety₁₉ surrey₁₉ survey₁₉ servo₂₀ stereo₂₀ sou₂₁ susu₂₁

Group 9: ur₁₀ ure₁₁ ut₁₂ ures₁₄ us₁₄ uts₁₄ use₁₅ ute₁₅ ureter₁₆ user₁₆ uey₁₉ utu₂₁

Group 10: re₁₁ ret₁₂ reo₁₃ reos₁₄ res₁₄ rets₁₄ ree₁₅ rete₁₅ roe₁₅ rose₁₅ rev₁₇ reeve₁₈ resee₁₈ reserve₁₈ retore₁₈ rore₁₈ rove₁₈ retry₁₉ rory₁₉ rosery₁₉ rosy₁₉ retro₂₀ roo₂₀

Group 11: et₁₂ es₁₄ ee₁₅ er₁₆ ere₁₈ eve₁₈ eery₁₉ evo₂₀

Group 12: to₁₃ te₁₅ toe₁₅ tose₁₅ tor₁₆ tee₁₈ tore₁₈ toey₁₉ tory₁₉ toy₁₉ trey₁₉ try₁₉ too₂₀ toro₂₀ toyo₂₀

Group 13: o₁₃ os₁₄ oe₁₅ ose₁₅ or₁₆ ore₁₈ oy₁₉ oo₂₀ ou₂₁

Group 14: ser₁₆ see₁₈ sere₁₈ serve₁₈ sey₁₉ servo₂₀ so₂₀ sou₂₁

Group 15: er₁₆ ee₁₈ ere₁₈ eve₁₈ evo₂₀

Group 16: re₁₈ reo₂₀

Group 17:

Group 18:

Group 19: yo₂₀ you₂₁ yu₂₁

Group 20: o₂₀ ou₂₁

Group 21:

Naturally, I’ve tried out the code on a few other well-known phrases.

If Lynn and I had met at a different dining establishment, she might have found a straw with the statement, “It takes two hands to handle a Whopper.” There’s quite a diverse assortment of possible messages lurking in this text, with 1,154 unique foldable words and almost 2,000 word instances. Perhaps she would have chosen the upbeat “Inhale hope.” Or, in a darker mood, “I taste woe.”

If we had been folding dollar bills instead of straw wrappers, “In God We Trust” might have become the forward-looking proclamation, “I go west!” Horace Greeley’s marching order on the same theme, “Go west, young man,” gives us the enigmatic “O, wet yoga!” or, perhaps more aptly, “Gunman.”

Jumping forward from 1967 to 2021—from the Summer of Love to the Winter of COVID—I can turn “Wear a mask. Wash your hands.” into the plaintive, “We ask: Why us?” With “Maintain social distance,” the best I can do is “A nasal dance” or “A sad stance.”

And then there’s “Make America Great Again.” It yields “Meme rage.” Also “Make me ragtag.”

Appendix: The Word-List Problem.

In a project like this one, you might think that getting a suitable list of English words would be the easy part. In fact it seems to be the main trouble spot.

The Scrabble lexicon I’ve been relying on derives from a word list known as SOWPODS, compiled by two associations of Scrabble players starting in the 1980s. Current editions of the list are distributed by a commercial publisher, Collins Dictionaries. If I understand correctly, all versions of the list are subject to copyright (see discussion on Stack Exchange) and cannot legally be distributed without permission. But no one seems to be much bothered by that fact. Copies of the lists in plain-text format, with one word per line, are easy to find on the internet—and not just on dodgy sites that specialize in pirated material.

There are alternative lists without legal encumbrances. Indeed, there’s a good chance you already have one such list pre-installed on your computer. A file called words is included in most distributions of the Unix operating system, including macOS; my copy of the file lives in /usr/share/dict/words. If you don’t have or can’t find the Unix words file, I suggest downloading the Natural Language Toolkit, a suite of data files and Python programs that includes a lexicon almost identical to Unix words, as well as many other linguistic resources.

The Scrabble list has one big advantage over words: It includes plurals and inflected forms of verbs—not just test but also tests, tested, and testing. [Bad example; see comments below.] The words file is more like a list of dictionary head words, with only the stem form explicitly included. On the other hand, words has an abundance of names and other proper nouns, as well as abbreviations, which are excluded from the Scrabble list since they are not legal plays in the board game.

How about combining the two word lists? Their union has just under 400,000 entries—quite a large lexicon. Using this augmented list for the analysis of “It’s a pleasure to serve you,” my program finds an additional 219 foldable words, beyond the 778 found with the Scrabble list alone. Here they are:

aaru aer aerose aes alares alaster alea alerse aleut alo alose alur aly ao apa apar aperu apus aro arry aru ase asor asse ast astor atry aueto aurore aus ausu aute e eastre eer erse esere estre eu ey iao ie ila islay ist isuret itala itea iter ito iyo l laet lao larry larve lastre lasty latro laur leo ler lester lete leto loro lu lue luo lut luteo lutose ly oer ory ovey p parsee parto passo pastose pato pau paut pavo pavy peasy perty peru pess peste pete peto petr plass platery pluto poe poy presee pretry pu purre purry puru r reve ro roer roey roy s sa saa salar salat salay saltee saltery salvy sao sapa saple sapo sare sart saur sauty sauve se seary seave seavy seesee sero sert sesuto sla slare slav slete sloo sluer soe sory soso spary spass spave spleet splet splurt spor spret sprose sput ssu stero steve stre strey stu sueve suto sutu suu t taa taar tal talao talose taluto tapeats tapete taplet tapuyo tarr tarse tartro tarve tasser tasu taur tave tavy teaer teaey teart teasy teaty teave teet teety tereu tess testor toru torve tosy tou treey tsere tst tu tue tur turr turse tute tutory u uro urs uru usee v vu y

Many of the proper nouns in this list are present in the vocabulary of most English speakers: Aleut, Peru, Pluto, Slav; the same is true of personal names such as Larry, Leo, Stu, Tess. But the rest of the words are very unlikely to turn up in the smalltalk of teenage sweethearts. Indeed, the list is full of letter sequences I simply don’t recognize as English words. Please define isuret, ovey, spleet, or sput.

There are even bigger word lists out there. In 2006 Google extracted 13.5 million unique English words from public web pages. (The sheer number implies a very liberal definition of English and word.) A good place to start exploring this archive is Peter Norvig’s website, which offers a file with the 333,333 most frequent words from the corpus. The list begins as you might expect: the, of, and, to, a, in, for…; but the weirdness creeps in early. The single letters c, e, s, and x are all listed among the 100 most common “words,” and the rest of the alphabet turns up soon after. By the time we get to the end of the file, it’s mostly typos (mepquest, halloweeb, scholarhips), run-together words (dietsdontwork, weightlossdrugs), and hundreds of letter strings that have some phonetic or orthographic resemblance to Google or Yahoo! or both (hoogol, googgl, yahhol, gofool, yogol). (I suspect that much of this rubbish was scraped not from the visible text of web pages but from metadata stuffed into headers for purposes of search-engine optimization.)

Applying the Google list to the search for foldable words more than doubles the volume of results, but it contributes almost nothing to the stock of words that might form interesting messages. I found 1,543 new words, beyond those that are also present in the union of the Scrabble and Unix lists. In alphabetical order, the additions begin: aae, aao, aaos, aar, aare, aaro, aars, aart, aarts, aase, aass, aast, aasu, aat, aats, aatsr, aau, aaus, aav, aave, aay, aea, aeae…. I’m not going to be folding up any straw wrappers with those words for my sweetheart.

What we really need, I begin to think, is not a longer word list but a shorter and more discriminating one.

We Gather Together…

Brian Hayes — Tue, 24 Nov 2020 16:55:09 +0000

The Thanksgiving holiday is upon us, but Anthony Fauci and the CDC and 79 percent of epidemiologists are urging us to forgo the big family gathering this year. I’m sure that’s sound advice, but I haven’t seen much quantitative analysis to back it up. How serious is the risk when we go over the river and through the woods to grandmother’s house? What are the public health consequences if the whole country sticks to the familiar ritual of too much food and football?

The tableau presented below is a product of my amateur efforts to address these questions. It’s a simple exercise in the mechanics of probability. I take a sample of the U.S. population, roughly 10,000 people, and randomly assign them to clusters of size $n$, where $n$ can range from 1 to 32. (In any single run of the model, $n$ is fixed; all the groups are the same size.) Each cluster represents a Thanksgiving gathering. If a cluster includes someone infected with SARS-CoV-2, the disease may spread to the uninfected and susceptible members of the same group.

With the model’s default settings, $n = 12$. The population sample consists of 9,900 people, represented as tiny colored dots arranged in 825 clusters of 12 dots each. Most of the dots are green, indicating susceptible individuals. Red dots are the infectious spreaders. Purple dots represent the unfortunates who are newly infected as a result of mingling with spreaders in these holiday get-togethers. I count the purple dots and estimate the rate of new infections per 100,000 population.

You can explore the model on your own. Twiddle with the sliders in the control panel, then press the “Go” button to generate a new sample population and a new cycle of infections. For example, by moving the group-size slider you can get a thousand clusters of 10 persons each, or 400 clusters of 25 each.

Before going any further with this discussion, I should make clear that the simulation is not offered as a prediction of how Covid-19 will spread during tomorrow’s Thanksgiving festivities. This is not a guide to personal risk assessment. If you play around with the controls, you’ll soon discover you can make the model say anything you wish. Depending on the settings you choose, the result can lie anywhere along the entire spectrum of possible outcomes, from nobody-gets-sick to everybody’s-got-it. There are settings that lead to impossible states, such as infection rates beyond 100 percent. Even so, I’m not totally convinced that the model is useless. It might point to combinations of parameters that would limit the damage.

The crucial input that drives the model is the daily tally of Covid cases for the entire country, expressed as a rate of new infections per 100,000 population. The official version of this statistic is published by the CDC; a few other organizations, including Johns Hopkins and the New York Times, maintain their own daily counts. The CDC report for November 24 cites a seven-day rolling average of 52.3 new cases per 100,000 people. For the model I set the default rate at 50, but the slider marked “daily new cases per 100,000 population” will accommodate any value between 0 and 500.

From the daily case rate we can estimate the prevalence of the disease: the total number of active cases at a given moment. In the model, the prevalence is simply 14 times the daily case rate. In effect, I am assuming (or pretending) that the daily rate is unchanging and that everyone’s illness lasts 14 days from the moment of infection to full recovery. Neither of these assumptions is true. In a model of ongoing disease propagation, where today’s events determine what happens next week, the steady-state approximation would be unacceptable. But this model produces only a snapshot on one particular day of the year, and so dynamics are not very important.

What we do need to consider in more detail is the sequence of stages in a case of Covid-19. The archetypal model in epidemiology has three stages: susceptible (S), infected (I), and recovered (R); thus the moniker SIR. (In some accounts the letter R stands for “removed,” acknowledging that recovery isn’t the only possible end of an illness. But I am going to look away from the grimmer aspects of this story.) I find it convenient to divide the infected stage into three substages: incubating (U), infectious (I), and symptomatic (Q), which gives us a SUIQR model. An incubating patient has been infected but is not yet producing enough virus particles to infect others. The infectious stage is the most dangerous period: Patients have no conspicuous symptoms and are still unaware of their own infection, but nonetheless they are spewing virus particles with every breath.

During the symptomatic phase, patients know they are sick and should be in quarantine; hence the letter Q. For the purposes of the model I assume that everyone in category Q will decline the invitation to Thanksgiving dinner. (In some contexts the assumption that everyone will abide by quarantine rules would be unrealistic. But I want to believe that very few people would knowingly endanger their parents and grandparents, siblings, cousins, aunts and uncles.) In the tableau each such person is represented by a red x, which I think of as an empty chair at the dinner table. The purple dots for newly acquired infections add a sixth category to the model, although they really belong to the incubating U class.

A parameter of some importance is the duration of the presymptomatic infectious stage, since the red-dot people in that category are the only ones actually spreading the disease in my model of Thanksgiving gatherings. I made a foray into the medical literature to pin down this number, but what I learned is that after a year of intense scrutiny there’s still a lot we don’t know about Covid-19. The typical period from infection to the onset of symptoms (encompassing both the U and I stages of my model) is four or five days, but apparently it can range from less than two days to three weeks. The graph below is based on a paper by Conor McAloon and colleagues that aggregates results of eight studies carried out early in the pandemic (when it’s easier to determine the date of infection, since cases are rare and geographically isolated).

Incubation period of Covid-19, modeled as a lognormal distribution. The data come from a meta-analysis published by Conor McAloon and colleagues. The graph itself was generated by an R/Shiny app they have made available.

Ultimately I decided, for the sake of simplicity (or lazy convenience) to collapse this distribution to its median, which is about five days. Then there’s the question of when within this period an infected person becomes dangerous to those nearby. Various sources [Harvard, MIT, Fox News] suggest that infected individuals begin spreading the virus two or three days before they show symptoms, and that the moment of maximum infectiousness comes shortly before symptom onset. I chose to interpret “two or three days” as 2.5 days.

What all this boils down to is the following relation: If the national new-case rate is 50 per 100,000, then among Thanksgiving celebrants in the model, 125 per 100,000 are Covid spreaders. That’s 0.125 percent. Turn to the person on your left. Turn to your right. Are you feeling lucky?
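As a rough sketch of the bookkeeping (not the code behind the tableau), the steady-state assumption turns each stage’s duration directly into a head count. The function name `stage_counts` is made up for illustration, and the split of the 14 days into 2.5 incubating, 2.5 infectious, and 9 symptomatic days is one reading of the figures quoted above rather than something the model is known to do.

```python
def stage_counts(daily_rate, illness_days=14.0, incubation_days=5.0,
                 presymptomatic_days=2.5):
    """Steady-state cases per 100,000 in each stage of infection.

    Assumes a constant daily case rate, so the number of people in a stage
    is simply the daily rate times the number of days spent in that stage.
    """
    days = {
        "U": incubation_days - presymptomatic_days,  # incubating, not yet infectious
        "I": presymptomatic_days,                    # infectious, no symptoms yet
        "Q": illness_days - incubation_days,         # symptomatic, in quarantine
    }
    return {stage: daily_rate * d for stage, d in days.items()}


counts = stage_counts(50)
print(counts)                # {'U': 125.0, 'I': 125.0, 'Q': 450.0}
print(sum(counts.values()))  # 700.0, i.e. 14 times the daily rate
```

Only the I entry feeds the contagion step; the other two stages matter mainly because they remove people from the pool of susceptibles or from the dinner table.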

The model’s default settings assume a new-case rate of $50$ per $100{,}000$, a Thanksgiving group size of $12$, and a $0.25$ probability of transmitting the virus from an infectious person to a susceptible person. Let’s do some back-of-the-envelope calculating. As noted above, the $50/100{,}000$ new-case rate translates into $125/100{,}000$ infectious individuals. Among the $\approx 10{,}000$ members of the model population, we should expect to see $12$ or $13$ red-dot Is. Because the number of Is is much smaller than the number of groups $(825)$, it’s unlikely that more than one red dot will turn up in any single group of $12$. In each group with a single spreader, we can expect the virus to jump to $0.25 \times 11 = 2.75$ of the spreader’s companions. This assumes that all the companions are green-dot susceptibles, which isn’t quite true. There are also yellow-dot incubating and blue-dot recovered people, as well as the red-x empty chairs of those in quarantine. But these are small corrections. The envelope estimate gives $344/100{,}000$ new infections on Thanksgiving day; the computer model yields $325$ per $100{,}000$, when averaged over many runs.
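The envelope arithmetic can be written out as a short Python sketch. The function name `envelope_estimate` and its parameter names are invented here; the calculation ignores every one of the small corrections just mentioned.

```python
def envelope_estimate(daily_rate=50.0, group_size=12, p_transmit=0.25,
                      infectious_days=2.5):
    """Expected new infections per 100,000, ignoring all small corrections."""
    spreaders_per_100k = daily_rate * infectious_days
    # Each spreader sits with (group_size - 1) companions, all assumed susceptible.
    return spreaders_per_100k * p_transmit * (group_size - 1)


print(9900 * 125 / 100_000)  # 12.375 expected red dots among 9,900 people
print(envelope_estimate())   # 343.75 new infections per 100,000
```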

But the average doesn’t tell the whole story. The variance of these outcomes is quite high, as you’ll see if you press the “Go” button repeatedly. Counting the number of new infections in each of a million runs of the model, the distribution looks like this:

The peak of the curve is at 30 new infections per model run, which corresponds to about 300 cases for 100,000 population, but you shouldn’t be surprised to see a result of 150 or 500.
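A stripped-down version of the experiment is easy to sketch in Python. This is not the program behind the tableau: it assumes each person is independently infectious with probability 125/100,000, treats everyone else as susceptible (no incubating, recovered, or quarantined guests), and assumes that a susceptible person seated with k spreaders escapes infection with probability 0.75 raised to the power k. The name `simulate_gatherings` is hypothetical.

```python
import random


def simulate_gatherings(population=9900, group_size=12, daily_rate=50.0,
                        p_transmit=0.25, infectious_days=2.5, rng=random):
    """Return the number of new infections in one simulated Thanksgiving."""
    p_spreader = daily_rate * infectious_days / 100_000
    # True marks a spreader. The draws are independent, so slicing the list
    # into consecutive chunks is equivalent to assigning groups at random.
    people = [rng.random() < p_spreader for _ in range(population)]
    new_infections = 0
    for start in range(0, population, group_size):
        group = people[start:start + group_size]
        spreaders = sum(group)
        if spreaders == 0:
            continue
        susceptibles = len(group) - spreaders
        # Chance of catching the virus from at least one of the spreaders.
        p_infected = 1.0 - (1.0 - p_transmit) ** spreaders
        new_infections += sum(rng.random() < p_infected
                              for _ in range(susceptibles))
    return new_infections


rng = random.Random(2020)
runs = [simulate_gatherings(rng=rng) for _ in range(200)]
print(min(runs), sum(runs) / len(runs), max(runs))
```

Because the number of spreaders is small and each one infects a variable number of companions, the spread between the smallest and largest run should be wide, which is the point of the histogram.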

If the effect of Thanksgiving gatherings in the real world matches the results of this model, we’re in serious trouble. A rate of 300 cases per 100,000 people corresponds to just under a million new cases in the U.S. population. All of those infections would arise on a single day (although few of them would be detected until about a week later). That’s an outburst of contagion more than five times bigger than the worst daily toll recorded so far.

But there are plenty of reasons to be skeptical of this result.

Even in a “normal” year, not everyone in America sits down at a table for 12 to exchange gossip and germs, and surely many more will be sitting out this year’s events. According to a survey attributed to the Ohio State University Wexner Medical Center, about 40 percent of Americans plan to get together with people outside their household or in a group of more than 10 people. (The survey has been widely reported in the press, but I have been unable to find any sources more authoritative than a blog post and a press release. I’ve seen no links to a journal publication, and no discussion of sample size or methodology.) The infection rate could be reduced accordingly.

Another potential mitigating factor is that people invited to your holiday celebration are probably not selected at random from the whole population, as they are in the model. Guests tend to come in groups, often family units. If your aunt and uncle and their three kids all live together, they probably get sick together, too. Thus a gathering of 12 individuals might better be treated as an assembly of three or four “pods.” One way to introduce this idea into the computer model is to enforce nonzero correlations between the people selected for each group. If one attendee is infectious, that raises the probability that others will also be infectious, and vice versa. As the correlation coefficient increases, groups are increasingly homogeneous. If lots of spreaders are crowded in one group, they can’t infect the vulnerable people in other groups. In the model, a correlation coefficient of 0.5 reduces the average number of new cases from 32.5 to 23.5. (Complete or perfect correlation eliminates contagion altogether, but this is highly unrealistic.)

Geography should also be considered. The national average case rate of 50 per 100,000 conceals huge local and regional variations. In Hawaii the rate is about 5 cases, so if you and all your guests are Hawaiians, you’ll have to be quite unlucky to pick up a Covid case at the Thanksgiving luau. At the other end of the scale, there are counties in the Great Plains states that have approached 500 cases per 100,000 in recent weeks. A meal with a dozen attendees in one of those hotspots looks fairly calamitous: The model shows 3,000 new cases for 100,000, or 3 percent of the population.

If you are determined to have a big family meal tomorrow and you want to minimize the risks, there are two obvious strategies. You can reduce the chance that your gathering includes someone infectious, or you can reduce the likelihood that any infectious person who happens to be present will transmit the virus to others. Most of the recommendations I’ve read in the newspaper and on health-care websites focus on the latter approach. They urge us to wear masks, to keep everyone at arm’s length, to wash our hands, to open all the windows (or better yet to hold the whole affair outdoors). Making it a briefer event should also help.

In the model, any such measures are implemented by nudging the slider for transmission probability toward smaller values. The effect is essentially linear over a broad range of group sizes. Reducing the transmission probability by half reduces the number of new infections proportionally.

The probability that a spreader will produce at least one new infection in a group is a nearly linear function of group size except for the smallest and largest groups.

The trouble is, I have no firm idea of what the actual transmission probability might be, or how effective those practices would be in reducing it. A recent study by a group at Vanderbilt University found a transmission rate within households of greater than 50 percent. I chose 25 percent as the default value in the model on the grounds that spending a single day together should be less risky than living permanently under the same roof. But the range of plausible values remains quite wide. Perhaps studies done in the aftermath of this Thanksgiving will yield better data.

As for reducing the chance of having an infectious guest, one approach is simply to reduce the size of the group. In this case the effect is better than linear, but only slightly so. Splitting that 12-person meal into two separate 6-seat gatherings cuts the infection rate by a little more than half, from 32.5 to 15.2. And, predictably, larger groups have worse outcomes. Pack in 24 people per group and you can expect 70 infections. Neither of these strategies seems likely to cut the infection rate by a factor of 10 or more. Unless, of course, everyone eats alone. Set the group-size slider to 1, and no one gets sick.
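The group-size effect can be seen in the same back-of-the-envelope terms. The sketch below is an approximation with invented function names, not the model itself; it assumes 125 spreaders per 100,000 and a 0.25 transmission probability.

```python
def chance_of_spreader(group_size, spreaders_per_100k=125.0):
    """Probability that a random group includes at least one infectious person."""
    q = spreaders_per_100k / 100_000
    return 1.0 - (1.0 - q) ** group_size


def infections_per_100k(group_size, spreaders_per_100k=125.0, p_transmit=0.25):
    """Envelope estimate: each spreader exposes (group_size - 1) companions."""
    return spreaders_per_100k * p_transmit * (group_size - 1)


for n in (1, 6, 12, 24):
    print(n, chance_of_spreader(n), infections_per_100k(n))
```

The estimate is proportional to the group size minus one rather than to the group size itself, which is why halving the guest list does slightly better than halving the damage: 156.25 for a group of 6 against 343.75 for a group of 12, a ratio of 5/11, close to the 15.2/32.5 seen in the model. A group of one gives exactly zero.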

Another factor to keep in mind is that this model counts only infections passed from person to person during a holiday get-together. Leaving all those cases aside, the country has quite a fierce rate of “background” transmission happening on days with no special events. If the Thanksgiving cases are to be added to the background cases, we’re even worse off than the model would suggest. But the effect could be just the opposite. A family holiday is an occasion when most people skip some ordinary activities that can also be risky. Most of us have the day off from work. We are less likely to go out to a bar or a restaurant. It’s even possible that the holiday will actually suppress the overall case rate. But don’t bet your life on it.

There’s one more wild card to be taken into account. A tacit assumption in the structure of the model is that the reported Covid case count accurately reflects the prevalence of the disease in the population. This is surely not quite true. There are persistent reports of asymptomatic cases—people who are infected and infectious, but who never feel unwell. Those cases are unlikely to be recorded. Others may be ill and suspect the cause is Covid but avoid getting medical care for one reason or another. (For example, they may fear losing their job.) All in all, it seems likely the CDC is under-reporting the number of infections.

Early in the course of the epidemic, a group at Georgia Tech led by Aroon Chande built a risk-estimating web tool based on case rates for individual U.S. counties. They included an adjustment for “ascertainment bias” to compensate for cases omitted from official public health estimates. Their model multiplies the reported case counts by a factor of either 5 or 10. This adjustment may well have been appropriate last spring, when Covid testing was hard to come by even for those with reasonable access to medical services. It seems harder to justify such a large multiplier now, but the model, which is still being maintained, continues to insert a fivefold or tenfold adjustment. Out of curiosity, I have included a slider that can be set to make a similar adjustment.

Is it possible that we are still counting only a tenth of all the cases? If so, the cumulative total of infections since the virus first came ashore in the U.S. is 10 times higher than official estimates. Instead of 12.5 million total cases, we’ve experienced 125 million; more than a third of the population has already been through the ordeal and (mostly) come out the other side. We’ll know the answer soon. At the present infection rate (multiplied by 10), we will have burned through another third of the population in just a few weeks, and infection rates should fall dramatically through herd immunity. (I’m not betting my life on this one either.)

One other element of the Covid story that ought to be in the model is testing, which provides another tool for improving the chances that we socialize only with safe companions. If tests were completely reliable, their effect would merely be to move some fraction of the dangerous red-dot category into the less-dangerous red-x quarantined camp. But false-positive and false-negative testing results complicate the situation. (If the actual infection rate is low, false positives may outnumber true positives.)
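To see how false positives can swamp true ones, here is a small worked example in Python. The sensitivity and specificity values are purely hypothetical, chosen for illustration and not taken from any real test; the prevalence of 125 per 100,000 is the model’s default share of presymptomatic spreaders.

```python
def screening_outcomes(prevalence, sensitivity, specificity, population=100_000):
    """Expected true and false positives when everyone in the population is tested."""
    infected = prevalence * population
    healthy = population - infected
    true_pos = sensitivity * infected           # infected people correctly flagged
    false_pos = (1.0 - specificity) * healthy   # healthy people wrongly flagged
    return true_pos, false_pos


# Hypothetical test: catches 90% of infections, wrongly flags 1% of healthy people.
tp, fp = screening_outcomes(prevalence=0.00125, sensitivity=0.90, specificity=0.99)
print(round(tp, 1), round(fp, 1))   # roughly 112 true vs. 999 false positives
print(round(tp / (tp + fp), 3))     # share of positive results that are real
```

Under these assumed numbers only about one positive result in ten marks a real infection, even though the test is wrong about healthy people just 1 percent of the time.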

I offer no conclusions or advice as a result of my little adventure in computational epidemiology. You should not make life-or-death decisions based on the writings of some doofus at a website called bit-player. (Nor based on a tweet from @realDonaldTrump.)

I do have some stray thoughts about the nature of holidays in Covid times. In the U.S. most of our holidays, both religious and secular, are intensely social, convivial occasions. Thanksgiving is a feast, New Year’s Eve is a party, Mardi Gras is a parade, St. Patrick’s Day is a pub crawl, July Fourth is a picnic. I’m not asking to abolish these traditions, some of which I enjoy myself. But they are not helping matters in the midst of a raging epidemic. Every one of these occasions can be expected to produce a spike in that curve we’re supposed to be flattening.

I wish we could find a spot on the calendar for a new kind of holiday—a day or a weekend for silent and solitary contemplative respite. Close the door, or go off by yourself. Put a dent in the curve.

Competitive exclusion does not forbid all cohabitation. Suppose olive and orange rely on two mineral nutrients in the soil—say, iron and calcium. Assume both of these elements are in short supply, and their availability is what limits growth in the populations of the trees. If olive trees are better at taking up iron and oranges assimilate calcium more effectively, then the two species may be able to reach an accommodation where both survive.

In this model, neither species is driven to extinction. At the default setting of the slider control, where iron and calcium are equally abundant in the environment, olive and orange trees also maintain roughly equal numbers on average. Random fluctuations carry them away from this balance point, but not very far or for very long. The populations are stabilized by a negative feedback loop. If a random perturbation increases the proportion of olive trees, each one of those trees gets a smaller share of the available iron, thereby reducing the species’ potential for further population growth. The orange trees are less affected by an iron deficiency, and so their population rebounds. But if the oranges then overshoot, they will be restrained by overuse of the limited calcium supply.

Moving the slider to the left or right alters the balance of iron and calcium in the environment. A 60:40 proportion favoring iron will shift the equilibrium between the two tree species, allowing the olives to occupy more of the territory. But, as long as the resource ratio is not too extreme, the minority species is in no danger of extinction. The two kinds of trees have a live-and-let-live arrangement.

In the idiom of ecology, the olive and orange species escape the rule of competitive exclusion because they occupy distinct niches, or roles in the ecosystem. They are specialists, preferentially exploiting different resources. The niches do not have to be completely disjoint. In the simulation above they overlap somewhat: The olives need calcium as well as iron, but only 25 percent as much; the oranges have mirror-image requirements.

Will this loophole in the law of competitive exclusion admit more than two species? Yes: N competing species can coexist if there are at least N independent resources or environmental strictures limiting their growth, and if each species has a different limiting factor. Everybody must have a specialty. It’s like a youth soccer league where every player gets a trophy for some unique, distinguishing talent.

This notion of slicing and dicing an ecosystem into multiple niches is a well-established practice among biologists. It’s how Darwin explained the diversity of finches on the Galapagos islands, where a dozen species distinguish themselves by habitat (ground, shrubs, trees) or diet (insects, seeds and nuts of various sizes). Forest trees might be organized in a similar way, with a number of microenvironments that suit different species. The process of creating such a diverse community is known as niche assembly.

Some niche differentiation is clearly present among forest trees. For example, gums and willows prefer wetter soil. In my local woods, however, I can’t detect any systematic differences in the sites colonized by maples, oaks, hickories and other trees. They are often next-door neighbors, on plots of land with the same slope and elevation, and growing in soil that looks the same to me. Maybe I’m just not attuned to what tickles a tree’s fancy.

Niche assembly is particularly daunting in the tropics, where it requires a hundred or more distinct limiting resources. Each tree species presides over its own little monopoly, claiming first dibs on some environmental factor no one else really cares about. Meanwhile, all the trees are fiercely competing for the most important resources, namely sunlight and water. Every tree is striving to reach an opening in the canopy with a clear view of the sky, where it can spread its leaves and soak up photons all day long. Given the existential importance of winning this contest for light, it seems odd to attribute the distinctive diversity of forest communities to squabbling over other, lesser resources.

Where niche assembly makes every species the winner of its own little race, another theory dispenses with all competition, suggesting the trees are not even trying to outrun their peers. They are just milling about at random. According to this concept, called neutral ecological drift, all the trees are equally well adapted to their environment, and the set of species appearing at any particular place and time is a matter of chance. A site might currently be occupied by an oak, but a maple or a birch would thrive there just as well. Natural selection has nothing to select. When a tree dies and another grows in its place, nature is indifferent to the species of the replacement.

This idea brings us back to a question I sidestepped above: What happens when two competing species are exactly equal in fitness? The answer is the same whether there are two species or ten, so for the sake of visual variety let’s look at a larger community.

If you have run the simulation—and if you’ve been patient enough to wait for it to finish—you are now looking at a monochromatic array of trees. I can’t know what the single color on your screen might be—or in other words which species has taken over the entire forest patch—but I know there’s just one species left. The other nine are extinct. In this case the outcome might be considered at least a little surprising. Earlier we learned that if a species has even a slight advantage over its neighbors, it will take over the entire system. Now we see that no advantage is needed. Even when all the players are exactly equal, one of them will emerge as king of the mountain, and everyone else will be exterminated. Harsh, no?

Here’s a record of one run of the program, showing the abundance of each species as a function of time:

At the outset, all 10 species are present in roughly equal numbers, clustered close to the average abundance of $625/10$. As the program starts up, the grid seethes with activity as the sites change color rapidly and repeatedly. Within the first 70,000 time steps, however, all but three species have disappeared. The three survivors trade the lead several times, as waves of contrasting colors wash over the array. Then, after about 250,000 steps, the species represented by the bright green line drops to zero population: extinction. The final one-on-one stage of the contest is highly uneven (the orange species is close to total dominance and the crimson one is bumping along near extinction), but nonetheless the tug of war lasts another 100,000 steps. (Once the system reaches a monospecies state, nothing more can ever change, and so the program halts.)
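For readers who want to reproduce a run like this one, here is a minimal Python sketch of neutral drift. It is not the program behind the display: it assumes a well-mixed forest in which both the victim and the parent of its replacement are chosen uniformly from all 625 sites (possibly the same site), with no spatial neighborhood. The function name `neutral_drift` is invented for the example.

```python
import random


def neutral_drift(sites=625, species=10, rng=random):
    """Run death-and-replacement steps until a single species remains.

    Returns the number of steps taken and the final count for each species.
    """
    forest = [i % species for i in range(sites)]         # near-equal starting shares
    counts = [forest.count(s) for s in range(species)]
    steps = 0
    while max(counts) < sites:
        victim = rng.randrange(sites)                    # this tree dies
        replacement = forest[rng.randrange(sites)]       # species of a random parent
        counts[forest[victim]] -= 1
        counts[replacement] += 1
        forest[victim] = replacement
        steps += 1
    return steps, counts


steps, counts = neutral_drift(rng=random.Random(42))
print(steps, counts)
```

Every species is treated identically at every step, yet the loop always exits with nine zeros in `counts`.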

This lopsided result is not to be explained by any sneaky bias hidden in the algorithm. At all times and for all species, the probability of gaining a member is exactly equal to the probability of losing a member. It’s worth pausing to verify this fact. Suppose species $X$ has population $x$, which must lie in the range $0 \le x \le 625$. A tree chosen at random will be of species $X$ with probability $x/625$; therefore the probability that the tree comes from some other species must be $(625 - x)/625$. $X$ gains one member if it is the replacement species but not the victim species, an event with a combined probability of $x(625 - x)/625^2$. $X$ loses one member if it is the victim but not the replacement, which has the same probability.
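The same check can be done with exact fractions in Python. The sketch assumes the victim and the replacement are chosen independently, so the two probabilities multiply; the function names are invented for illustration.

```python
from fractions import Fraction

N = 625  # number of trees in the plot


def gain_probability(x, n=N):
    """Replacement is species X (x/n) and the victim is not ((n - x)/n)."""
    return Fraction(x, n) * Fraction(n - x, n)


def loss_probability(x, n=N):
    """Victim is species X (x/n) and the replacement is not ((n - x)/n)."""
    return Fraction(n - x, n) * Fraction(x, n)


def no_change_probability(x, n=N):
    """Victim and replacement are both X, or neither is."""
    return Fraction(x, n) ** 2 + Fraction(n - x, n) ** 2


assert all(gain_probability(x) == loss_probability(x) for x in range(N + 1))
assert all(gain_probability(x) + loss_probability(x) + no_change_probability(x) == 1
           for x in range(N + 1))
print(gain_probability(100))  # 84/625
```

The gain and loss probabilities both vanish at $x = 0$ and $x = 625$, which is why those two states, once reached, are permanent.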

It’s a fair game. No loaded dice. Nevertheless, somebody wins the jackpot, and the rest of the players lose everything, every time.

The spontaneous decay of species diversity in this simulated patch of forest is caused entirely by random fluctuations. Think of the population $x$ as a random walker wandering along a line segment with $0$ at one end and $625$ at the other. At each time step the walker moves one unit right $(+1)$ or left $(-1)$ with equal probability; on reaching either end of the segment, the game ends. The most fundamental fact about such a walk is that it does always end. A walk that meanders forever between the two boundaries is not impossible, but it has probability $0$; hitting one wall or the other has probability $1$.

How long should you expect such a random walk to last? In the simplest case, with a single walker, the expected number of steps starting at position $x$ is $x(625 - x)$. This expression has a maximum when the walk starts in the middle of the line segment; the maximum length is just under $100{,}000$ steps. In the forest simulation with ten species the situation is more complicated because the multiple walks are correlated, or rather anti-correlated: When one walker steps to the right, another must go left. Computational experiments suggest that the median time needed for ten species to be whittled down to one is in the neighborhood of $320{,}000$ steps.
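The formula for the single walker can be checked without running any simulation. The expected duration $E(x)$ must satisfy $E(x) = 1 + \frac{1}{2}E(x-1) + \frac{1}{2}E(x+1)$ with $E(0) = E(625) = 0$, and these conditions determine it uniquely. A short Python sketch (function names invented here) confirms that $x(625 - x)$ fits:

```python
def expected_steps(x, n=625):
    """Expected duration of a fair +1/-1 walk from x, absorbed at 0 or n."""
    return x * (n - x)


def check_recurrence(n=625):
    """Verify E(x) = 1 + (E(x-1) + E(x+1)) / 2, using integer arithmetic only."""
    return all(
        2 * expected_steps(x, n)
        == 2 + expected_steps(x - 1, n) + expected_steps(x + 1, n)
        for x in range(1, n)
    )


print(check_recurrence())                           # True
print(max(expected_steps(x) for x in range(626)))   # 97656
```

The maximum, reached at $x = 312$ or $x = 313$, is the “just under $100{,}000$” quoted above.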

From these computational models it’s hard to see how neutral ecological drift could be the savior of forest diversity. On the contrary, it seems to guarantee that we’ll wind up with a monoculture, where one species has wiped out all others. But this is not the end of the story.

One issue to keep in mind is the timescale of the process. In the simulation, time is measured by counting cycles of death and replacement among forest trees. I’m not sure how to convert that into calendar years, but I’d guess that 320,000 death-and-replacement events in a tract of 625 trees might take 50,000 years or more. Here in New England, that’s a very long time in the life of a forest. This entire landscape was scraped clean by the Laurentide ice sheet just 20,000 years ago. If the local woodlands are losing species to random drift, they would not yet have had time to reach the end game.

The trouble is, this thesis implies that forests start out diverse and evolve toward a monoculture, which is not supported by observation. If anything, diversity seems to increase with time. The cove forests of Tennessee, which are much older than the New England woods, have more species, not fewer. And the hyperdiverse ecosystem of the tropical rain forests is thought to be millions of years old.

Despite these conceptual impediments, a number of ecologists have argued strenuously for neutral ecological drift, most notably Stephen P. Hubbell in a 2001 book, The Unified Neutral Theory of Biodiversity and Biogeography. The key to Hubbell’s defense of the idea (as I understand it) is that 625 trees do not make a forest, and certainly not a planet-girdling ecosystem.

Hubbell’s theory of neutral drift was inspired by earlier studies of the biogeography of islands, in particular the collaborative work of Robert H. MacArthur and Edward O. Wilson in the 1960s. Suppose our little plot of $625$ trees is growing on an island at some distance from a continent. For the most part, the island evolves in isolation, but every now and then a bird carries a seed from the much larger forest on the mainland. We can simulate these rare events by adding a facility for immigration to the neutral-drift model. In the panel below, the slider controls the immigration rate. At the default setting of $1/100$, every $100$th replacement tree comes not from the local forest but from a stable reserve where all $10$ species have an equal probability of being selected.

For the first few thousand cycles, the evolution of the forest looks much like it does in the pure-drift model. There’s a brief period of complete tutti-frutti chaos, then waves of color erupt over the forest as it blushes pink, then deepens to crimson, or fades to a sickly green. What’s different is that none of those expanding species ever succeeds in conquering the entire array. As shown in the timeline graph below, they never grow much beyond 50 percent of the total population before they retreat into the scrum of other species. Later, another tree color makes a bid for empire but meets the same fate. (Because there is no clear endpoint to this process, the simulation is designed to halt after 500,000 cycles. If you haven’t seen enough by then, click Resume.)

Immigration, even at a low level, brings a qualitative change to the behavior of the model and the fate of the forest. The big difference is that we can no longer say extinction is forever. A species may well disappear from the 625-tree plot, but eventually it will be reimported from the permanent reserve. Thus the question is not whether a species is living or extinct but whether it is present or absent at a given moment. At an immigration rate of $1/100$, the average number of species present is about $9.6$, so none of them disappear for long.
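Adding immigration to a drift simulation takes only a few extra lines. The Python sketch below is an approximation of the idea, not the code behind the panel: it assumes a well-mixed forest, and it treats the immigration rate as a probability applied independently at each step rather than as a strict every-100th-tree schedule. The function name `drift_with_immigration` is hypothetical.

```python
import random


def drift_with_immigration(steps, immigration_rate=0.01, sites=625,
                           species=10, rng=random):
    """Run a fixed number of steps (steps > 0).

    Returns the final species counts and the average number of species present.
    """
    forest = [i % species for i in range(sites)]
    counts = [forest.count(s) for s in range(species)]
    present_total = 0
    for _ in range(steps):
        victim = rng.randrange(sites)
        if rng.random() < immigration_rate:
            newcomer = rng.randrange(species)          # seed from the mainland reserve
        else:
            newcomer = forest[rng.randrange(sites)]    # offspring of a local tree
        counts[forest[victim]] -= 1
        counts[newcomer] += 1
        forest[victim] = newcomer
        present_total += sum(1 for c in counts if c > 0)
    return counts, present_total / steps


counts, avg_present = drift_with_immigration(200_000, rng=random.Random(1))
print(counts, avg_present)
```

Setting `immigration_rate=0` recovers pure drift, except that this version runs for a fixed number of steps instead of halting at a monoculture.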

With a higher level of immigration, the 10 species remain thoroughly mixed, and none of them can ever make any progress toward world domination. On the other hand, they have little risk of disappearing, even temporarily. Push the slider control all the way to the left, setting the immigration rate at $1/10$, and the forest display becomes an array of randomly blinking lights. In the timeline graph below, there’s not a single extinction.

Pushing the slider in the other direction, rarer immigration events allow the species distribution to stray much further from equal abundance. In the trace below, with an immigrant arriving every $1{,}000$th cycle, the population is dominated by one or two species for most of the time; other species are often on the brink of extinction—or over the brink—but they come back eventually. The average number of living species is about 4.3, and there are moments when only two are present.

Finally, with a rate of $1/10{,}000$, the effect of immigration is barely noticeable. As in the model without immigration, one species invades all the terrain; in the example recorded below, this takes about $400{,}000$ steps. After that, occasional immigration events cause a small blip in the curve, but it will be a very long time before another species is able to displace the incumbent.
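
For readers who want to experiment offline, here is a minimal Python sketch of drift with immigration. It is not the code behind the simulations on this page. It assumes that each cycle kills one randomly chosen tree, that an immigrant is a uniformly random pick from the reserve species, and that a local replacement copies a tree chosen at random from the whole plot. The function name `run_drift` and its parameters are hypothetical.

```python
import random

def run_drift(n_trees=625, n_species=10, immigration_rate=0.01,
              cycles=100_000, seed=1):
    """Neutral drift with occasional immigration from a permanent reserve.

    Returns the average number of species present over the run.
    All rules here are assumptions made for illustration.
    """
    rng = random.Random(seed)
    forest = [rng.randrange(n_species) for _ in range(n_trees)]
    counts = [0] * n_species
    for species in forest:
        counts[species] += 1
    present = sum(1 for c in counts if c > 0)

    total_present = 0
    for _ in range(cycles):
        site = rng.randrange(n_trees)              # this tree dies
        if rng.random() < immigration_rate:
            newcomer = rng.randrange(n_species)    # immigrant from the reserve
        else:
            newcomer = forest[rng.randrange(n_trees)]  # offspring of a local tree
        old = forest[site]
        if newcomer != old:
            counts[old] -= 1
            if counts[old] == 0:
                present -= 1                       # species is now absent
            if counts[newcomer] == 0:
                present += 1                       # species has (re)appeared
            counts[newcomer] += 1
            forest[site] = newcomer
        total_present += present
    return total_present / cycles
```

Running it with `immigration_rate` set to 1/10, 1/100, 1/1000, and 1/10000 gives a rough feel for how the average species count responds to the slider. The numbers need not match those quoted above, since the replacement rule is an assumption.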

The island setting of this model makes it easy to appreciate how sporadic, weak connections between communities can have an outsize influence on their development. But islands are not essential to the argument. Trees, being famously immobile, have only occasional long-distance communication, even when there’s no body of water to separate them. (It’s a rare event when Birnam Wood marches off to Dunsinane.) Hubbell formulates a model of ecological drift in which many small patches of forest are organized into a hierarchical metacommunity. Each patch is both an island and part of the larger reservoir of species diversity. If you choose the right patch sizes and the right rates of migration between them, you can maintain multiple species at equilibrium. Hubbell also allows for the emergence of entirely new species, which is also taken to be a random or selection-neutral process.

Niche assembly and neutral ecological drift are theories that elicit mirror-image questions from skeptics. With niche assembly we look at dozens or hundreds of coexisting tree species and ask, “Can every one of them have a unique limiting resource?” With neutral drift we ask, “Can all of those species be exactly equal in fitness?”

Hubbell responds to the latter question by turning it upside down. The very fact that we observe coexistence implies equality:

All species that manage to persist in a community for long periods with other species must exhibit net long-term population growth rates of nearly zero…. If this were not the case, i.e., if some species should manage to achieve a positive growth rate for a considerable length of time, then from our first principle of the biotic saturation of landscapes, it must eventually drive other species from the community. But if all species have the same net population growth rate of zero on local to regional scales, then ipso facto they must have identical or nearly identical per capita relative fitnesses.

Herbert Spencer proclaimed: Survival of the fittest. Here we have a corollary: If they’re all survivors, they must all be equally fit.

Now for something completely different.

Another theory of forest diversity was devised specifically to address the most challenging case—the extravagant variety of trees in tropical ecosystems. In the early 1970s J. H. Connell and Daniel H. Janzen, field biologists working independently in distant parts of the world, almost simultaneously came up with the same idea. In tropical forests, they suggested, trees practice social distancing as a defense against contagion, and this promotes diversity. (The phrase “social distancing” does not appear in the work of Connell and Janzen from 50 years ago, but today it’s irresistible as a description of their theory.)

A tropical rain forest is a tough neighborhood. Trees are under frequent attack by marauding gangs of predators, parasites, and pathogens. (Connell lumped these bad guys together under the label “enemies.”) Many of the enemies are specialists, targeting only trees of a single species. The specialization can be explained by competitive exclusion: Each tree species becomes a unique resource supporting one type of enemy.

Suppose a tree is beset by a dense population of host-specific enemies. The swarm of meanies attacks not only the adult tree but also any offspring of the host that have taken root near their parent. Since young trees are more vulnerable than adults, the entire cohort could be wiped out. Seedlings at a greater distance from the parent should have a better chance of remaining undiscovered until they have grown large and robust enough to resist attack. In other words, evolution might favor the rare apple that falls far from the tree. Janzen illustrated this idea with a graphical model something like the one at right. As distance from the parent increases, the probability that a seed will arrive and take root grows smaller (red curve), but the probability that any such seedling will survive to maturity goes up (blue curve). The overall probability of successful reproduction is the product of these two factors (purple curve); it has a peak where the red and blue curves cross.

The Connell-Janzen theory predicts that trees of the same species will be widely dispersed in the forest, leaving plenty of room in between for trees of other species, which will have a similarly scattered distribution. The process leads to anti-clustering: conspecific trees are farther apart on average than they would be in a completely random arrangement. This pattern was noted by Alfred Russel Wallace in 1878, based on his own long experience in the tropics:

If the traveller notices a particular species and wishes to find more like it, he may often turn his eyes in vain in every direction. Trees of varied forms, dimensions, and colours are around him, but he rarely sees any one of them repeated. Time after time he goes towards a tree which looks like the one he seeks, but a closer examination proves it to be distinct. He may at length, perhaps, meet with a second specimen half a mile off, or may fail altogether, till on another occasion he stumbles on one by accident.

My toy model of the social-distancing process implements a simple rule. When a tree dies, it cannot be replaced by another tree of the same species, nor may the replacement match the species of any of the eight nearest neighbors surrounding the vacant site. Thus trees of the same species must have at least one other tree between them. To say the same thing in another way, each tree has an exclusion zone around it, where other trees of the same species cannot grow.

It turns out that social distancing is a remarkably effective way of preserving diversity. When you click Start, the model comes to life with frenetic activity, blinking away like the front panel of a 1950s Hollywood computer. Then it just keeps blinking; nothing else ever really happens. There are no spreading tides of color as a successful species gains ground, and there are no extinctions. The variance in population size is even lower than it would be with a completely random and uniform assignment of species to sites. This stability is apparent in the timeline graph below, where the 10 species tightly hug the mean abundance of 62.5:

When I finished writing this program and pressed the button for the first time, long-term survival of all ten species was not what I expected to see. My thoughts were influenced by some pencil-and-paper doodling. I had confirmed that only four colors are needed to create a pattern where no two trees of the same color are adjacent horizontally, vertically, or on either of the diagonals. One such pattern is shown at right. I suspected that the social-distancing protocol might cause the model to condense into such a crystalline state, with the loss of species that don’t appear in the repeated motif. I was wrong. Although four is indeed the minimum number of colors for a socially distanced two-dimensional lattice, there is nothing in the algorithm that encourages the system to seek the minimum.
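
The four-color claim is easy to check by machine. The sketch below uses one possible motif, a repeating 2-by-2 block in which the color depends only on the parity of the row and the column. This may or may not be the pattern in the figure, and the function names are hypothetical. The check covers a flat grid without wraparound.

```python
def four_color(row, col):
    """Color index 0..3 determined by the parities of row and column."""
    return 2 * (row % 2) + (col % 2)

def has_adjacent_match(size):
    """True if any two neighbors (including diagonals) share a color.

    The grid is size x size with no wraparound at the edges.
    """
    for r in range(size):
        for c in range(size):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0):
                        continue                  # skip the site itself
                    if 0 <= rr < size and 0 <= cc < size:
                        if four_color(r, c) == four_color(rr, cc):
                            return True
    return False
```

Any two neighbors differ by one step in the row, the column, or both, so at least one parity changes and the colors cannot match.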

After seeing the program in action, I was able to figure out what keeps all the species alive. There’s an active feedback process that puts a premium on rarity. Suppose that oaks currently have the lowest frequency in the population at large. As a result, oaks are least likely to be present in the exclusion zone surrounding any vacancy in the forest, which means in turn they are most likely to be acceptable as a replacement. As long as the oaks remain rarer than the average, their population will tend to grow. Symmetrically, individuals of an overabundant species will have a harder time finding an open site for their offspring. All departures from the mean population level are self-correcting.

The initial configuration in this model is completely random, ignoring the restrictions on adjacent conspecifics. Typically there are about 200 violations of the exclusion zone in the starting pattern, but they are all eliminated in the first few thousand time steps. Thereafter the rules are obeyed consistently. Note that with ten species and an exclusion zone consisting of nine sites, there is always at least one species available to fill a vacancy. If you try the experiment with nine or fewer species, some vacancies must be left as gaps in the forest. I should also mention that the model uses toroidal boundary conditions: the right edge of the grid is adjacent to the left edge, and the top wraps around to the bottom. This ensures that all sites in the lattice have exactly eight neighbors.
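
The rules described above can be sketched in a few lines of Python. This is an illustration, not the program running on this page. It assumes that the dying tree is chosen uniformly at random and that the replacement is chosen uniformly from the species not found in the nine-site exclusion zone. All names are hypothetical.

```python
import random

SIZE = 25          # 25 x 25 = 625 trees
N_SPECIES = 10

def zone(grid, r, c):
    """Set of species in the 3x3 block centered on (r, c), wrapping at edges."""
    return {grid[(r + dr) % SIZE][(c + dc) % SIZE]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)}

def replace_tree(grid, r, c, rng):
    """Kill the tree at (r, c) and plant a species absent from its zone."""
    excluded = zone(grid, r, c)          # the dying tree plus eight neighbors
    allowed = [s for s in range(N_SPECIES) if s not in excluded]
    grid[r][c] = rng.choice(allowed)     # never empty: at most 9 of 10 excluded

def count_violations(grid):
    """Number of trees with at least one same-species neighbor."""
    bad = 0
    for r in range(SIZE):
        for c in range(SIZE):
            neighbors = [grid[(r + dr) % SIZE][(c + dc) % SIZE]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (dr, dc) != (0, 0)]
            if grid[r][c] in neighbors:
                bad += 1
    return bad

def simulate(steps, seed=0):
    """Start from a random forest and apply the replacement rule `steps` times."""
    rng = random.Random(seed)
    grid = [[rng.randrange(N_SPECIES) for _ in range(SIZE)]
            for _ in range(SIZE)]
    for _ in range(steps):
        replace_tree(grid, rng.randrange(SIZE), rng.randrange(SIZE), rng)
    return grid
```

Because a new tree never matches any of its neighbors, a replacement cannot create a violation. The count returned by `count_violations` can only fall as the run proceeds.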

Connell and Janzen envisioned much larger exclusion zones, and correspondingly larger rosters of species. Implementing such a model calls for a much larger computation. A recent paper by Taal Levi et al. reports on such a simulation. They find that the number of surviving species and their spatial distribution remain reasonably stable over long periods (200 billion tree replacements).

Could the Connell-Janzen mechanism also work in temperate-zone forests? As in the tropics, the trees of higher latitudes do have specialized enemies, some of them notorious—the vectors of Dutch elm disease and chestnut blight, the emerald ash borer, the gypsy moth caterpillars that defoliate oaks. The hemlocks in my neighborhood are under heavy attack by the woolly adelgid, a sap-sucking bug. Thus the forces driving diversification and anti-clustering in the Connell-Janzen model would seem to be present here. However, the observed spatial structure of the northern forests is somewhat different. Social distancing hasn’t caught on here. The distribution of trees tends to be a little clumpy, with conspecifics gathering in small groves.

Plague-driven diversification is an intriguing idea, but, like the other theories mentioned above, it has certain plausibility challenges. In the case of niche assembly, we need to find a unique limiting resource for every species. In neutral drift, we have to ensure that selection really is neutral, assigning exactly equal fitness to trees that look quite different. In the Connell-Janzen model we need a specialized pest for every species, one that’s powerful enough to suppress all nearby seedlings. Can it be true that every tree has its own deadly nemesis?

There’s also reason to doubt the model’s robustness, its resistance to disruption. Suppose an invasive species shows up in the socially distanced tropical forest—a tree new to the continent, with no enemies anywhere nearby. What happens then? The program below offers an answer. Start it running, and then click the Invade button. (You might have to click Invade more than once, since a new arrival may die out before becoming established. Also note that I have slowed down this simulation, lest it all be over in a flash.)

Lacking enemies, the invader can flout the social-distancing rules, occupying any forest vacancy regardless of neighborhood. Once the invader has taken over a majority of the sites, the distancing rules become less onerous, but by then it’s too late for the other species.

One further half-serious thought on the Connell-Janzen theory: In the war between trees and their enemies, humanity has clearly chosen sides. We would wipe out those insects and fungi and other tree-killing pests if we could figure out how to do so. Everyone would like to bring back the elms and the chestnuts, and save the eastern hemlocks before it’s too late. On this point I’m as sentimental as the next treehugger. But if Connell and Janzen are correct, and if their theory applies to temperate-zone forests, eliminating all the enemies would actually cause a devastating collapse of tree diversity. Without pest pressure, competitive exclusion would be unleashed, and we’d be left with one-tree forests everywhere we look.

Species diversity in the forest is now matched by theory diversity in the biology department. The three ideas I have discussed here—niche assembly, neutral drift, and social distancing—all seem to be coexisting in the minds of ecologists. And why not? Each theory is a success in the basic sense that it can overcome competitive exclusion. Each theory also makes distinctive predictions. With niche assembly, every species must have a unique limiting resource. Neutral drift generates unusual population dynamics, with species continually coming and going, although the overall number of species remains stable. Social distancing entails spatial anticlustering.

How can we choose a winner among these theories (and perhaps others)? Scientific tradition says nature should have the last word. We need to conduct some experiments, or at least go out in the field and make some systematic observations, then compare those results with the theoretical predictions.

There have been quite a few experimental tests of competitive exclusion. For example, Thomas Park and his colleagues ran a decade-long series of experiments with two closely related species of flour beetles. One species or the other always prevailed. In 1969 Francisco Ayala reported on a similar experiment with fruit flies, in which he observed coexistence under circumstances that were thought to forbid it. Controversy flared, but in the end the result was not to overturn the theory but to refine the mathematical description of where exclusion applies.

Wouldn’t it be grand to perform such experiments with trees? Unfortunately, they are not so easily grown in glass vials. And conducting multigenerational studies of organisms that live longer than we do is a tough assignment. With flour beetles, Park had time to observe more than 100 generations in a decade. With trees, the equivalent experiment might take 10,000 years. But field workers in biology are a resourceful bunch, and I’m sure they’ll find a way. In the meantime, I want to say a few more words about theoretical, mathematical, and computational approaches to the problem.

Ecology became a seriously mathematical discipline in the 1920s, with the work of Alfred J. Lotka and Vito Volterra. To explain their methods and ideas, one might begin with the familiar fact that organisms reproduce themselves, thereby causing populations to grow. Mathematized, this observation becomes the differential equation

\[\frac{d x}{d t} = \alpha x,\]

which says that the instantaneous rate of change in the population $x$ is proportional to $x$ itself—the more there are, the more there will be. The constant of proportionality $\alpha$ is called the intrinsic reproduction rate; it is the rate observed when nothing constrains or interferes with population growth. The equation has a solution, giving $x$ as a function of $t$:

\[x(t) = x_0 e^{\alpha t},\]

where $x_0$ is the initial population. This is a recipe for unbounded exponential growth (assuming that $\alpha$ is positive). In a finite world such growth can’t go on forever, but that needn’t worry us here.

(The original version of this essay, published on 4 September 2020, had serious errors in the description of the Lotka-Volterra equations. The problem was brought to my attention in a comment by Matt on 16 September. The corrected version here was published on 19 September.)

Let’s introduce a second species, $y$, that obeys the same kind of growth law but has its own intrinsic reproductive rate $\beta$. Now we can ask what happens if the two species interact. Lotka and Volterra (working independently) first considered the case where $y$ preys upon $x$. They proposed the following pair of equations, with interaction terms proportional to $x y$:

\[\begin{align}
\frac{d x}{d t} &= \alpha x -\gamma x y\\
\frac{d y}{d t} &= -\beta y + \delta x y
\end{align}\]

The prey species $x$ prospers when left to itself, but suffers as the product $x y$ increases. The situation is just the opposite for the predator $y$, which can’t get along alone ($x$ is its only food source) and whose population swells when $x$ and $y$ are both abundant.

Competition is a more symmetrical relation: Either species can thrive when alone, and the interaction between them is negative for both parties.

\[\begin{align}
\frac{d x}{d t} &= \alpha x -\gamma x y\\
\frac{d y}{d t} &= \beta y - \delta x y
\end{align}\]

The Lotka-Volterra equations yield some interesting behavior. At any instant $t$, the state of the two-species system can be represented as a point in the $x, y$ plane, whose coordinates are the two population levels. For some combinations of the $\alpha, \beta, \gamma, \delta$ parameters, there’s a point of stable equilibrium. Once the system has reached this point, it stays put, and it returns to the same neighborhood following any small perturbation. Other equilibria are unstable: The slightest departure from the balance point causes a major shift in population levels. And the really interesting cases have no stationary point; instead, the state of the system traces out a closed loop in the $x, y$ plane, continually repeating a cycle of states. The cycles correspond to oscillations in the two population levels. Such oscillations have been observed in many predator-prey systems. Indeed, it was curiosity about the periodic swelling and contraction of populations in the Canadian fur trade and Adriatic fisheries that inspired Lotka and Volterra to work on the problem.
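
The closed loops of the predator-prey system can be explored numerically. The sketch below integrates the predator-prey pair of equations with a standard fourth-order Runge-Kutta step. The parameter values and the starting point are hypothetical, chosen only for illustration. The function `invariant` computes $\delta x - \beta \ln x + \gamma y - \alpha \ln y$, a quantity whose time derivative is zero along any exact solution, which is why the orbit closes on itself.

```python
import math

def derivatives(x, y, alpha, beta, gamma, delta):
    """Right-hand sides of the predator-prey equations."""
    return alpha * x - gamma * x * y, -beta * y + delta * x * y

def rk4_step(x, y, dt, params):
    """Advance (x, y) by one fourth-order Runge-Kutta step of size dt."""
    k1x, k1y = derivatives(x, y, *params)
    k2x, k2y = derivatives(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, *params)
    k3x, k3y = derivatives(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, *params)
    k4x, k4y = derivatives(x + dt * k3x, y + dt * k3y, *params)
    return (x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6,
            y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6)

def invariant(x, y, alpha, beta, gamma, delta):
    """Conserved quantity of the predator-prey equations."""
    return delta * x - beta * math.log(x) + gamma * y - alpha * math.log(y)

def trajectory(x0, y0, params, dt=0.001, steps=20000):
    """List of (x, y) points visited, starting from (x0, y0)."""
    x, y = x0, y0
    points = [(x, y)]
    for _ in range(steps):
        x, y = rk4_step(x, y, dt, params)
        points.append((x, y))
    return points

# Hypothetical parameters (alpha, beta, gamma, delta); the stationary
# point is then x = beta/delta = 25, y = alpha/gamma = 10.
params = (1.0, 0.5, 0.1, 0.02)
orbit = trajectory(40.0, 10.0, params)
```

Plotting the points in `orbit` traces a loop around the stationary point. Plotting $x$ and $y$ against time shows the two populations oscillating out of phase.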

The 1960s and 70s brought more surprises. Studies of equations very similar to the Lotka-Volterra system revealed the phenomenon of “deterministic chaos,” where the point representing the state of the system follows an extremely complex trajectory, though its wanderings are not random. There ensued a lively debate over complexity and stability in ecosystems. Is chaos to be found in natural populations? Is a community with many species and many links between them more or less stable than a simpler one?

Viewed as abstract mathematics, there’s much beauty in these equations, but it’s sometimes a stretch mapping the math back to the biology. For example, when the Lotka-Volterra equations are applied to species competing for resources, the resources appear nowhere in the model. The mathematical structure describes something more like a predator-predator interaction—two species that eat each other.

Even the organisms themselves are only a ghostly presence in these models. The differential equations are defined over the continuum of real numbers, giving us population levels or densities, but not individual plants or animals—discrete things that we can count with integers. The choice of number type is not of pressing importance as long as the populations are large, but it leads to some weirdness when a population falls to, say, 0.001—a millitree. Using finite-difference equations instead of differential equations avoids this problem, but the mathematics gets messier.

Another issue is that the equations are rigidly deterministic. Given the same inputs, you’ll always get exactly the same outputs—even in a chaotic model. Determinism rules out modeling anything like neutral ecological drift. Again, there’s a remedy: stochastic differential equations, which include a source of noise or uncertainty. With models of this kind, the answers produced are not numbers but probability distributions. You don’t learn the population of $x$ at time $t$; you get a probability $P(x, t)$ in a distribution with a certain mean and variance. Another approach, called Markov Chain Monte Carlo (MCMC), uses a source of randomness to sample from such distributions. But the MCMC method moves us into the realm of computational models rather than mathematical ones.

Computational methods generally allow a direct mapping between the elements of the model and the things being modeled. You can open the lid and look inside to find the trees and the resources, the births and the deaths. These computational objects are not quite tangible, but they’re discrete, and always finite. A population is neither a number nor a probability distribution but a collection of individuals. I find models of this kind intellectually less demanding. Writing a differential equation that captures the dynamics of a biological system requires insight and intuition. Writing a program to implement a few basic events in the life of a forest—a tree dies, another takes its place—is far easier.

The six little models included in this essay serve mainly as visualizations; they expend most of their computational energy painting colored dots on the screen. But larger, more ambitious models are certainly feasible, as in the work of Taal Levi et al. mentioned above.

However, if computational models are easier to create, they can also be harder to interpret. If you run a model once and species $X$ goes extinct, what can you conclude? Not much. On the next run $X$ and $Y$ might coexist. To make reliable inferences, you need to do some statistics over a large ensemble of runs—so once again the answer takes the form of a probability distribution.
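
Here is a small illustration of the ensemble idea, using a stripped-down two-species drift process rather than any of the models on this page. It assumes equal starting populations and a replacement drawn at random from the population as it stood before the death. The function names are hypothetical. A single run tells you only which species happened to lose. A thousand runs estimate the probability.

```python
import random

def x_goes_extinct(n_trees, rng):
    """One run of two-species neutral drift; True if species X dies out."""
    x = n_trees // 2                              # equal numbers of X and Y
    while 0 < x < n_trees:
        dies_x = rng.random() < x / n_trees       # is the dying tree an X?
        born_x = rng.random() < x / n_trees       # is the replacement an X?
        x += int(born_x) - int(dies_x)
    return x == 0

def extinction_fraction(runs=1000, n_trees=20, seed=0):
    """Fraction of runs in which X is the species that disappears."""
    rng = random.Random(seed)
    losses = sum(x_goes_extinct(n_trees, rng) for _ in range(runs))
    return losses / runs
```

Since the two species are interchangeable, the fraction should come out near one half, with run-to-run scatter that shrinks as the ensemble grows.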

The concreteness and explicitness of Monte Carlo models is generally a virtue, but it has a darker flip side. Where a differential equation model might apply to any “large” population, that vague description won’t work in a computational context. You have to name a number, even though the choice is arbitrary. The size of my forest models, 625 trees, was chosen for mere convenience. With a larger grid, say $100 \times 100$, you’d have to wait millions of time steps to see anything interesting happen. Of course the same issue arises with experiments in the lab or in the field.

Both kinds of model are always open to a charge of oversimplifying. A model is the Marie Kondo version of nature—relentlessly decluttered and tidied up. Sometimes important parts get tossed out. In the case of the forest models, it troubles me that trees have no life history. One dies, and another pops up full grown. Also missing from the models are pollination and seed dispersal, and rare events such as hurricanes and fires that can reshape entire forests. Would we learn more if all those aspects of life in the woods had a place in the equations or the algorithms? Perhaps, but where do you stop?

My introduction to models in ecology came through a book of that title by John Maynard Smith, published in 1974. I recently reread it, learning more than I did the first time through. Maynard Smith makes a distinction between simulations, useful for answering questions about specific problems or situations, and models, useful for testing theories. He offers this advice: “Whereas a good simulation should include as much detail as possible, a good model should include as little as possible.”

April Fool Redux

Brian Hayes — Sat, 28 Mar 2020 16:32:53 +0000

I have a scheme to rescue the swooning U.S. economy. My idea partakes of the silliness that always accompanies the coming of April, but it’s not entirely an April Fool joke. T. S. Eliot told us that April is the cruelest month. I’m proposing that if we take a double dose of April, it might turn kinder. Let me explain.

The economy’s swan dive is truly breathtaking. In response to the coronavirus threat we have shut down entire commercial sectors: most retail stores, restaurants, sports and entertainment. Travel and tourism are moribund. Manufacturing is threatened too, not only by concerns about workplace contagion but also by softening demand and disrupted supply chains. All of the automakers have closed their assembly plants in the U.S., and Boeing has stopped production at its plants near Seattle, which employ 70,000. Thus it comes as no great surprise—though it’s still a shock—that 3,283,000 Americans filed claims for unemployment compensation last week. That’s by far the highest weekly tally since the program was created in the 1930s. It’s almost five times the previous record from 1982, and 15 times the average for the first 10 weeks of this year. The graph is a dramatic hockey stick:

New weekly claims for unemployment compensation set an all-time record in the week ending 21 March: 3,283,000 claims. The previous record was less than 700,000, and in recent years claims have generally hovered a little above 200,000 per week. When a graph looks like this one, it’s usually because somebody misplaced a decimal point in preparing the data. This one’s for real. Data from U.S. Department of Labor.

Here’s the same graph, updated to include new unemployment claims for the weeks ending 28 March and 4 April. The three-week total of new claims is over 16 million, which is roughly 10 percent of the American workforce. [Edited 2020-04-02 and 2020-04-09.]

Claims for unemployment compensation totaled almost 13.5 million for the weeks ending 28 March and 4 April. Data from U.S. Department of Labor.

I’ve been brooding about the economic collapse for a couple of weeks. I worry that the consequences of unemployment and business failures could be even more dire than the direct harm caused by the virus. Recovering from a deep recession can take years, and those who suffer most are the poor and the young. I don’t want to see millions of lives blighted and the dreams of a generation thwarted. But Covid-19 is still rampant. Relaxing our defenses could swamp the hospitals and elevate the death rate. No one is eager to take that risk (except perhaps Donald Trump, who dreams of an Easter resurrection).

The other day I was squabbling about these economic perils with the person I shelter-in-place with. Yes, she said, we’re facing a steep decline, but what makes you so sure it’s going to last for years? Why can’t the economy bounce back? I patiently mansplained about the irreversibility of events like bankruptcy and eviction and foreclosure, which are almost as hard to undo as death. That argument didn’t settle the matter, but we let the subject drop. (We’re hunkered down 24/7 here; we need to get along.)

In the middle of the night, the question came back to me. Why won’t it bounce back? Why can’t we just pause the economy like a video, then a month or two later press the play button to resume where we left off?

One problem with pausing the economy is that people can’t survive in suspended animation. They need a continuous supply of air, water, food, shelter, TV shows, and toilet paper. You’ve got to keep that stuff coming, no matter what. But people are only part of the economy. There are also companies, corporations, unions, partnerships, non-profit associations—all the entities we create to organize the great game of getting and spending. A company, considered as an abstraction, has no need for uninterrupted life support. It doesn’t eat or breathe or get bored. So maybe companies could be put in the deep freeze and then thawed when conditions improve.

Lying awake in the dark, I told myself a story:

Clare owns a little café at the corner of Main and Maple in a New England college town. In the middle of March, when the college sent the students home, she lost half her customers. Then, as the epidemic spread, the governor ordered all restaurants to close. Clare called up Rory the Roaster to cancel her order for coffee beans, pulled her ad from the local newspaper, and taped a “C U Soon” sign to the door. Then she sat down with her only employee, Barry the Barista, to talk about the bad news.

Barry was distraught. “I have rent coming due, and my student loan, and a car payment.”

“I wish I could be more help,” Clare replied. “But the rent on the café is also due. If I don’t pay it, we could lose the lease, and you won’t have a job to come back to. We’ll both be on the street.” They sat glumly in the empty shop, six feet apart. Seeing the lost-puppy look in Barry’s eyes, Clare added: “Let me call up Larry the Landlord and see if we can work something out.”

Larry was sympathetic. He’d been hearing from lots of tenants, and he genuinely wanted to help. But he told Clare what he’d told the rest: “The building has a mortgage. If I don’t pay the bank, I’ll lose the place, and we’ll all be on the street.”

You can guess what Betty the Banker said. “I have obligations to my depositors. Accounts earn interest every month. People are redeeming CDs. If I don’t maintain my cash reserves, the FDIC will come in and seize our assets. We’ll all be on the street.”

Everyone in this little fable wants to do the right thing. No one wants to put Clare out of business or leave Barry without an income. And yet my nocturnal meditations come to a dark end, in which the failure of Clare’s corner coffee shop triggers a worldwide recession. Barry gets evicted, Larry defaults on his loan, Betty’s bank goes bottom up. Rory the Roaster also goes under, and the Colombian farm that supplies the beans lays off all its workers. With Clare’s place now an empty storefront, there are fewer shoppers on Main Street, and the bookstore a few doors away folds up. The newspaper where Clare used to advertise ceases publication. The town’s population dwindles. The college closes.

At this point I feel like Ebenezer Scrooge pleading with the Ghost of Christmas Future to save Tiny Tim, or George Bailey desperate to escape the mean streets of Potterville and get back to the human warmth of Bedford Falls. Surely there must be some way to avert this catastrophe.

Here’s my idea. The rent and loan payments that cause all this economic mayhem are different from the transactions that Clare handles at her cash register. In her shopkeeper economy, money comes in only when coffee goes out; the two events are causally connected and simultaneous. And if she’s not selling any coffee, she can stop buying beans. The payment of her rent, on the other hand, is triggered by nothing but the ticking of the clock. She is literally buying time. Now the remedy is obvious: Stop the clock, or reset it. This is easier than you might think. We just go skipping down the Yellow Brick Road and petition the wizard to issue a proclamation. The wizard’s decree says this:

In the year 2020, April 30 shall be followed by April 1.

In other words, we’re going to do April, and then we’re going to do it again. April is followed by April Redux, and only after we get to the end of that month do we start on May.

(Redux is Latin for “a thing brought back or restored.” The word was introduced—or brought back—into the modern American vocabulary by John Updike’s 1971 novel Rabbit Redux, having been used earlier in titles of works by Dryden and Trollope. It’s one of those words I’ve always avoided saying aloud because of doubt about the pronunciation. The OED says it’s re-ducks.)

How does this fiddling with the calendar help Clare? Consider what happens when the calendar flips from April 30 to April 1 Redux. It’s the first of the month, and the rent is due. But wait! No it’s not. She already paid the rent for April, a month ago. It won’t be due again until May 1, and that’s a month away. It’s the same with Larry’s mortgage payment, and Barry’s car loan. Of course stopping the clock cuts both ways. If you get a monthly pension or Social Security payment, that won’t be coming in April Redux, nor will the bank pay you interest on your deposits.

By means of this sly maneuver we have broken a vicious cycle. Larry doesn’t get a rent check from Clare, but he also doesn’t have to write a mortgage-loan check to Betty, who doesn’t have to make payments to her depositors and creditors. Each of them gets a month’s reprieve. With this extra slack, maybe Clare can keep Barry on the payroll and still have a viable business when her customers finally come out of hiding.

But isn’t this just a sneaky scheme to deprive the creditor class of money they are legally entitled to receive under the terms of contracts that both parties willingly signed? Yes it is, and a clever one at that. It is also a way to more equitably distribute the risks and costs of the present crisis. At the moment the burden falls heavily on Clare and Barry, who are forbidden to sell me a cup of coffee; but Larry and Betty are free to go on collecting their rents and loan payments. In addition to spreading around the financial pain, the scheme might also reduce the likelihood of a major, lasting economic contraction, which none of these characters would enjoy.

In spite of these appeals to the greater good of society as a whole, you may still feel there’s something dishonest about April Redux. If so, we can have the wizard issue a second decree:

In the 30 months from May 2020 through October 2022,
every month shall have one day fewer than the usual number.

During this period every scheduled payment will come due a day sooner than usual. At the end, lenders and borrowers are even-steven.

The last time anybody tinkered with the calendar in the English-speaking world was 1752, when the British Isles and their colonies finally adopted the Gregorian calendar (introduced elsewhere as early as 1582). By act of parliament, Wednesday September 2 was followed by Thursday September 14. Many accounts of this event tell of rioting in the streets, as ignorant mobs complained that parliament had stolen 11 days of their lifespan. Later scholarship shows that the riots were an invention of imaginative or gullible historians, but there was concern and controversy about the proper calculation of wages, rents, and interest in the abbreviated September. (My source for this revisionist history is: Poole, Robert. “Give Us Our Eleven Days!”: Calendar Reform in Eighteenth-Century England. Past & Present, no. 149, 1995, pp. 95–139. JSTOR (paywall).)

Riots in the streets are clearly a no-no in this period of social distancing, so presumably we won’t have to worry about mob action when April repeats itself. Besides, who’s going to complain about having 30 days added to their lifespan? I suppose there may be some grumbling from people with April birthdays, who think they are suddenly two years older. And back-to-back April Fool days could test the nation’s patience.

Although my plan for an April do-over is presented in the spirit of the season, I do think it illuminates a serious issue—an aspect of modern commerce that makes the current situation especially dangerous. Our problem is not that we have shut down the whole economy. The problem is that we’ve shut down only half the economy. The other half carries on with business as usual, creating imbalances that leave the whole edifice teetering on the brink of collapse.

The $2 trillion rescue package enacted last week addresses some of these issues. The cash handout for individual taxpayers, and a sweetening of unemployment benefits, should help Barry muddle through and pay his bills. A program of loans for small businesses could keep Clare afloat, and the loan would be forgiven if she keeps Barry on the payroll. These are thoughtful and useful measures, and a refreshing change from earlier bailout practices. We are not sending all the funds directly to investment banks and insurance companies. But a big share will wind up there anyway, since we are effectively subsidizing the rent and mortgage payments of individuals and small businesses. I wonder if it wouldn’t be fairer, more effective, and less expensive to curtail some of those payments. I’m not suggesting that we shut down the banks along with the shops; that would make matters worse. But we might require financial institutions to defer or forgo certain payments from distressed small businesses and the employees they lay off.

Voluntary efforts along these lines promise to soften the impact for at least a few lucky workers and businesses that have lost their revenue stream. In my New England college town, some of the banks are offering to defer monthly payments on mortgage loans, and there’s social pressure on landlords to defer rents.

But don’t count on everyone to follow that program. On March 31, following announcements of layoffs and furloughs by Macy’s, Kohl’s, and other large retailers, the New York Times reported: “Last week, Taubman, a large owner of shopping malls, sent a letter to its tenants saying that the company expected them to keep paying their rent amid the crisis. Taubman, which oversees well-known properties like the Mall at Short Hills in New Jersey, reminded its tenants that it also had obligations to meet, and was counting on the rent to pay lenders and utilities.” [Added 2020-03-31.]

The coronavirus crisis is being treated as a unique event (and I certainly hope we’ll never see the like of it again). The associated economic crisis is also unique, at least within my memory. Most panics and recessions have their roots in the financial markets. At some point investors realize that tech stocks with an infinite price-to-earnings ratio are not such a bargain after all, or that bundling together thousands of risky mortgages doesn’t actually make them less risky. When the bubble bursts, the first casualties are on Wall Street; only later do the ripple effects reach Clare’s café. Now, we are seeing a rare disturbance that travels in the opposite direction. Do we know how to fix it?

MathJax turns 3.0

Brian Hayes — Sat, 14 Mar 2020 18:06:50 +0000

When I launched bit-player.org in 2006, displaying any sort of mathematical notation on the web was torture. I would typeset an equation in LaTeX, convert the output to a JPEG image, upload the image file to a directory on the server, and embed a reference to the file in an HTML img tag. The process was cumbersome and the product was ugly. In 2009 I wrote an American Scientist article whining about this sorry state of affairs—but at just that moment an elegant solution was coming on the scene. Davide Cervone of Union College had created a program called jsMath, which could process TeX commands placed directly in an HTML document. For example, I could write:

e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots

and it would appear on your screen as:

\[e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots\]

All the work of parsing the TeX code and typesetting the math was done by a JavaScript program downloaded into your browser along with the rest of the web page.

Cervone’s jsMath soon evolved into MathJax, an open-source project initially supported by the AMS and SIAM. There are now about two dozen sponsors, and the project is under the aegis of NumFOCUS. Cervone remains the principal author, though not the only contributor. (From the MathJax web page: “The MathJax team consists of Davide Cervone and Volker Sorge. Regular contributors include Christian Lawson-Perfect, Omar Al-Ithawi, and Peter Krautzberger.”)

MathJax has made a big difference in my working life, transforming a problem into a pleasure. Putting math on the web is fun! Sometimes I do it just to show off. Furthermore, the software has served as an inspiration as well as a helpful tool. Until I saw MathJax in action, it simply never occurred to me that interesting computations could be done within the JavaScript environment of a web browser, which I had thought was there mainly to make things blink and jiggle. With the example of MathJax in front of me, I realized that I could not only display mathematical ideas but also explore and animate them within a web page.

Last fall I began hearing rumors about MathJax 3.0, “a complete rewrite of MathJax from the ground up using modern techniques.” It’s the kind of announcement that inspires both excitement and foreboding. What will the new version add? What will it take away? What will it fix? What will it break?

Before committing all of bit-player to the new version, I thought I would try a small-scale experiment. I have a standalone web page that makes particularly tricky use of MathJax. The page is a repository of the Dotster programs extracted from a recent bit-player post, My God, it’s full of dots. In January I got the Dotster page running with MathJax 3.

Most math in web documents is static content: An equation needs to be formatted once, when the page is first displayed, and it never changes after that. The initial typesetting is handled automatically by MathJax, in both the old and the new versions. As soon as the page is downloaded from the server, MathJax makes a pass through the entire text, identifying elements flagged as TeX code and replacing them with typeset math. Once that job is done, MathJax can go to sleep.

The Dotster programs are a little different; they include equations that change dynamically in response to user input. Here’s an example:

The slider on the left sets a numerical value that gets plugged into the two equations on the right. Each time the slider is moved, the equations need to be updated and reformatted. Thus with each change to the slider setting, MathJax has to wake up from its slumbers and run again to typeset the altered content.

The MathJax program running in the little demo above is the older version, 2.7. Cosmetically, the result is not ideal. With each change in the slider value, the two equations contract a bit, as if pinched between somebody’s fingers, and then snap back to their original size. They seem to wink at us. Also, a small gray progress indicator pops up in the lower left corner and hangs around just long enough to grab your attention. It says: “Typesetting math: 100%.” (The progress banner would normally appear in the lower left corner of the browser window, where it is less intrusive. It has a more prominent position here because the demo is encapsulated in an iFrame, which acts as a window within the window. This structure is necessary in order to run two versions of MathJax in a single page.) I find these visual distractions pretty annoying. They are analogous to the dreaded FOUC—the “flash of unstyled content”—that appears when a web browser displays the text of a page before the associated stylesheets are fully loaded and processed.

The winking effect is caused by a MathJax feature called Fast Preview. The system does a quick-and-dirty rendering of the math content without calculating the correct final sizes for the various typographic elements. (Evidently that calculation takes a little time.) You can turn off Fast Preview by right-clicking or control-clicking one of the equations and then navigating through the submenus shown at right. However, you’ll probably judge the result to be worse rather than better. Without Fast Preview, you’ll get a glimpse of the raw TeX commands. Instead of winking, the equations do jumping jacks.

I am delighted to report that all of this visual noise has been eliminated in the new MathJax. On changing a slider setting, the equations are updated in place, with no unnecessary visual fuss. And there’s no need for a progress indication, because the change is so quick it appears to be instantaneous. See for yourself:

Thus version 3 looks like a big win. There’s a caveat: Getting it to work did not go quite as smoothly as I had hoped. Nevertheless, this is a story with a happy ending.

If you have only static math content in your documents, making the switch to MathJax 3 is easy. In your HTML file you change a URL to load the new MathJax version, and convert any configuration options to a new format. As it happens, all the default options work for me, so I had nothing to convert. What’s most important about the upgrade path is what you don’t need to do. In most cases you should not have to alter any of the TeX commands present in the HTML files being processed by MathJax. (There are a few small exceptions.)

With dynamic content, further steps are needed. Here is the JavaScript statement I used to reawaken the typesetting engine in MathJax version 2.7:

MathJax.Hub.Queue(["Typeset", MathJax.Hub, mathjax_demo_box]);

The statement enters a Typeset command into a queue of pending tasks. When the command reaches the front of the queue, MathJax will typeset any math found inside the HTML element designated by the identifier mathjax_demo_box, ignoring the rest of the document.

In MathJax 3, the documentation suggested I could simply replace this command with a slightly different and more direct one:

MathJax.typeset([mathjax_demo_box]);

I did that. It didn’t work. When I moved the slider, the displayed math reverted to raw TeX form, and I found an error message in the JavaScript console:

What has gone wrong here? JavaScript’s appendChild method adds a new node to the treelike structure of an HTML document. It’s like hanging an ornament from some specified branch of a Christmas tree. The error reported here indicates that the specified branch does not exist; it is null.

Let’s not tarry over my various false starts and wrong turns as I puzzled over the source of this bug. I eventually found the cause and the solution in the “issues” section of the MathJax repository on GitHub. Back in September of last year Mihai Borobocea had reported a similar problem, along with the interesting observation that the error occurs only when an existing TeX expression is being replaced in a document, not when a new expression is being added. Borobocea had also discovered that invoking the procedure MathJax.typesetClear() before MathJax.typeset() would prevent the error.

A comment by Cervone explains much of what’s going on:

You are correct that you should use MathJax.typesetClear() if you have removed previously typeset math from the page. (In version 3, there is information stored about the math in a list of typeset expressions, and if you remove typeset math from the page and replace it with new math, that list will hold pointers to math that no longer exists in the page. That is what is causing the error you are seeing . . . )

I found that adding MathJax.typesetClear() did indeed eliminate the error. As a practical matter, that solved my problem. But Borobocea pointed out a remaining loose end. Whereas MathJax.typeset([mathjax_demo_box]) operates only on the math inside a specific container, MathJax.typesetClear() destroys the list of math objects for the entire document, an act that might later have unwanted consequences. Thus it seemed best to reformat all the math in the document whenever any one expression changes. This is inefficient, but with the 20-some equations in the Dotster web page the typesetting is so fast there’s no perceptible delay.
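The workaround described above can be sketched as a small helper. This is only an illustration, not the actual Dotster code: the name `retypesetAll` and the `updateDom` callback are hypothetical, and the MathJax object is passed in as a parameter (assumed to be the version 3 global) so that the sketch does not depend on a browser. The two MathJax calls are the ones named in the text.

```javascript
// Hypothetical helper: replace some TeX in the page, then re-typeset everything.
// `mj` is assumed to be the global MathJax object from version 3.
// `updateDom` is a caller-supplied function that swaps in the new TeX source.
function retypesetAll(mj, updateDom) {
  mj.typesetClear(); // forget the list of previously typeset expressions
  updateDom();       // replace the old TeX with the new TeX in the page
  mj.typeset();      // no argument: typeset all math in the document
}
```

In a slider handler this might be called as `retypesetAll(MathJax, () => { /* write new TeX into the page */ })`. Calling `typeset()` with no argument reflects the choice to reformat the whole document rather than a single container.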

In January a fix for this problem was merged into MathJax 3.0.1, which is now the shipping version. Cervone’s comment on this change says that it “prevents the error message,” which left me with the impression that it might suppress the message without curing the error itself. But as far as I can tell the entire issue has been cleared up. There’s no longer any need to invoke MathJax.typesetClear().

In my first experiments with version 3.0 I stumbled onto another bit of weirdness, but it turned out to be a quirk of my own code, not something amiss in MathJax.

I was seeing occasional size variations in typeset math that seemed reminiscent of the winking problem in version 2.7. Sometimes the initial, automatic typesetting would leave the equations in a slightly smaller size; they would grow back to normal as soon as MathJax.typeset() was applied. In the image at right I have superimposed the two states, with the correct, larger image colored red. It looks like Fast Preview has come back to haunt us, but that can’t be right, because Fast Preview has been removed entirely from version 3.

My efforts to solve this mystery turned into quite a debugging debacle. I got a promising clue from an exchange on the MathJax wiki, discussing size anomalies when math is composed inside an HTML element temporarily flagged display: none, a style rule that makes the math invisible. In that circumstance MathJax has no information about the surrounding text, and so it leaves the typeset math in a default state. The same mechanism might account for what I was seeing—except that my page has no elements with a display: none style.

I first observed this problem in the Chrome browser, where it is intermittent; when I repeatedly reloaded the page, the small type would appear about one time out of five. What fun! It takes multiple trials just to know whether an attempted fix has had any effect. Thus I was pleased to discover that in Firefox the shrunken type appears consistently, every time the page is loaded. Testing became a great deal easier.

I soon found a cure, though not a diagnosis. While browsing again in the MathJax issues archive and in a MathJax user forum, I came across suggestions to try a different form of output, with mathematical expressions constructed not from text elements in HTML and style rules in CSS but from paths drawn in Scalable Vector Graphics, or SVG. I found that the SVG expressions were stable and consistent in size, and in other respects indistinguishable from their HTML siblings. Again my problem was solved, but I still wanted to know the underlying cause.

Here’s where the troubleshooting report gets a little embarrassing. Thinking I might have a new bug to report, I set out to build a minimal exemplar—the smallest and simplest program that would trigger the bug. I failed. I was starting from a blank page and adding more and more elements of the original program—divs nested inside divs in the HTML, various stylesheet rules in the CSS, bigger collections of more complex equations—but none of these additions produced the slightest glitch in typesetting. So I tried working in the other direction, starting with the complex misbehaving program and stripping away elements until the problem disappeared. But it didn’t disappear, even when I reduced the page to a single equation in a plain white box.

As often happens, I found the answer not by banging my head against the problem but by going for a walk. Out in the fresh air, I finally noticed the one oddity that distinguished the failing program from all of the correctly working ones. Because the Dotster program began life embedded in a WordPress blog post, I could not include a link to the CSS stylesheet in the head section of the HTML file. Instead, a JavaScript function constructed the link and inserted it into the head. That happened after MathJax made its initial pass over the text. At the time of typesetting, the elements in which the equations were placed had no styles applied, and so MathJax had no way of determining appropriate sizes.
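One possible remedy for this ordering problem is to wait for the injected stylesheet to finish loading and only then ask MathJax to typeset. The sketch below is an assumption about how that could be done, not a description of the fix used on the Dotster page. The function name `loadStylesheetThenTypeset` is hypothetical, and `doc` and `mj` stand in for the browser's `document` and the MathJax version 3 global.

```javascript
// Hypothetical sketch: inject a stylesheet link, and typeset the math only
// after the stylesheet has loaded, so that font sizes are known.
// `doc` is assumed to be the browser's document; `mj` the MathJax v3 global.
function loadStylesheetThenTypeset(doc, mj, href) {
  const link = doc.createElement("link");
  link.rel = "stylesheet";
  link.href = href;
  // The load event fires once the stylesheet has been fetched and applied.
  link.addEventListener("load", () => mj.typeset());
  doc.head.appendChild(link);
  return link;
}
```

If MathJax has already made its automatic first pass by the time the stylesheet arrives, the extra `typeset()` call simply redoes the layout with the styles in place, which matches the observation that the equations returned to normal size once `MathJax.typeset()` was applied.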

When Don Knuth unveiled TeX, circa 1980, I was amazed. Back then, typewriter-style word processing was impressive enough. TeX did much more: real typesetting, with multiple fonts (which Knuth also had to create from scratch), automatic hyphenation and justification, and beautiful mathematics.

Thirty years later, when Cervone created MathJax, I was amazed again—though perhaps not for the right reasons. I had supposed that the major programming challenge would be capturing all the finicky rules and heuristics for building up math expressions—placing and sizing superscripts, adjusting the height and width of parentheses or a radical sign to match the dimensions of the expression enclosed, spacing and aligning the elements of a matrix. Those are indeed nontrivial tasks, but they are just the beginning. My recent adventures have helped me see that another major challenge is making TeX work in an alien environment.

In classic TeX, the module that typesets equations has direct access to everything it might ever need to know about the surrounding text—type sizes, line spacing, column width, the amount of interword “glue” needed to justify a line of type. Sharing this information is easy because all the formatting is done by the same program. MathJax faces a different situation. Formatting duties are split, with MathJax handling mathematical content but the browser’s layout engine doing everything else. Indeed, the document is written in two different languages, TeX for the math and HTML/CSS for the rest. Coordinating actions in the two realms is not straightforward.

There are other complications of importing TeX into a web page. The classic TeX system runs in batch mode. It takes some inputs, produces its output, and then quits. Batch processing would not offer a pleasant experience in a web browser. The entire user interface (such as the buttons and sliders in my Dotster programs) would be frozen for the duration. To avoid this kind of rudeness to the user, MathJax is never allowed to monopolize JavaScript’s single thread of execution for more than a fraction of a second. To ensure this cooperative behavior, earlier versions relied on a hand-built scheme of queues (where procedures wait their turn to execute) and callbacks (which signal when a task is complete). Version 3 takes advantage of a new JavaScript construct called a promise. When a procedure cannot compute a result immediately, it hands out a promise, which it then redeems when the result becomes available.
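To make the idea of a promise concrete, here is a minimal sketch of how typesetting requests could be run one after another without blocking the page. It assumes the version 3 call `MathJax.typesetPromise()`, the promise-returning counterpart of `MathJax.typeset()`. The helper name `makeTypesetQueue` is hypothetical, and the MathJax object is passed in as `mj` so the sketch can be exercised outside a browser.

```javascript
// Hypothetical sketch: serialize typeset requests by chaining promises.
// `mj` is assumed to be the MathJax v3 global, whose typesetPromise()
// returns a promise that is redeemed when typesetting is finished.
function makeTypesetQueue(mj) {
  let chain = Promise.resolve();
  return function enqueue(elements) {
    // Start this request only after every earlier one has settled.
    const result = chain.then(() => mj.typesetPromise(elements));
    // Keep the chain usable even if this particular request fails.
    chain = result.catch(() => {});
    return result;
  };
}
```

Each call to `enqueue` returns immediately with a promise, so a slider handler can fire as often as it likes while the browser stays responsive between typesetting passes.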

Wait, there’s more! MathJax is not just a TeX system. It also accepts input written in MathML, a dialect of XML specialized for mathematical notation. Indeed, the internal language of MathJax is based on MathML. And MathJax can also be configured to handle AsciiMath, a cute markup language that aims to make even the raw form of an expression readable. Think of it as math with emoticons: Type `oo` and you’ll get $\infty$, or `-:` for $\div$.

MathJax also provides an extensive suite of tools for accessibility. Visually impaired readers can have an equation read aloud. As I learned at the January Joint Math Meetings, there are even provisions for generating Braille output—but that’s a subject that deserves a post of its own.

When I first encountered MathJax, I saw it as a marvel, but I also considered it a workaround or stopgap. Reading a short document that includes a single equation entails downloading the entire MathJax program, which can be much larger than the document itself. And you need to download it all again for every other mathy document (unless your browser cache hangs onto a copy). What an appalling waste of bandwidth.

Several alternatives seemed more promising as a long-term solution. The best approach, it seemed to me then, was to have support for mathematical notation built into the browser. Modern browsers handle images, audio, video, SVG, animations—why not math? But it hasn’t happened. Firefox and Safari have limited support for MathML; none of the browsers I know are equipped to deal with TeX.

Another strategy that once seemed promising was the browser plugin. A plugin could offer the same capabilities as MathJax, but you would download and install it only once. This sounds like a good deal for readers, but it’s not so attractive for the author of web content. If there are multiple plugins in circulation, they are sure to have quirks, and you need to accommodate all of them. Furthermore, you need some sort of fallback plan for those who have not installed a plugin.

Still another option is to run MathJax on the server, rather than sending the whole program to the browser. The document arrives with TeX or MathML already converted to HTML/CSS or SVG for display. This is the preferred modus operandi for several large websites, most notably Wikipedia. I’ve considered it for bit-player, but it has a drawback: Running on the server, MathJax cannot provide the kind of on-demand typesetting seen in the demos above.

As the years go by, I am coming around to the view that MathJax is not just a useful stopgap while we wait for the right thing to come along; it’s quite a good approximation to the right thing. As the author of a web page, I get to write mathematics in a familiar and well-tested notation, and I can expect that any reader with an up-to-date browser will see output that’s much like what I see on my own screen. At the same time, the reader also has control over how the math is rendered, via the context menu. And the program offers accessibility features that I could never match on my own.

To top it off, the software is open-source—freely available to everyone. That is not just an economic advantage but also a social one. The project has a community that stands ready to fix bugs, listen to suggestions and complaints, offer help and advice. Without that resource, I would still be struggling with the hitches and hiccups described above.

We Gather Together

Brian Hayes — Fri, 06 Mar 2020 17:23:35 +0000

In January I went to the Joint Mathematics Meetings, which were held in Denver for the first time ever. The main venue was the Colorado Convention Center, a building whose roof area, by my rough estimate, is well above a million square feet. Inside I found acres of patterned carpet, enough ~~folding~~ stacking chairs to hold thousands of bottoms (my fact checker B.C. points out that in general they don’t fold), and vast windowless “ballrooms” where no one waltzes (at least not during the math meetings).

The Colorado Convention Center is the dark gray blob at lower right. Based on a perusal of Google Maps, it appears to be the largest single roof in the city of Denver, and it would remain so even if the football stadium (lower left) or the baseball stadium (upper right) had a roof.

Wandering around in these cavernous spaces always leaves me feeling a little disoriented and dislocated. It’s not just that I’m lost, although often enough I am—searching for Lobby D, or Meeting Room 407, or a toilet. I’m also dumbfounded by the very existence of these huge empty boxes, monuments to the human urge to congregate. If you build it, we will come.

It seems every city needs such a place, commensurate with its civic stature or ambitions. It’s no mystery why the cities make the investment. The JMM attracted more than 5,500 mathematicians (plus a few interlopers like me). I would guess we each spent on the order of $1,000 in payments to hotels, restaurants, taxis, and such, and perhaps as much again on airfare and registration fees. The revenue flowing to the city and its businesses and citizens must be well above $5 million. Furthermore, from the city’s point of view it’s all free money; the visitors do not send their children to the local schools or add to the burden on other city services, and they don’t vote in Denver.

However, this calculation tells only half the story. Although visitors to the Colorado Convention Center leave wads of cash in Denver, at the same time Denver residents are flying off to meetings elsewhere, withdrawing funds from the local economy and spreading the money around in Phoenix, Seattle, or Boston. If the convention-going traffic is symmetrical, the exchange will come out even for everyone. So why don’t we all save ourselves a lot of bother—not to mention millions of dollars—and just stay home? From inside the convention center, you may not be able to tell what city you’re in anyway.

Convention centers are not really all alike—not as much as Walmarts or Home Depots. In Denver some large-scale artwork caught my eye. I particularly admire the little red box that serves as a unifying element, like Harold’s purple crayon. Left: Detail of “I Know You Know That I Know,” by Sandra Fettingis. Right: Detail of “The Heavy is the Root of the Light,” by Mindy Bray.

While I was in Denver, I looked at the schedule of upcoming events for the convention center. A boat show was getting underway even as the mathematicians were still roaming the corridors, and tickets were also on sale for some sort of motorcycling event. The drillers and frackers were coming to town a few weeks later, and then in March the American Physical Society would hold its biggest annual gathering, with about twice as many participants as the JMM. The APS meeting was scheduled for this week, Monday through Friday (March 2–6). But late last Saturday night the organizers decided to cancel the entire conference because of the coronavirus threat. Some attendees were already in Denver or on their way.

I was taken aback by this decision, which is not to say I believe it was wrong. A year from now, if the world is still recovering from an epidemic that killed many thousands, the decisionmakers at the APS will be seen as prescient, prudent, and public-spirited. On the other hand, if Covid-19 sputters out in a few weeks, they may well be mocked as alarmists who succumbed to panic. But the latter judgment would be a little unfair. After all, the virus might be halted precisely because those 11,000 physicists stayed home.

I have not yet heard of other large scientific conferences shutting down, but a number of meetings in the tech industry have been called off, postponed, or gone virtual, along with some sports and entertainment events. The American Chemical Society is “monitoring developments” in advance of their big annual meeting, scheduled for later this month in Philadelphia. [Update: On March 9 the ACS announced "we are cancelling (terminating) the ACS Spring 2020 National Meeting & Expo."] Even if the events go on, some prospective participants will not be able to attend. I’ve just received an email from Harvard with stern warnings and restrictions on university-related travel.

Presumably, the Covid-19 threat will run its course and dissipate, and life will return to something called normal. But it’s also possible we have a new normal, that we have crossed some sort of demographic or epidemiological threshold, and novel pathogens will be showing up more frequently. Furthermore, the biohazard is not the only reason to question the future of megameetings; the ecohazard may be even more compelling. Flying 11,000 physicists to Denver burns 1,200 tonnes of fuel and injects 3,800 tonnes of carbon dioxide into the atmosphere. (My guesstimate is based on numbers from carbonindependent.org. I assume the average attendee flies 3,000 kilometers round trip on a 737-400 aircraft.) R. R. Wilson, the founder of Fermilab, once declared that the most important invention for the progress of modern science was the Boeing 707. But that invention is now looking like part of the problem.
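As a rough check on scale, the totals quoted here can be reduced to per-person figures, using only the numbers already given (11,000 attendees, 1,200 tonnes of fuel, 3,800 tonnes of carbon dioxide, and an assumed 3,000 kilometers per round trip):

\[\frac{1{,}200\ \text{t}}{11{,}000} \approx 109\ \text{kg of fuel per attendee}, \qquad \frac{3{,}800\ \text{t}}{11{,}000} \approx 345\ \text{kg of CO}_2\ \text{per attendee}\]

Spread over 3,000 kilometers, that comes to roughly 36 grams of fuel per passenger-kilometer.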

All in all, it seems an apt moment to reflect on the human urge to come together in these large, temporary encampments, where we share ideas, opinions, news, gossip—and perhaps viruses—before packing up and going home until next year. Can the custom be sustained? If not, what might replace it?

Mathematicians and physicists have not always formed roving hordes to plunder defenseless cities. Until the 20th century there weren’t enough of them to make a respectable motorcycle gang. Furthermore, they had no motorcycles, or any other way to travel long distances in a reasonable time.

Before the airplane and the railroad, meetings between scientists were generally one-on-one. Consider the sad story of Niels Henrik Abel, a young Norwegian mathematician in the 1820s. Feeling cut off from his European colleagues, he undertook a two-year-long trek from Oslo to Berlin and Paris, traveling almost entirely on foot. In Paris he visited Legendre and Cauchy, who received him coolly and did not read his proof of the unsolvability of quintic equations. So Abel walked home again. Somewhere along the way he picked up a case of tuberculosis and died two years later, at age 26, impoverished and probably unaware that his work was finally beginning to be noticed. I like to think the outcome would have been happier if he’d been able to present his results in a contributed-paper session at the JMM.

For Abel, the take-a-hike model of scholarly communication proved ineffective; perhaps more important, it doesn’t scale well. If everyone must make individual tête-à-tête visits, then forming connections between $n$ scientists would require $n(n - 1) / 2$ trips. Having everyone converge at a central point reduces the number to $n$. From this point of view, the modern mass meeting looks not like a travel extravagance but like a strategy for minimizing total air miles. Still, staying home would be even more frugal, whether the cost is measured in dollars, kelvins, or epidemiological risk.
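To put numbers on the comparison, take the roughly 5,500 attendees of the recent JMM as $n$ (an illustrative choice, with one trip counted per pair of scientists in the first case and one trip per scientist in the second):

\[\frac{n(n - 1)}{2} = \frac{5500 \times 5499}{2} = 15{,}122{,}250 \qquad \text{versus} \qquad n = 5500\]

Pairwise visits would require about 15 million trips, against 5,500 for a single gathering.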

Most of the big disciplinary conferences got their start toward the end of the 19th century, and by the 1930s and 40s had hundreds of participants. Writing about mathematical life in that era, Ralph Boas notes: “One reason for going to meetings was that photocopying hadn’t been invented; it was at meetings that one found out what was going on.” But now photocopying has been invented—and superseded. There’s no need for a cross-country trip to find out what’s new; on any weekday morning you can just check the arXiv. Yet attendance at these meetings is up by another order of magnitude.

Even in a world with faster channels of communication, there are still moments of high excitement in the big convention halls. At the 1987 March meeting of the APS, the recent discovery of high-temperature superconductivity in cuprate ceramics was presented and discussed in a lively session that lasted past 3 a.m. The event is known as the Woodstock of Physics. I missed it—as well as the original Woodstock. But I was at the JMM in 2014 when progress toward confirming the twin prime conjecture caused a big stir. The conjecture (still unproved) says there are infinitely many pairs of prime numbers, such as 11 and 13, separated by exactly 2. Yitang Zhang had just proved there are infinitely many pairs of primes separated by no more than 70 million. Several talks discussed this finding and followup work by others, and Zhang himself spoke to a packed room.

Yitang Zhang and his audience.

Boas emphasized the motive of hearing what’s new, but one must not ignore the equally important impulse to tell what’s new. At the recent JMM, with its 5,500 visitors, the book of abstracts listed 2,529 presentations. In other words, almost half the visitors came to deliver a talk, which is probably a stronger motivation than hearing what others have to say. (When I first saw those numbers, I had the thought: “So, on average every presentation had one speaker and one listener.” The truth is not quite as bad as that, but it’s still worth keeping in mind that a meeting of this kind is not like a rock concert or a football game, with only a dozen or so performers and thousands in the audience.)

At some gatherings, the aim is not so much to talk about math and science but to do it. Groups of three or four huddle around blackboards or whiteboards, collaborating. But this activity is commoner at small, narrowly focused meetings—maybe at Aspen for the physicists or Banff for the mathematicians. No doubt such things also happen at the bigger meetings, but they are not a major item on the agenda for most attendees.

For one subpopulation of meeting-goers the main motivation is very practical: getting a job. Again this is a matter of efficiency. Someone looking for a postdoc position can arrange a dozen interviews at a single meeting.

There are many reasons to make the pilgrimage to the Colorado Convention Center, but I think the most important factor is yet to be stated. Dennis Flanagan, who was my employer, friend, and mentor many years ago at Scientific American, wrote that “science is intensely social.”

In an active scientific discipline everyone knows everyone else, if not in person, then by their writings and reputation. Scientists attend at least as many meetings and conventions as salesmen. (Flanagan’s Version, 1988, p. 15.)

You might interpret this comment as saying that scientists—like salesmen—are a bunch of genial, gregarious party animals who like to go out on the town, drink to excess, and misbehave. But I’m pretty sure that’s not what Dennis had in mind. He was arguing that social interactions are essential to the process of science. Becoming a mathematician or a physicist is tantamount to joining a club, and you can’t do that in isolation. You have to absorb the customs, the tastes, the values of the culture. For example, you need to internalize the community standard for deciding what is true. (It’s rather different in physics and mathematics.) Even subtler is the standard for deciding what is interesting—what ideas are worth pursuing, what problems are worth solving.

Meetings and conferences are not the only way of inculcating culture; the apprenticeship system known as graduate school is clearly more important overall. Still, discipline-wide gatherings have a role. By their very nature they are more cosmopolitan than any one university department. They acquaint you with the norms of the population but also with the range of variance, and thereby improve the probability that you’ll figure out where you fit in.

The quintessential big-meeting event is running into someone in the hallway whom you see only once a year. You stop and shake hands, or even hug. (In future we’ll bump elbows.) You’re both in a hurry. If you chat too long, you’ll miss the opening sentences of the next talk, which may be the only sentences you’ll understand. So the exchange of words is brief and unlikely to be deep. As I and my cohort grow older, it often amounts to little more than, “Wow. I’m still alive and so are you!” But sometimes it’s worth traveling a thousand miles to get that human validation.

If we have to dispense with such gatherings, science and math will muddle through somehow. We’ll meet more in the sanitary realm of bits and pixels, less in this fraught environment of atoms. We’ll become more hierarchical, with greater emphasis on local meetings and less on national and international ones. The alternatives can be made to work, and the next generation will view them as perfectly natural, if not inevitable. But I’m going to miss the ugly carpet, the uncomfortable folding/stacking chairs, and the ballrooms where nobody dances.

The Teetering Towers of Abstraction

Brian Hayes — Mon, 13 Jan 2020 23:00:40 +0000

Abstraction is an abstraction. You can’t touch it or taste it or photograph it. You can barely talk about it without resorting to metaphors and analogies. Yet this ghostly concept is an essential tool in both mathematics and computer science. Oddly, it seems to inspire quite different feelings and responses in those two fields. I’ve been wondering why.

In mathematics abstraction serves as a kind of stairway to heaven—as well as a test of stamina for those who want to get there. You begin the climb at an early age, at ground level, with things that are not at all abstract. Jelly beans, for example. You learn the important life lesson that if you have five jelly beans and you eat three, you will have only two left. After absorbing this bitter truth, you are invited to climb the stairs of abstraction as far as the first landing, where you replace the tasty tangible jelly beans with sugar-free symbols: $5 - 3 = 2$.

West stairs to Grand View Park, San Francisco, October 2017.

Some years later you reach higher ground. The symbols representing particular numbers give way to the $x$s and $y$s that stand for quantities yet to be determined. They are symbols for symbols. Later still you come to realize that this algebra business is not just about “solving for $x$,” for finding a specific number that corresponds to a specific letter. It’s a magical device that allows you to make blanket statements encompassing all numbers: $x^2 - 1 = (x + 1)(x - 1)$ is true for any value of $x$.

Continuing onward and upward, you learn to manipulate symbolic expressions in various other ways, such as differentiating and integrating them, or constructing functions of functions of functions. Keep climbing the stairs and eventually you’ll be introduced to areas of mathematics that openly boast of their abstractness. There’s abstract algebra, where you build your own collections of numberlike things: groups, fields, rings, vector spaces. Another route up the stairway takes you to category theory, where you’ll find a collection of ideas with the disarming label abstract nonsense.

Cartoon by Ben Orlin, mathwithbaddrawings.com, reprinted under Creative Commons license.

Not everyone is filled with admiration for this Jenga tower of abstractions teetering atop more abstractions. Consider Andrew Wiles’s proof of Fermat’s last theorem, and its reception by the public. The theorem, first stated by Pierre de Fermat in the 1630s, makes a simple claim about powers of integers: If $x, y, z, n$ are all integers greater than $0$, then $x^n + y^n = z^n$ has solutions only if $n \le 2$. The proof of this claim, published in the 1990s, is not nearly so simple. Wiles (with contributions from Richard Taylor) went on a scavenger hunt through much of modern mathematics, collecting a truckload of tools and spare parts needed to make the proof work: elliptic curves, modular forms, Galois groups, functions on the complex plane, L-series. It is truly a tour de force.

Diagram (borrowed from Kenneth A. Ribet and Brian Hayes, “Fermat’s Last Theorem and Modern Arithmetic”) outlines the overall strategy of the Wiles proof. If you had a counterexample to FLT, you could construct an elliptic curve E with certain properties. But the properties deduced on the left and right branches of the diagram turn out to be inconsistent, implying that E does not exist, nor does the counterexample that gave rise to it.

Is all that heavy machinery really needed to prove such an innocent-looking statement? Many people yearn for a simpler and more direct proof, ideally based on methods that would have been available to Fermat himself. (Ken Ribet will be presenting “A 2020 View of Fermat’s Last Theorem” at the Joint Mathematics Meetings later this week. In a preview of the talk, he notes that advances made since 1994 allow a more succinct statement of the proof. But those recent advances are no easier to understand than the original proof.) At least nine attempts to construct an elementary proof have been posted on the arXiv in the past 20 years, and there are lots more elsewhere. I think the sentiment motivating much of this work is, “You shouldn’t be allowed to prove a theorem I care about with methods I don’t understand.” Marilyn vos Savant, the Parade columnist, takes an even more extreme position, arguing that Wiles strayed so far from the subject matter of the theorem as to make his proof invalid. (For a critique of her critique, see Boston and Granville.)

Almost all of this grumbling about illegitimate methods and excess complexity comes from outside the community of research mathematicians. Insiders see the Wiles proof differently. For them, the wide-ranging nature of the proof is actually what’s most important. The main accomplishment, in this view, was cementing a connection between those far-flung areas of mathematics; resolving FLT was just a bonus.

Yet even mathematicians can have misgivings about the intricacy of mathematical arguments and the ever-taller skyscrapers of abstraction. Jeremy Gray, a historian of mathematics, believes anxiety over abstraction was already rising in the 19th century, when mathematics seemed to be “moving away from reality, into worlds of arbitrary dimension, for example, and into the habit of supplanting intuitive concepts (curves that touch, neighboring points, velocity) with an opaque language of mathematical analysis that bought rigor at a high cost in intelligibility.”

For a view of abstraction in contemporary mathematics, we have a vivid image from Piper Harron, a young mathematician who wrote an extraordinarily candid PhD thesis in 2016. (Quite apart from its comments on abstraction, the thesis is well worth reading. It offers alternating sections of “mathsplaining” and “laysplaining.” See also a review in MAA Focus by Adriana Salerno. The thesis was to be published in book form last fall by Birkhäuser, but the book doesn’t seem to be available yet.) The introductory chapter begins, “The hardest part about math is the level of abstraction required.” She goes on to explain:

I like to imagine abstraction (abstractly ha ha ha) as pulling the strings on a marionette. The marionette, being “real life,” is easily accessible. Everyone understands the marionette whether it’s walking or dancing or fighting. We can see it and it makes sense. But watch instead the hands of the puppeteers. Can you look at the hand movements of the puppeteers and know what the marionette is doing?… Imagine it gets worse. Much, much worse. Imagine that the marionettes we see are controlled by marionettoids we don’t see which are in turn controlled by pre-puppeteers which are finally controlled by actual puppeteers.

Keep all those puppetoids in mind. I’ll be coming back to them, but first I want to shift my attention to computer science, where the towers of abstraction are just as tall and teetery, but somehow less scary.

Suppose your computer is about to add two numbers…. No, wait, there’s no need to suppose or imagine. In the orange panel below, type some numbers into the $a$ and $b$ boxes, then press the “+” button to get the sum in box $c$. Now, please describe what’s happening inside the machine as that computation is performed.

You can probably guess that somewhere behind the curtains there’s a fragment of code that looks like c = a + b. And, indeed, that statement appears verbatim in the JavaScript program that’s triggered when you click on the plus button. But if you were to go poking around among the circuit boards under the keyboard of your laptop, you wouldn’t find anything resembling that sequence of symbols. The program statement is a high-level abstraction. If you really want to know what’s going on inside the computing engine, you need to dig deeper—down to something as tangible as a jelly bean.

How about an electron? (In truth, electrons are not so tangible. The proper mental image is not a hard sphere like a BB but a diffuse probability distribution. In other words, the electron itself is an abstraction.) During the computation, clouds of electrons drift through the machine’s circuitry, like swarms of migrating butterflies. Their movements are regulated by the switching action of transistors, and the transistors in turn are controlled by the moving electrons. It is this dance of the electrons that does the arithmetic and produces an answer. Yet it would be madness to describe the evaluation of c = a + b by tracing the motions of all the electrons (perhaps $10^{23}$ of them) through all the transistors (perhaps $10^{11}$).

To understand how electrons are persuaded to do arithmetic for us, we need to introduce a whole sequence of abstractions.

First, step back from the focus on individual electrons, and reformulate the problem in terms of continuous quantities: voltage, current, capacitance, inductance.
Replace the physical transistors, in which voltages and currents change smoothly, with idealized devices that instantly switch from totally off to fully on.
Interpret the two states of a transistor as logical values (true and false) or as numerical values ($1$ and $0$).
Organize groups of transistors into “gates” that carry out basic functions of Boolean logic, such as and, or, and not.
Assemble the gates into larger functional units, including adders, multipliers, comparators, and other components for doing base-$2$ arithmetic.
Build higher-level modules that allow the adders and such to be operated under the control of a program. This is the conceptual level of the instruction-set architecture, defining the basic operation codes (add, shift, jump, etc.) recognized by the computer hardware.
Graduating from hardware to software, design an operating system, a collection of services and interfaces for abstract objects such as files, input and output channels, and concurrent processes.
Create a compiler or interpreter that knows how to translate programming language statements such as c = a + b into sequences of machine instructions and operating-system requests.
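The middle rungs of this ladder, from logical values through gates to an adder, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real chip: the names AND, OR, NOT, XOR, full_adder, and ripple_add are illustrative, the 8-bit width is arbitrary, and Python's own bit operators stand in for the idealized transistors.

```python
# Idealized gates: each takes and returns the logical values 0 or 1.
def AND(a: int, b: int) -> int:
    return a & b


def OR(a: int, b: int) -> int:
    return a | b


def NOT(a: int) -> int:
    return 1 - a


def XOR(a: int, b: int) -> int:
    # Exclusive-or assembled from the three basic gates.
    return AND(OR(a, b), NOT(AND(a, b)))


def full_adder(a: int, b: int, carry_in: int) -> tuple:
    """Add three bits; return (sum_bit, carry_out)."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out


def ripple_add(a: int, b: int, width: int = 8) -> int:
    """Base-2 addition built from full adders; any carry out of the top bit is dropped."""
    carry = 0
    result = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        sum_bit, carry = full_adder(bit_a, bit_b, carry)
        result |= sum_bit << i
    return result


print(ripple_add(5, 3))  # 8
```

Nothing in ripple_add knows how AND, OR, and NOT are realized; swapping in a different implementation of the gates would leave the adder unchanged.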

From the point of view of most programmers, the abstractions listed above represent computational infrastructure: They lie beneath the level where you do most of your thinking—the level where you describe the algorithms and data structures that solve your problem. But computational abstractions are also a tool for building superstructure, for creating new functions beyond what the operating system and the programming language provide. For example, if your programming language handles only numbers drawn from the real number line, you can write procedures for doing arithmetic with complex numbers, such as $3 + 5i$. (Go ahead, try it in the orange box above.) And, in analogy with the mathematical practice of defining functions of functions, we can build compiler compilers and schemes for metaprogramming—programs that act on other programs.
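As a small illustration of building superstructure, here is a sketch of complex arithmetic layered on top of ordinary real arithmetic. Python happens to have a built-in complex type, so the premise of a reals-only language is an assumption made for the sake of the example; the names Complex, c_add, and c_mul are likewise hypothetical.

```python
from typing import Tuple

# A complex number a + bi is represented as the pair (a, b) of real numbers.
Complex = Tuple[float, float]


def c_add(z: Complex, w: Complex) -> Complex:
    """(a + bi) + (c + di) = (a + c) + (b + d)i"""
    return (z[0] + w[0], z[1] + w[1])


def c_mul(z: Complex, w: Complex) -> Complex:
    """(a + bi)(c + di) = (ac - bd) + (ad + bc)i"""
    a, b = z
    c, d = w
    return (a * c - b * d, a * d + b * c)


z = (3, 5)                 # the number 3 + 5i
print(c_add(z, (1, -2)))   # (4, 3)
print(c_mul(z, (3, -5)))   # (34, 0)
```

Every operation inside c_add and c_mul is plain real arithmetic; the complex numbers exist only as a convention about how to read the pairs.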

In both mathematics and computation, rising through the various levels of abstraction gives you a more elevated view of the landscape, with wider scope but less detail. Even if the process is essentially the same in the two fields, however, it doesn’t feel that way, at least to me. In mathematics, abstraction can be a source of anxiety; in computing, it is nothing to be afraid of. In math, you must take care not to tangle the puppet strings; in computing, abstractions are a defense against such confusion. For the mathematician, abstraction is an intellectual challenge; for the programmer, it is an aid to clear thinking.

Why the difference? How can abstraction have such a friendly face in computation and such a stern mien in math? One possible answer is that computation is just plain easier than mathematics. In speaking of “computation,” what I have in mind is the design of algorithms and data structures suitable for a machine we can build out of material components. If you are playing with Turing machines and other toys of theoretical computer science, the game is altogether different. But in my view theoretical computer science is just a funny-looking branch of mathematics. (With apologies to those of my friends who grimace to hear me say it.) Anything that fits into the computer is necessarily discrete and finite. In principle, any computer program could be reduced to a big table mapping all possible inputs to the corresponding outputs. Mathematics is invulnerable to this kind of trivialization by brute force. It has infinities hiding under the bed and lurking behind the closet door, and that’s what makes it both fun and frightening.
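The big-table idea can be made concrete for a deliberately tiny machine. The sketch below assumes 4-bit operands (an arbitrary choice) and precomputes every possible sum; the names WIDTH, ADD_TABLE, and table_add are illustrative.

```python
from itertools import product

WIDTH = 4            # assumed operand size in bits
SIZE = 1 << WIDTH    # 16 possible values per operand

# One table entry for every possible pair of inputs.
ADD_TABLE = {(a, b): (a + b) % SIZE for a, b in product(range(SIZE), repeat=2)}


def table_add(a: int, b: int) -> int:
    """'Compute' a + b (mod 16) with no arithmetic at all, just a lookup."""
    return ADD_TABLE[(a, b)]


print(len(ADD_TABLE))   # 256 entries
print(table_add(9, 9))  # 2, because 18 wraps around modulo 16
```

The table needs $2^{2w}$ entries for $w$-bit operands, so the reduction is possible only in principle: at 64 bits it would require $2^{128}$ entries. But the table is always finite, which is the point of the argument.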

Another possible explanation is that computer systems are engineered artifacts; we can build them to our own specifications. If a concept is just too hairy for the human mind to master, we can break it down into simpler pieces. Math is not so complaisant—not even for those who hold that mathematical objects are invented rather than discovered. We can’t just design number theory so that the Riemann hypothesis will be true.

But I think the crucial distinction between math abstractions and computer abstractions lies elsewhere. It’s not in the abstractions themselves but in the boundaries between them.

Warning from the abstraction police on the office door of Radhika Nagpal, Harvard University. (Photographed November 2013.)

In building computer systems, we are urged to compartmentalize, to create self-contained and sealed-off modules—black boxes whose inner workings are concealed from outside observers. In this world, information hiding is considered a virtue, not an impeachable offense. If a design has a layered structure, with abstractions piled one atop the other, the layers are separated by abstraction barriers. (I believe I first encountered the term abstraction barrier in Abelson and Sussman’s Structure and Interpretation of Computer Programs, circa 1986. The underlying idea is surely older; it’s implicit in the “structured programming” literature of the 1960s and 70s. But SICP still offers the clearest and most compelling introduction.) A high-level module can reach across the barrier to make use of procedures from lower levels, but it won’t know anything about the implementation of those procedures. When you are writing programs in Lisp or Python, you shouldn’t need to think about how the operating system carries out its chores; and when you’re writing routines for the operating system, you needn’t think about the physics of electrons meandering through the crystal lattice of a semiconductor. Each level of the hierarchy can be treated (almost) independently.

Mathematics also has its abstraction barriers, although I’ve never actually heard the term used by mathematicians. A notable example comes from Giuseppe Peano’s formulation of the foundations of arithmetic, circa 1900. Peano posits the existence of a number $0$, and a function called successor, $S(n)$, which takes a number $n$ and returns the next number in the counting sequence. Thus the natural numbers begin $0, S(0), S(S(0)), S(S(S(0)))$, and so on. Peano deliberately refrains from saying anything more about what these numbers look like or how they work. They might be implemented as sets, with $0$ being the empty set and successor the operation of adjoining an element to a set. Or they could be unary lists: (), (|), (||), (|||), . . . The most direct approach is to use Church numerals, in which the successor function itself serves as a counting token, and the number $n$ is represented by $n$ nested applications of $S$.

From these minimalist axioms we can define the rest of arithmetic, starting with addition. In calculating $a + b$, if $b$ happens to be $0$, the problem is solved: $a + 0 = a$. If $b$ is not $0$, then it must be the successor of some number, which we can call $c$. Then $a + S(c) = S(a + c)$. Notice that this definition doesn’t depend in any way on how the number $0$ and the successor function are represented or implemented. Under the hood, we might be working with sets or lists or abacus beads; it makes no difference. An abstraction barrier separates the levels. From addition you can go on to define multiplication, and then exponentiation, and again abstraction barriers protect you from the lower-level details. There’s never any need to think about how the successor function works, just as the computer programmer doesn’t think about the flow of electrons.
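The independence from representation can be checked with a short experiment. In this sketch the same add and mul procedures run unchanged on two different encodings of the natural numbers. Everything here is an assumption made for illustration: the Naturals record, the tally-string and nested-tuple encodings, and in particular the pred helper, which is not one of Peano's primitives but simply recovers the number $c$ from $S(c)$.

```python
from typing import Any, Callable, NamedTuple


class Naturals(NamedTuple):
    """One representation of the naturals: a zero, a successor, and two helpers."""
    zero: Any
    succ: Callable[[Any], Any]
    is_zero: Callable[[Any], bool]
    pred: Callable[[Any], Any]  # for nonzero n = S(c), returns c


def add(N: Naturals, a: Any, b: Any) -> Any:
    # a + 0 = a;  a + S(c) = S(a + c)
    if N.is_zero(b):
        return a
    return N.succ(add(N, a, N.pred(b)))


def mul(N: Naturals, a: Any, b: Any) -> Any:
    # a * 0 = 0;  a * S(c) = (a * c) + a
    if N.is_zero(b):
        return N.zero
    return add(N, mul(N, a, N.pred(b)), a)


# Representation 1: unary tally strings "", "|", "||", ...
TALLY = Naturals(zero="", succ=lambda n: n + "|",
                 is_zero=lambda n: n == "", pred=lambda n: n[:-1])

# Representation 2: nested tuples (), ((),), (((),),), ...
NESTED = Naturals(zero=(), succ=lambda n: (n,),
                  is_zero=lambda n: n == (), pred=lambda n: n[0])


def from_int(N: Naturals, k: int) -> Any:
    """Convenience for the demo: apply succ to zero k times."""
    n = N.zero
    for _ in range(k):
        n = N.succ(n)
    return n


print(add(TALLY, "||", "|||"))  # |||||
print(add(NESTED, from_int(NESTED, 2), from_int(NESTED, 3)) == from_int(NESTED, 5))  # True
```

The bodies of add and mul never look inside a number; they touch it only through the four operations in the record, which is the abstraction barrier at work.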

The importance of not thinking was stated eloquently by Alfred North Whitehead, more than a century ago:

It is a profoundly erroneous truism, repeated by all copybooks and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilisation advances by extending the number of important operations which we can perform without thinking about them. Operations of thought are like cavalry charges in a battle—they are strictly limited in number, they require fresh horses, and must only be made at decisive moments. (Alfred North Whitehead, An Introduction to Mathematics, 1911, pp. 45–46.)

If all of mathematics were like the Peano axioms, we would have a watertight structure, compartmentalized by lots of leakproof abstraction barriers. And abstraction would probably not be considered “the hardest part about math.” But, of course, Peano described only the tiniest corner of mathematics. We also have the puppet strings.

In Piper Harron’s unsettling vision, the puppeteers high above the stage pull strings that control the pre-puppeteers, who in turn operate the marionettoids, who animate the marionettes. Each of these agents can be taken as representing a level of abstraction. The problem is, we want to follow the action at both the top and the bottom of the hierarchy, and possibly at the middle levels as well. The commands coming down from the puppeteers on high embody the abstract ideas that are needed to build theorems and proofs, but the propositions to be proved lie at the level of the marionettes. There’s no separating these levels; the puppet strings tie them together.

In the case of Fermat’s Last Theorem, you might choose to view the Wiles proof as nothing more than an elevated statement about elliptic curves and modular forms, but the proof is famous for something else—for what it tells us about the elementary equation $x^n + y^n = z^n$. Thus the master puppeteers work at the level of algebraic geometry, but our eyes are on the dancing marionettes of simple number theory. What I’m suggesting, in other words, is that abstraction barriers in mathematics sometimes fail because events on both sides of the barrier make simultaneous claims on our interest.

In computer science, the programmer can ignore the trajectories of the electrons because those details really are of no consequence. Indeed, the electronic guts of the computing machinery could be ripped out and replaced by fluidic devices or fiber optics or hamsters in exercise wheels, and that brain transplant would have no effect on the outcome of the computation. Few areas of mathematics can be so cleanly floated away and rebuilt on a new foundation.

Can this notion of leaky abstraction barriers actually explain why higher mathematics looks so intimidating to most of the human population? It’s surely not the whole story, but maybe it has a role.

In closing I would like to point out an analogy with a few other areas of science, where problems that cross abstraction barriers seem to be particularly difficult. Physics, for example, deals with a vast range of spatial scales. At one end of the spectrum are the quarks and leptons, which rattle around comfortably inside a particle with a radius of $10^{-15}$ meter; at the other end are galaxy clusters spanning $10^{24}$ meters. In most cases, effective abstraction barriers separate these levels. When you’re studying celestial mechanics, you don’t have to think about the atomic composition of the planets. Conversely, if you are looking at the interactions of elementary particles, you are allowed to assume they will behave the same way anywhere in the universe. But there are a few areas where the barriers break down. For example, near a critical point where liquid and gas phases merge into an undifferentiated fluid, forces at all scales from molecular to macroscopic become equally important. Turbulent flow is similar, with whirls upon whirls upon whirls. It’s not a coincidence that critical phenomena and turbulence are notoriously difficult to describe.

Biology also covers a wide swath of territory, from molecules and single cells to whole organisms and ecosystems on a planetary scale. Again, abstraction barriers usually allow the biologist to focus on one realm at a time. To understand a predator-prey system you don’t need to know about the structure of cytochrome c. But the barriers don’t always hold. Evolution spans all these levels. It depends on molecular events (mutations in DNA), and determines the shape and fate of the entire tree of life. We can’t fully grasp what’s going on in the biosphere without keeping all these levels in mind at once.
