I believe the conventional idea of “writing a program” is headed for extinction, and indeed, for all but very specialized applications, most software, as we know it, will be replaced by AI systems that are trained rather than programmed. In situations where one needs a “simple” program (after all, not everything should require a model of hundreds of billions of parameters running on a cluster of GPUs), those programs will, themselves, be generated by an AI rather than coded by hand.

A few weeks later, in an online talk, Welsh broadened his deathwatch. It’s not only the art of programming that’s doddering toward the grave; all of computer science is “doomed.” (The image below is a screen capture from the talk.)

The bearer of these sad tidings does not appear to be grief-stricken over the loss. Although Welsh has made his career as a teacher and practitioner of computer science (at Harvard, Google, Apple, and elsewhere), he seems eager to move on to the next thing. “Writing code sucks anyway!” he declares.

My own reaction to the prospect of a post-programming future is not so cheery. In the first place, I’m skeptical. I don’t believe we have yet crossed the threshold where machines learn to solve interesting computational problems on their own. I don’t believe we’re even close to that point, or necessarily heading in the right direction. Furthermore, if it turns out I’m wrong about all this, my impulse is not to acquiesce but to resist. I, for one, do *not* welcome our new AI overlords. Even if they prove to be better programmers than I am, I’ll hang onto my code editor and compiler all the same, thank you. Programming sucks? For me it has long been a source of pleasure and enlightenment. I find it is also a valuable tool for making sense of the world. I’m never sure I understand an idea until I can reduce it to code. To get the benefit of that learning experience, I have to actually *write* the program, not just say some magic words and summon a genie from Aladdin’s AI lamp.

The idea that a programmable machine might write its own programs has deep roots in the history of computing. Charles Babbage hinted at the possibility as early as 1836, in discussing plans for his Analytical Engine. When Fortran was introduced in 1957, it was officially styled “The FORTRAN Automatic Coding System.” Its stated goal was to allow a computer “to code problems for itself and produce as good programs as human coders (but without the errors).”

Fortran did not abolish the craft of programming (or the errors), but it made the process less tedious. Later languages and other tools brought further improvements. And the dream of fully automatic programming has never died. Machines seem better suited to programming than most people are. Computers are methodical, rule-bound, finicky, and literal-minded—all traits that we associate (rightly or wrongly) with ace programmers.

Ironically, though, the AI systems now poised to take on programming chores are strangely uncomputerlike. Their personality is more Deanna Troi than Commander Data. Logical consistency, cause-and-effect reasoning, and careful attention to detail are not among their strengths. They have moments of uncanny brilliance, when they seem to be thinking deep thoughts, but they are also capable of spectacular failures—blatant and brazen lapses of rationality. They bring to mind the old quip: To err is human, to really foul things up requires a computer.

The latest cohort of AI systems are known as large language models (LLMs). Like most other recent AI inventions, they are built atop an artificial neural network, a multilayer structure inspired by the architecture of the brain. The nodes of the network are analogous to biological neurons, and the connections between nodes play the role of synapses, the junctions where signals pass from one neuron to another. Training the network adjusts the strength, or weight, of the connections. In a language model, the training is done by force-feeding the network huge volumes of text. When the process is complete, the connection weights encode detailed statistics on linguistic features of the training text. In the largest models the number of weights is 100 billion or more.

The term *model* in this context can be misleading. The word is not used in the sense of a scale model, or miniature, like a model airplane. It refers instead to a predictive model, like the mathematical models common in the sciences. Just as a model of the atmosphere predicts tomorrow’s weather, a language model predicts the next word in a sentence.

The best known LLM is ChatGPT, which was released to the public last fall, to great fanfare. The initials *GPT* *Generative Pre-trained Transformer*; the *Chat* version of the system comes equipped with a conversational human interface. ChatGPT was developed by OpenAI, a firm founded in 2015 with the aim of liberating artificial intelligence from the grip of a few wealthy tech companies. OpenAI has succeeded in this mission, so much so that it has become a wealthy tech company.

ChatGPT elicits both admiration and alarm for its way with words, its gift of gab, its easy fluency in English and other languages. The chatbot can imitate famous authors, tell jokes, compose love letters, translate poetry, write junk mail, “help” students with their homework assignments, and fabricate misinformation for political mischief. For better or worse, these linguistic abilities represent a startling technical advance. Computers that once struggled to frame a single intelligible sentence have suddenly become glib wordsmiths. What GPT says may or may not be true, but it’s almost always well-phrased.

Soon after ChatGPT was announced, I was surprised to learn that its linguistic mastery extends to programming languages. It seems the training set for the model included not just multiple natural languages but also quantities of program source code from public repositories such as GitHub. Based on this resource, GPT was able to write new programs on command. I found this surprising because computers are so fussy and unforgiving about their inputs. A human raeder will strive to make sense of an utterance despite small errors such as misspellings, but a computer will barf if it’s given a program with even one stray comma or mismatched parenthesis. A language model, with its underlying statistical or probabilistic nature, seemed unlikely to sustain the required precision for more than a few lines.

I was wrong about this. A key innovation in the new LLMs, known as the attention mechanism, addresses this very issue. When I began my own experiments with ChatGPT, I soon found that it can indeed produce programs free of careless syntax errors.

But other problems cropped up.

When you sit down to chat with a machine, you immediately face the awkward question: “What shall we talk about?” I was looking for a subject that would serve as a fair measure of ChatGPT’s programming prowess. I wanted a problem that can be solved by computational means, but that doesn’t call for doing much arithmetic, which is known to be one of the weak points of LLMs. I settled on a word puzzle invented 150 years ago by Lewis Carroll and analyzed in depth by Donald E. Knuth in the 1990s.

In the transcript below, my side of each exchange is labeled BR; the rosette , which is an OpenAI logo, designates ChatGPT’s responses.

As I watched these sentences unfurl on the screen—the chatbot types them out letter by letter, somewhat erratically, as if pausing to gather its thoughts—I was immediately struck by the system’s facility with the English language. In plain, sturdy prose, GPT catalogs all the essential features of a word ladder: It’s a game or puzzle, you proceed from word to word by changing one letter at a time, each rung of the ladder must be an English word, the aim is to find the shortest possible sequence from the starting word to the target word. I couldn’t explain it better myself. Most helpful of all is the worked example of `COLD -> WARM`

.

It’s not just the individual sentences that give an impression of linguistic competence. The sentences are organized in paragraphs, and the paragraphs are strung together to form a coherent discourse. Bravo!

Also remarkable is the bot’s ability to handle vague and slapdash inputs. My initial query was phrased as a yes-or-no question, but ChatGPT correctly interpreted it as a request: “Tell me what you know about word ladders.” My second instruction neglects to include any typographic hints showing that *lead* and *gold* are to be understood as words, not metals. The chatbot would have been within its rights to offer me an alchemical recipe, but instead it supplied the missing quotation marks.

Setting aside all this linguistic and rhetorical sophistication, however, what I really wanted to test was the program’s ability to solve word-ladder problems. The two examples in the transcript above can both be found on the web, and so they might well have appeared in ChatGPT’s training data. In other words, the LLM may have memorized the solutions rather than constructed them. So I submitted a tougher assignment:

At first glance, it seems ChatGPT has triumphed again, solving an instance of the puzzle that I’m pretty sure it has *not* seen previously. But look closer. `MARSH -> MARIS`

requires two letter substitutions, and so does `PARIL -> APRIL`

. The status of `MARIS`

and `PARIL`

as “valid words” might also be questioned. I lodged a complaint:

Whoa! The bot offers unctuous confessions and apologies, but the “corrected” word ladder is crazier than ever. It looks like we’re playing Scrabble with Humpty Dumpty, who declares “`APRCHI`

is a word if I say it is!” and then scatters all the tiles.

This is not an isolated, one-of-a-kind failure. All of my attempts to solve word ladders with ChatGPT have gone off the rails, although not always in the same way. In one case I asked for a ladder from `REACH`

to `GRASP`

. The AI savant proposed this solution:

`REACH -> TEACH -> PEACH -> PEACE -> PACES -> PARES -> PARSE -> `

`PARSE -> PARSE -> PARSE -> PARKS -> PARKS -> PARKS -> PARKS -> `

`PARES -> GRASP.`

And there was this one:

`SWORD -> SWARD -> REWARD -> REDRAW -> REDREW -> `

`REDREW -> REDREW -> REDREW -> REDRAW -> REPAID -> `

`REPAIR -> PEACE`

Now we are babbling like a young child just learning to count: “One, two, three, four, three, four, four, four, seven, blue, ten!”

All of the results I have shown so far were recorded with version 3.5 of ChatGPT. I also tried the new-and-improved version 4.0, which was introduced in March. The updated bot exudes the same genial self-assurance, but I’m afraid it also has the same tendency to drift away into oblivious incoherence:

The ladder starts out well, with four steps that follow all the rules. But then the AI mind wanders. From `PLAGE`

to `PAGES`

requires four letter substitutions. Then comes `PASES`

, which is not a word (as far as I know), and in any case is not needed here, since the sequence could go directly from `PAGES`

to `PARES`

. More goofiness ensues. Still, I do appreciate the informative note about `PLAGE`

.

Recently I’ve also had a chance to try out Llama 2, an LLM published by Meta (the Facebook people). Although this model was developed independently of GPT, it seems to share some of the same mental quirks, such as laying down rules and then blithely ignoring them. When I asked for a ladder connecting `REACH`

to `GRASP`

, this is what Llama 2 proposed:

`REACH -> TEACH -> DEACH -> LEACH -> SPEECH -> SEAT -> `

`FEET -> GRASP`

What can I say? I guess a bot’s `REACH`

should exceed its `GRASP`

.

Matt Welsh mentions two modes of operation for a computing system built atop a large language model. So far we’ve been working in what I’ll call oracle mode, where you ask a question and the computer returns an answer. You supply a pair of words, and the system finds a ladder connecting them, doing whatever computing is needed to get there. You deliver a shoebox full of financial records, and the computer fills out your Form 1040. You compile historical climate data, and the computer predicts the global average temperature in 2050.

The alternative to an AI oracle is an AI code monkey. In this second mode, the machine does not directly answer your question or perform your calculation; instead it creates a program that you can then run on a conventional computer. What you get back from the bot is not a word ladder but a program to generate word ladders, written in the programming language of your choice. Instead of a completed tax return you get tax-preparation software; instead of a temperature prediction, a climate model.

Let’s give it a try with ChatGPT 3.5:

Again, a cursory glance at the output suggests a successful performance. ChatGPT seems to be just as fluent in JavaScript as it is in English. It knows the syntax for `if`

, `while`

, and `for`

statements, as well as all the finicky punctuation and bracketing rules. The machine-generated program appears to bring all these components together to accomplish the specified task. Also note the generous sprinkling of explanatory comments, which are surely meant for *our* benefit, not for its. Likewise the descriptive variable names (`currentWord`

, `newWord`

, `ladder`

).

ChatGPT has also, on its own initiative, included instructions for running the program on a specific example (`MARCH`

to `APRIL`

), and it has printed out a result, which matches the answer it gave in our earlier exchange. Was that output generated by actually running the program? ChatGPT does not say so explicitly, but it does make the claim that if you run the program as instructed, you will get the result shown (in all its nonsensical glory).

We can test this claim by loading the program into a web browser or some other JavaScript execution environment. The verdict: Busted! The program *does* run, but it doesn’t yield the stated result. The program’s true output is: `MARCH -> AARCH -> APRCH -> APRIH -> APRIL`

. This sequence is a little less bizarre, in that it abides by the rule of changing only one letter at a time, and all the “words” have exactly five letters. On the other hand, none of the intermediate “words” are to be found in an English dictionary.

There’s an easy algorithm for generating the sequence `MARCH -> AARCH -> `

`APRCH -> APRIH -> APRIL`

. Just step through the start word from left to right, changing the letter at each position so that it matches the corresponding letter in the goal word. Following this rule, *any* pair of five-letter words can be laddered in no more than five steps. `MARCH -> APRIL`

takes only four steps because the `R`

in the middle doesn’t need to be changed. I can’t imagine an easier way to make word ladders—assuming, of course, that you’re willing to let any and every mishmash of letters count as a word.

The program created by ChatGPT could use this quick-and-dirty routine, but instead it does something far more tedious: It constructs all possible ladders whose first rung is the start word, and continues extending those ladders until it stumbles on one that includes the target word. This is a brute-force algorithm of breathtaking wastefulness. Each letter of the start word can be altered in 25 ways. Thus a five-letter word has 125 possible successors. By the time you get to a five-rung ladder, there are 190 million possibilities. (The examples I have presented here, such as `MARCH -> APRIL`

and `REACH -> GRASP`

, have one unchanging letter, and so the solutions require only four steps. Trying to compute full five-step solutions exhausted my patience.)

Let’s try the same code-writing exercise with ChatGPT 4. Given the identical prompt, here’s how the new bot responded:

The program has the same overall structure (a `while`

loop with two nested `for`

loops inside), and it adopts the same algorithmic strategy (generating all strings that differ from a given word at exactly one position). But there’s a big novelty in the GPT-4 version: acknowledgment that a word list is essential. And with this change we finally have some hope of generating a ladder made up of genuine words.

Although GPT-4 recognizes the need for a list, it supplies only a placeholder, namely the 10-word sequence it confected for the `REACH -> GRASP`

example given above. This stub of a word list isn’t good for much—not even for regenerating the bogus `REACH`

-to-`GRASP`

ladder. If you try to do that, the program will report that no ladder exists. This outcome is not incorrect, since the 10 given words do not form a valid pathway that changes just one letter on each step.

Even if the words in the list were well-chosen, a vocabulary of size 10 is awfully puny. Producing a bigger list of words seems like a task that would be easy for a language model. After all, the LLM was trained on a vast corpus of text, in which nearly all English words are likely to appear at least once, and common words would turn up millions of times. Can’t the bot just extract a representative sample of those words? The answer, apparently, is no. Although GPT might be said to have “read” all that text, it has not stored the words in any readily accessible form. (The same is true of a human reader. Can you, by thinking back over a lifetime of reading, list the 10 most common five-letter words in your vocabulary?)

When I asked ChatGPT 4 to produce a word list, it demurred apologetically: “I’m sorry for the confusion, but as an AI developed by OpenAI, I don’t have direct access to a database of words or the ability to fetch data from external sources….” So I tried a bit of trickery, asking the bot to write a 1,000-word story and then sort the story’s words in order of frequency. This ruse worked, but the sample was too small to be of much use. With persistence, I could probably coax an acceptable list from GPT, but instead I took a shortcut. After all, *I* am not an AI developed by OpenAI, and I have access to external resources. I appropriated the list of 5,757 five-letter English words that Knuth compiled for his experiments with word ladders. Equipped with this list, the program written by GPT-4 finds the following nine-step ladder:

`REACH -> PEACH -> PEACE -> PLACE -> PLANE -> `

`PLANS -> GLANS -> GLASS -> GRASS -> GRASP`

This result exactly matches the output of Knuth’s own word-ladder program, which he published 30 years ago in *The Stanford Graphbase*.

At this point I must concede that ChatGPT, with a little outside help, has finally done what I asked of it. It has written a program that can construct a valid word ladder. But I still have reservations. Although the programs written by GPT-4 and by Knuth produce the same output, the programs themselves are not equivalent, or even similar.

Knuth approaches the problem from the opposite direction, starting not with the set of all possible five-letter strings—there are not quite 12 million of them—but with his much smaller list of 5,757 common English words. He then constructs a graph (or network) in which each word is a node, and two nodes are connected by an edge if and only if the corresponding words differ by a single letter. The diagram below shows a fragment of such a graph.

In the graph, a word ladder is a sequence of edges leading from a start node to a goal node. The optimal ladder is the shortest such path, traversing the smallest number of edges. For example, the optimal path from `leash`

to `retch`

is `leash -> leach -> reach -> retch`

, but there are also longer paths, such as `leash -> leach -> beach -> peach -> reach -> retch`

. For finding shortest paths, Knuth adopts an algorithm devised in the 1950s by Edsger W. Dijkstra.

Knuth’s word-ladder program requires an up-front investment to convert a simple list of words into a graph. On the other hand, it avoids the wasteful generation of thousands or millions of five-letter strings that cannot possibly be elements of a word latter. In the course of solving the `REACH -> GRASP`

problem, the GPT-4 program produces 219,180 such strings; only 2,792 of them (a little more than 1 percent) are real words.

If the various word-ladder programs I’ve been describing were student submissions, I would give a failing grade to the versions that have no word list. The GPT-4 program with a list would pass, but I’d give top marks only to the Knuth program, on grounds of both efficiency and elegance.

Why do the chatbots favor the inferior algorithm? You can get a clue just by googling “word-ladder program.” Almost all the results at the top of the list come from websites such as Leetcode, GeeksForGeeks and RosettaCode. These sites, which apparently cater to job applicants and competitors in programming contests, feature solutions that call for generating all 125 single-letter variants of each word, as in the GPT programs. Because sites like these are numerous—there seem to be hundreds of them—they outweigh other sources, such as Knuth’s book (if, indeed, such texts even appear in the training set). Does that mean we should blame the poor choice of algorithm not on GPT but on Leetcode? I would point instead to the unavoidable weakness of a protocol in which the most frequent answer is, by default, the right answer.

A further, related concern nags at me whenever I think about a world in which LLMs are writing all our software. Where will new algorithms come from? An LLM might get creative in remixing elements of existing programs, but I don’t see any way it would invent something wholly new and better.

I admit I’ve gone overboard, torturing ChatGPT with too many variations of one peculiar (and inconsequential) problem. Maybe the LLMs would perform better on other computational tasks. I have tried several, with mixed results. I want to discuss just one of them, where I find ChatGPT’s efforts rather poignant.

Working with ChatGPT 3.5, I ask for the value of the 100th Fibonacci number. Note that my question is phrased in oracle mode; I ask for the number, not for a program to calculate it. Nevertheless, ChatGPT volunteers to write a Fibonacci program, and then presents what purports to be the program’s output.

The algorithm implemented by this program is mathematically correct; it follows directly from the definition of a Fibonacci number as a member of the sequence beginning {0, 1}, with every subsequent element equal to the sum of the previous two terms. The answer given is also correct: 354224848179261915075 is indeed the 100th Fibonacci number. So what’s the problem? It’s the middle statement: “When you run this code, it will output the 100th Fibonacci number.” That’s not true. If you run the code, what you’ll get back is the incorrect value 354224848179262000000. `BigInt`

data type that solves this problem, but `BigInts`

must be specified explicitly, and the ChatGPT program does not do so.

Giving the same task to ChatGPT 4.0 takes us on an even stranger journey. In the following interaction I had activated Code Interpreter, a ChatGPT plugin that allows the system to test and run some of the code it writes. Apparently making use of this facility, the bot first proposes a program that fails for unknown reasons:

Here ChatGPT is writing in Python, which is the main programming language supported by Code Interpreter. The first attempt at writing a program is based on exponentiation of the Fibonacci matrix:

$$\mathrm{F}_1=\left[\begin{array}{ll}

1 & 1 \\

1 & 0

\end{array}\right].$$

This is a well-known and efficient method, and the program implements it correctly. For mysterious reasons, however, Code Interpreter fails to execute the program. (The code runs just fine in a standard Python environment, and returns correct answers.)

At this point the bot pivots and takes off in a wholly new direction, proposing to calculate the desired Fibonacci value via the mathematical identity known as Binet’s formula. It gets as far as writing out the mathematical expression, but then has another change of mind. It correctly foresees problems with numerical precision: The formula would yield an exact result if it were given an exact value for the square root of 5, but that’s not feasible.

And so now ChatGPT takes yet another tack, resorting to the same iterative algorithm used by version 3.5. This time we get a correct answer, because Python (unlike JavaScript) has no trouble coping with large integers.

I’m impressed by this performance, not just by the correct answer but also by the system’s courageous perseverance. ChatGPT soldiers on even as it flails about, puzzled by unexpected difficulties but refusing to give up. “Hmm, that matrix method should have worked. But, no matter, we’ll try Binet’s formula instead… Oh, wait, I forgot… Anyway, there’s no need to be so fancy-pants about this. Let’s just do it the obvious, plodding way.” This strikes me as a very *human* approach to problem-solving. It’s weird to see such behavior in a machine.

My little experiments have left me skeptical of claims that AI oracles and AI code monkeys are about to elbow aside human programmers. I’ve seen some successes, but more failures. And this dismal record was compiled on relatively simple computational tasks for which solutions are well known and widely published.

Others have conducted much broader and deeper evaluations of LLM code generation. In the bibliography at the end of this article I’ve listed five such studies. I’d like to briefly summarize a few of the results they report.

Two years ago, Mark Chen and a roster of 50+ colleagues at OpenAI devoted much effort to measuring the accuracy of Codex, an offshoot of ChatGPT 3 specialized for writing program code. (Codex has since become the engine driving GitHub Copilot, a “programmer’s assistant.”) Chen *et al.* created a suite of 164 tasks to be accomplished by writing Python programs. The tasks were mainly of the sort found in textbook exercises, programming contests, and the (strangely voluminous) literature on how to ace an interview for a coding job. Most of the tasks can be accomplished with just a few lines of code. Examples: Count the vowels in a given word, determine whether an integer is prime or composite.

The Chen group also gave some thought to defining criteria of success and failure. Because the LLM process is nondeterministic—word choices are based on probabilities—the model might generate a defective program on the first try but eventually come up with a correct response if it’s allowed to keep trying. A parameter called temperature controls the amount of nondeterminism. At zero temperature the model invariably chooses the most probable word at every step; as the temperature rises, randomness is introduced, allowing less probable words to be selected. Chen *et al.* took account of this potential for variation by adopting three benchmarks of success:

pass@1: | the LLM generates a correct program on the first try |

pass@10: | at least one of 10 generated programs is correct |

pass@100: | at least one of 100 generated programs is correct |

Pass@1 tests were conducted at zero temperature, so that the model would always give its best-guess result. The pass@10 and pass@100 trials were done at higher temperatures, allowing the system to explore a wider variety of potential solutions.

The authors evaluated several versions of Codex on all 164 tasks. For the largest and most capable version of Codex, the pass@1 rate was about 29 percent, the pass@10 rate 47 percent, and pass@100 reached 72 percent. Looking at these numbers, should we be impressed or appalled? Is it cause for celebration that Codex gets it right on the first try almost a third of the time (when the temperature is set to zero)? Or that the success ratio climbs to almost three-fourths if you’re willing to sift through 100 proposed programs looking for one that’s correct? My personal view is this: If you view the current generation of LLMs as pioneering efforts in a long-term research program, then the results are encouraging. But if you consider this technology as an immediate replacement for hand-coded software, it’s quite hopeless. We’re nowhere near the necessary level of reliability.

Other studies have yielded broadly similar results. Federico Cassano *et al.* assess the performance of several LLMs generating code in a variety of programming languages; they report a wide range of pass@1 rates, but only two exceed 50 percent. Alessio Buscemi tested ChatGPT 3.5 on 40 coding tasks, asking for programs written in 10 languages, and repeating each query 10 times. Of the 4,000 trials, 1,833 resulted in code that could be compiled and executed. Zhijie Liu *et al.* based their evaluation of ChatGPT on problems published on the Leetcode website. The results were judged by submitting the generated code to the automated Leetcode grading process. The acceptance rate, averaged across all problems, ranged from 31 percent for programs written in C to 50 percent for Python programs. Liu *et al.* make another interesting observation: ChatGPT scores much worse on problems published after September 2021, which was the cutoff date for the GPT training set. They conjecture that the bot may do better with the earlier problems because it already saw the solutions during training.

A very recent paper by Li Zhong and Zilong Wang goes beyond the basic question of program correctness to consider robustness and reliability. Can the generated program respond appropriately to malformed input or to external errors, as when trying to open a file that doesn’t exist? Even when the prompt to the LLM included an example showing how to handle such issues correctly, Zhong and Wang found that the generated code failed to do so in 30 to 50 percent of cases.

Beyond these discouraging results, I have further misgivings of my own. Almost all the tests were conducted with short code snippets. An LLM that stumbles when writing a 10-line program is likely to have much greater difficulty with 100 lines or 1,000. Also, simple pass/fail grading is a very crude measure of code quality. Consider the primality test that was part of the Chen group’s benchmarking suite. Here is one of the Codex-written programs:

```
def is_prime(n):
prime = True
if n==1:
return False
for i in range(2, n):
if n % i == 0:
prime = False
return prime
```

This code is scored as correct—as it should be, since it will never misclassify a prime as composite, or vice versa. However, you may not have the patience—or the lifespan—to wait for a verdict when \(n\) is large. The algorithm attempts to divide \(n\) by every integer from 2 to \(n-1\).

It’s still early days for large language models. ChatGPT was published less than a year ago; the underlying technology is only about six years old. Although I feel pretty sure of myself declaring that LLMs are not yet ready to conquer the world of coding, I can’t be quite so confident predicting that they never will be. The models will surely improve, and we’ll get better at using them. Already there is a budding industry offering guidance on “prompt engineering” as a way to get the most out of every query.

Another way to bolster the performance of an LLM might be to form a hybrid with another computational system, one equipped with tools for logic and reasoning rather than purely verbal analysis. Doug Lenat, just before his recent death, proposed combining an LLM with Cyc, a huge database of common knowledge that he had labored to build over four decades. And Stephen Wolfram is working to integrate ChatGPT into Wolfram|Alpha, an online collection of curated data and algorithms.

Still, some of the obstacles that trip up LLM program generators look difficult to overcome.

Language models work their magic through simple means: In the course of composing a sentence or a paragraph, the LLM chooses its next word based on the words that have come before. It’s like writing a text message on your phone: You type “I’ll see you…” and the software suggests a few alternative continuations: “tomorrow,” “soon,” “later.” In the LLM each candidate is assigned a probability, calculated from an analysis of all the text in the model’s training set.

The idea of text generation through this kind of statistical analysis was first explored more than a century ago by the Russian mathematician A. A. Markov. His procedure is now known as an *n*-gram model, where *n* is the number of words (or characters, or other symbols) to be considered in choosing the next element of the sequence. I have long been fascinated by the *n*-gram process, though mostly for its comedic possibilities. (In an essay published 40 years ago, I referred to it as “the fine art of turning literature into drivel.”)

Of course ChatGPT and the other recent LLMs are not merely *n*-gram models. Their neural networks capture statistical features of language that go well beyond a sequence of *n* contiguous symbols. Of particular importance is the attention mechanism, which is capable of tracking dependencies between selected symbols at arbitrary distances. In natural languages this device is useful for maintaining agreement of subject and verb, or for associating a pronoun with its referent. In a programming language, the attention mechanism ensures the integrity of multipart syntactic constructs such as `if... then... else`

, and it keeps brackets properly paired and nested.

ChatGPT and other LLMs also benefit from reinforcement learning supervised by human readers. When a reader rates the quality and accuracy of the model’s output, the positive or negative feedback helps shape future responses.

Even with these refinements, however, an LLM remains, at bottom, a device for constructing a new text based on probabilities of word occurrence in existing texts. To my way of thinking, that’s not thinking. It’s something shallower, focused on words rather than ideas. Given this crude mechanism, I am both amazed and perplexed at how much the LLMs can accomplish.

For decades, architects of AI believed that true intelligence (whether natural or artificial) requires a mental model of the world. To make sense of what’s going on around you (and inside you), you need intuition about how things work, how they fit together, what happens next, cause and effect. Lenat insisted that the most important kinds of knowledge are those you acquire long before you start reading books. You learn about gravity by falling down. You learn about entropy when you find that a tower of blocks is easy to knock over but harder to rebuild. You learn about pain and fear and hunger and love—all this in infancy, before language begins to take root. Experiences of this kind are unavailable to a brain in a box, with no direct access to the physical or the social universe.

LLMs appear to be the refutation of these ideas. After all, they are models of language, not models of the world. They have no *embodiment*, no physical presence that would allow them to learn via the school of hard knocks. Ignorant of everything but mere words, how do they manage to sound so smart, so worldly?

On this point opinions differ. Critics of the technology say it’s all fakery and illusion. A celebrated (or notorious?) paper by Emily Bender, Timnit Gebru, and others dubs the models “stochastic parrots.” Although an LLM may speak clearly and fluently, the critics say, it has no idea what it’s talking about. Unlike a child, who learns mappings between words and things—Look! a cow, a cat, a car—the LLM can only associate words with other words. During the training process it observes that *umbrella* often appears in the same context as *rain*, but it has no experience of getting wet. The model’s modus operandi is akin to the formalist approach in mathematics, where you push symbols around on the page, moving them from one side of the equation to the other, without ever asking what the symbols symbolize. To paraphrase Saul Gorn: A formalist can’t understand a theory unless it’s meaningless.

Proponents of LLMs see it differently. One of their stronger arguments is that language models learn syntax, or grammar,

Ilya Sutskever, the chief scientist at OpenAI, made this point forcefully in a conversation with Jensen Huang of Nvidia:

When we train a large neural network to accurately predict the next word, in lots of different texts from the internet, what we are doing is that we are learning a world model… It may look on the surface that we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world. There is a world out there, and it has a projection on this text. And so what the neural network is learning is more and more aspects of the world, of people, of the human conditions, their hopes, dreams and motivations, their interactions and situations that we are in, and the neural network learns a compressed, abstract, usable representation of that.

Am I going to tell you which of these contrasting theories of LLM mentality is correct? I am not. Neither alternative appeals to me. If Bender *et al.* are right, then we must face the fact that a gadget with no capacity to reason or feel, no experience of the material universe or social interactions, no sense of self, can do a passable job of writing college essays, composing rap songs, and giving advice to the lovelorn. Knowledge and logic and emotion count for nothing; glibness is all. It’s a subversive proposition. If ChatGPT can fool us with this mindless showmanship, perhaps we too are impostors whose sound and fury signifies nothing.

On the other hand, if Sutskever is right, then much of what we prize as the human experience—the sense of personhood that slowly evolves as we grow up and make our way through life—can be acquired just by reading gobs of text scraped from the internet. If that’s true, then I didn’t actually have to endure the unspeakable agony that is junior high school; I didn’t have to make all those idiotic blunders that caused such heartache and hardship; there was no need to bruise my ego bumping up against the world. I could have just read about all those things, in the comfort of my armchair; mere words would have brought me to a state of clear-eyed maturity without all the stumbling and suffering through the vale of soul-making.

Both the critics and the defenders of LLMs tend to focus on natural-language communication between the machine and a human conversational partner—the chatty part of ChatGPT. In trying to figure out what’s happening inside the neural network, it might be helpful to pay more attention to the LLM in the role of programmer, where both parties to the conversation are computers. I can offer three arguments in support of this suggestion. First, people are too easy to fool. We make allowances, fill in blanks, silently correct mistakes, supply missing context, and generally bend over backwards to make sense of an utterance, even when it makes no sense. Computers, in contrast, are stern judges, with no tolerance for bullshit.

Second, if we are searching for evidence of a mental model inside the machine, it’s worth noting that a model of a digital computer ought to be simpler than a model of the whole universe. Because a computer is an artifact we design and build to our our own specifications, there’s not much controversy about how it works. Mental models of the universe, in contrast, are all over the map. Some of them begin with a big bang followed by cosmic inflation; some are ruled by deities and demons; some feature an epic battle between east and west, Jedi and Sith, red and blue, us and them. Which of these models should we expect to find imprinted on the great matrix of weights inside GPT? They could all be there, mixed up willy-nilly.

Third, programming languages have an unusual linguistic property that binds them tightly to actions inside the computer. The British philosopher J. L. Austin called attention to a special class of words he designated *performatives*. These are words that don’t just declare or describe or request; they actually *do* something when uttered. Austin’s canonical example is the statement “I do,” which, when spoken in the right context, changes your marital status. In natural language, performative words are very rare, but in computer programs they are the norm. Writing `x = x + 1`

, in the right context, actually causes the value of `x`

to be incremented. That direct connection between words and actions might be helpful when you’re testing whether a conceptual model matches reality.

I remain of two minds (or maybe more than two!) about the status and the implications of large language models for computer science. The AI enthusiasts could be right. The models may take over programming, along with many other kinds of working and learning. Or they may fizzle, as other promising AI innovations have done. I don’t think we’ll have too long to wait before answers begin to emerge.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ?ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. https://arxiv.org/abs/1706.03762.

Amatriain, Xavier. 2023. Transformer models: an introduction and catalog. https://arxiv.org/abs/2302.07730.

Goldberg,Yoav. 2015. A Primer on neural network models for natural language processing. https://arxiv.org/abs/1510.00726.

Carroll, Lewis. 1879. *Doublets: A Word-Puzzle*. London: Macmillan and Co. (Available online at Google Books.)

Dewdney, A. K. 1987. Computer Recreations: Word ladders and a tower of Babel lead to computational heights defying assault. *Scientific American* 257(2):108–111.

Gardner, Martin. 1994. Word Ladders: Lewis Carroll’s Doublets. *Math Horizons*, November, 1994, pp. 18–19.

Knox, John. 1927. *Word-Change Puzzles*. Chicago: Laird and Lee. (Available online at archive.org.)

Knuth, Donald E. 1993. *The Stanford Graphbase: A Platform for Combinatorial Computing*. New York: ACM Press (Addison-Wesley Publishing).

Nabokov, Vladimir. 1962. *Pale Fire*. New York: G. P. Putnam’s Sons. [See the index under "word golf."]

Dowdell, Thomas, and Hongyu Zhang. 2020. Language modelling for source code with Transformer-XL. https://arxiv.org/abs/2007.15813.

Karpathy, Andrej. 2017. Software 2.0 Medium.

Zhang, Shun, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. 2023. Planning with large language models for code generation. https://arxiv.org/abs/2303.05510.

Buscemi, Alessio. 2023. A comparative study of code generation using ChatGPT 3.5 across 10 programming languages. https://arxiv.org/abs/2308.04477.

Cassano, Federico, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, Abhinav Jangda. 2022. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. https://arxiv.org/abs/2208.08227

Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. 2021. Evaluating large language models trained on code. https://arxiv.org/abs/2107.03374.

Liu, Zhijie, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023. No need to lift a finger anymore? Assessing the quality of code generation by ChatGPT. https://arxiv.org/abs/2308.04838.

Zhong, Li, and Zilong Wang. 2023. A study on robustness and reliability of large language model code generation. https://arxiv.org/abs/2308.10335.

Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: early experiments with GPT-4. https://arxiv.org/abs/2303.12712.

Bayless, Jacob. 2023. It’s not just statistics: GPT does reason. https://jbconsulting.substack.com/p/its-not-just-statistics-gpt-4-does

Arkoudas, Konstantine. 2023. GPT-4 can’t reason. https://arxiv.org/abs/2308.03762.

Yiu, Eunice, Eliza Kosoy, and Alison Gopnik. 2023. Imitation versus innovation: What children can do that large language and language-and-vision models cannot (yet)? https://arxiv.org/abs/2305.07666.

Piantadosi, Steven T., and Felix Hill. 2022. Meaning without reference in large language models. https://arxiv.org/abs/2208.02957.

From word models to world models: Translating from natural language to the probabilistic language of thought. 2023. Wong, Lionel, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. https://arxiv.org/abs/2306.12672.

Lenat, Doug. 2008. The voice of the turtle: Whatever happened to AI? *AI Magazine* 29(2):11–22.

Lenat, Doug, and Gary Marcus. 2023. Getting from generative AI to trustworthy AI: What LLMs might learn from Cyc. https://arxiv.org/abs/2308.04445

Wolfram, Stephen. 2023. Wolfram|Alpha as the way to bring computational knowledge superpowers to ChatGPT. Stephen Wolfram Writings. See also Introducing chat notebooks: Integrating LLMs into the notebook paradigm.

Austin, J. L. 1962. *How to Do Things with Words*. Second edition, 1975. Edited by J. O. Urmson and Marina Sbisà. Cambridge, MA: Harvard University Press.

The von Neumann algorithm is known as the middle-square method. You start with an *n*-digit number called the seed, which becomes the first element of the pseudorandom sequence, designated \(x_0\). Squaring \(x_0\) yields a number with 2*n* digits (possibly including leading zeros). Now select the middle *n* digits of the squared value, and call the resulting number \(x_1\). The process can be repeated indefinitely, each time squaring \(x_i\) and taking the middle digits as the value of \(x_{i+1}\). If all goes well, the sequence of \(x\) values *looks* random, with no obvious pattern or ordering principle. The numbers are called *pseudo*random because there is in fact a hidden pattern: The sequence is completely predictable if you know the algorithm and the seed value.

Program 1, below, shows the middle-square method in action, in the case of \(n = 4\). Type a four-digit number into the seed box at the top of the panel, press Go, and you’ll see a sequence of eight-digit numbers, with the middle four digits highlighted. (If you press the Go button without entering a seed, the program will choose a random seed for you—using a pseudorandom-number generator other than the one being demonstrated here!)

(Source code on GitHub.)

Program 1 also reveals the principal failing of the middle-square method. If you’re lucky in your choice of a seed value, you’ll see 40 or 50 or maybe even 80 numbers scroll by. The sequence of four-digit numbers highlighted in blue should provide a reasonably convincing imitation of pure randomness, bouncing around willy-nilly among the integers in the range from 0000 to 9999, with no obvious correlation between a number and its nearby neighbors in the list. But this chaotic-seeming behavior cannot last. Sooner or later—and all too often it’s sooner—the sequence becomes repetitive, cycling through some small set of integers, such as 6100, 2100, 4100, 8100, and back to 6100. Sometimes the cycle consists of a single fixed point, such as 0000 or 2500, repeated monotonously. Once the algorithm has fallen into such a trap, it can never escape. In Program 1, the repeating elements are flagged in red, and the program stops after the first full cycle.

The emergence of cycles and fixed points in this process should not be surprising; as a matter of fact, it’s inevitable. The middle-square function maps a finite set of numbers into itself. The set of four-digit decimal numbers has just \(10^4 = 10{,}000\) elements, so the algorithm cannot possibly run for more than 10,000 steps without repeating itself. What’s disappointing about Program 1 is not that it becomes repetitive but that it does so prematurely, never coming anywhere near the limit of 10,000 iterations.

The middle-square system of Program 1 has three cycles, each of length four, as well as five fixed points. Figure 1 reveals the fates of all 10,000 seed values.

Each of the numbers printed in a blue lozenge is a *terminal value* of the middle-square process. Whenever the system reaches one of these numbers, it will thereafter remain trapped forever in a repeating pattern. The arrays of red dots show how many of the four-digit seed patterns feed into each of the terminal numbers. In effect, the diagram divides the space of 10,000 seed values into 17 watersheds, one for each terminal value. Once you know which watershed a seed belongs to, you know where it’s going to wind up.

The watersheds vary in size. The area draining into terminal value 6100 includes more than 3,100 seeds, and the fixed point 0000 gathers contributions from almost 2,000 seeds. At the other end of the scale, the fixed point 3792 has no tributaries at all; it is the successor of itself but of no other number.

For a clearer picture of what’s going on in the watersheds, we need to zoom in close enough to see the individual rills and streams and rivers that drain into each terminal value. Figure 2 provides such a view for the smallest of the three watersheds that terminate in cycles.

The four terminal values forming the cycle itself are again shown as blue lozenges; all the numbers draining into this loop are reddish brown. The entire set of numbers forms a directed graph, with arrows that lead from each node to its middle-square successor. Apart from the blue cycle, the graph is treelike: A node can have multiple incoming arrows but only one outgoing arrow, so that streams can merge but never split. Thus from any node there is only one path leading to the cycle. This structure has another consequence: Since some nodes have multiple inputs, other nodes must have none at all. They are shown here with a heavy outline.

The ramified nature of this graph helps explain why the middle-square algorithm tends to yield such short run lengths. The component of the graph shown in Figure 2 has 86 nodes (including the four members of the repeating cycle). But the longest pathways following arrows through the graph cover just 15 nodes before repetition sets in. The median run length is just 10 nodes.

A four-digit random number generator is little more than a toy; even if it provided the full run length of 10,000 numbers, it would be too small for modern applications. Working with “wider” numerals allows the middle-square algorithm to generate longer runs, but the rate of increase is still disappointing. For example, with eight decimal digits, the maximum potential cycle length is 100 million, but the actual median run length is only about 2,700. More generally, if the maximum run length is \(N\), the observed median run length seems to be \(c \sqrt{N}\), where \(c\) is a constant less than 1. This relationship is plotted in Figure 3, on the left for decimal numbers of 4 to 12 digits and on the right for binary numbers of 8 to 40 bits.

The graphs make clear that the middle-square algorithm yields only a puny supply of random numbers compared to the size of the set from which those numbers are drawn. In the case of 12-digit decimal numbers, for example, the pool of possibilities has \(10^{12}\) elements, yet a typical 12-digit seed yields only about 300,000 random numbers before getting stuck in a cycle. That’s only a third of a millionth of the available supply.

Von Neumann came up with the middle-square algorithm circa 1947, while planning the first computer simulations of nuclear fission. This work was part of a postwar effort to build ever-more-ghastly nuclear weapons. Random numbers were needed for a simulation of neutrons diffusing through various components of a bomb assembly. When a neutron strikes an atomic nucleus, it can either bounce off or be absorbed; in some cases absorption triggers a fission event, splitting the nucleus and releasing more neutrons. The bomb explodes only if the population of neutrons grows rapidly enough. The simulation technique, in which random numbers determined the fates of thousands of individual neutrons, came to be known as the Monte Carlo method.

The simulations were run on ENIAC, the pioneering vacuum-tube computer built at the University of Pennsylvania and installed at the Army’s Aberdeen Proving Ground in Maryland. The machine had just been converted to a new mode of operation. ENIAC was originally programmed by plugging patch cords into switchboard-like panels; the updated hardware allowed a sequence of instructions to be read into computer memory from punched cards, and then executed.

The first version of the Monte Carlo program, run in the spring of 1948, used middle-square random numbers with eight decimal digits; as noted above, this scheme yields a median run length of about 2,700. A second series of simulations, later that year, had a 10-digit pseudorandom generator, for which the median run length is about 30,000.

When I first read about the middle-square algorithm, I assumed it was used for these early ENIAC experiments and then supplanted as soon as something better came along. I was wrong about this.

One of ENIAC’s successors was a machine at the Los Alamos Laboratory called MANIAC.

Whereas ENIAC did decimal arithmetic, MANIAC was a binary machine. Each of its internal registers and memory cells could accommodate 40 binary digits, which Jackson and Metropolis called *bigits*.*bit* had been introduced in the late 1940s by John Tukey and Claude Shannon, but evidently it had not yet vanquished all competing alternatives.

Here is the program listing for the middle-square subroutine:

The entries in the leftmost column are line numbers; the next column shows machine instructions in symbolic form, with arrows signifying transfers of information from one place to another. The items in the third column are memory addresses or constants associated with the instructions, and the notations to the right are explanatory comments. In line 3 the number \(x_i\) is squared, with the two parts of the product stored in registers `R2`

and `R4`

. Then a sequence of left and right shifts (encoded `L(1)`

, `R(22)`

, and `L(1)`

, isolate the middle bigits, numbered 20 through 57, which become the value of \(x_{i+1}\).

MANIAC’s standard subroutine was probably still in use at Los Alamos as late as 1957. In that year the laboratory published *A Practical Manual on the Monte Carlo Method for Random Walk Problems* by E. D. Cashwell and C. J. Everett, which describes a 38-bit middle-square random number generator that sounds very much like the same program. Cashwell and Everett indicate that standard practice was always to use a particular seed known to yield a run length of “about 750,000.”

Another publication from 1957 also mentions a middle-square program in use at Los Alamos. The authors were W. W. Wood and F. R. Parker, who were simulating the motion of molecules confined to a small volume; this was one of the first applications of a new form of the Monte Carlo protocol, known as Markov chain Monte Carlo. Wood and Parker note: “The pseudo-random numbers were obtained as appropriate portions of 70 bit numbers generated by the middle square process.” This work was done on IBM 701 and 704 computers, which means the algorithm must have been implemented on yet another machine architecture.

John W. Mauchly, one of ENIAC’s principal designers, also spread the word of the middle-square method. In a 1949 talk at a meeting of the American Statistical Association he presented a middle-square variant for the BINAC, a one-off predecessor of the UNIVAC line of computers. I have also seen vague, anecdotal hints suggesting that some form of the middle-square algorithm was used on the UNIVAC 1 in the early 1950s at the Lawrence Livermore laboratory.

I have often wondered how it came about that one of the great mathematical minds of the 20th century conceived and promoted such a lame idea. It’s even more perplexing that the middle-square algorithm, whose main flaw was recognized from the start, remained in the computational toolbox for at least a decade. How did they even manage to get the job done with such a scanty supply of randomness?

With the benefit of further reading, I think I can offer some plausible guesses in answer to these questions. It helps to keep in mind the well-worn adage of L. P. Hartley: “The past is a foreign country. They do things differently there.”

If you are writing a Monte Carlo program today, you can take for granted a reliable and convenient source of high-quality random numbers. Whenever you need one, you simply call the function `random()`

. The syntax differs from one programming language to another, but nearly every modern language has such a function built in. Moreover, it can be counted on to produce vast quantities of randomness—cosmic cornucopias of the stuff. A generator called the Mersenne Twister, popular since the 1990s and the default choice in several programming languages, promises \(2^{19997} - 1\) values before repeating itself.

Of course no such software infrastructure existed in 1948. In those days, the place to find numbers of all kinds—logarithms, sines and cosines, primes, binomial coefficients—was a lookup table. Compiling such tables was a major mathematical industry. Indeed, the original purpose of ENIAC was to automate the production of tables for aiming artillery. The reliance on precomputed tables extended to random numbers. In the 1930s two pairs of British statisticians created tables of 15,000 and 100,000 random decimal digits; a decade later the RAND Corporation launched an even larger endeavor, generating a million random digits. In 1949 the RAND table became available on punched cards (50 digits to a card, 20,000 cards total); in 1955 it was published in book form. (The exciting opening paragraphs are reproduced in Figure 4.)

It would have been natural for the von Neumann group to adapt one of the existing tables to meet their needs. They avoided doing so because there was no room to store the table in computer memory; they would have to read a punched card every time a number was needed, which would have been intolerably slow. What they did instead was to use the middle-square procedure as if it were a table of a few thousand random numbers. *ENIAC in Action*.

In the preliminary stages of the project, the group chose a specific seed value (not disclosed in any documents I’ve seen), then generated 2,000 random iterates from this seed. They checked to be sure the program had not entered a cycle, and they examined the numbers to confirm that they satisfied various tests of randomness. At the start of a Monte Carlo run, the chosen seed was installed in the middle-square routine, which then supplied random numbers as needed until the 2,000th request was satisfied. At that point the algorithm was reset and restarted with the same seed. In this way the simulation could continue indefinitely, cycling through the same set of numbers in the same order, time after time. The second round of simulations worked much the same way, but with a block of 3,000 numbers from the 10-digit generator.

Reading about this reuse of a single, small block of numbers, I could not suppress a tiny gasp of horror. Even if those 2,000 numbers pass basic tests of randomness, the concatenation of the sequence with multiple copies of itself is surely ill-advised. Suppose you are using the random numbers to estimate the value of \(\pi\) by choosing random points in a square and determining whether they fall inside an inscribed circle. Let each random number specify the two coordinates of a single point. After running through the 2,000 middle-square values you will have achieved a certain accuracy. But repeating the same values is pointless: It will never improve the accuracy because you will simply be revisiting all the same points.

We can learn more about the decisions that led to the middle-square algorithm from the proceedings of a 1949 symposium on Monte Carlo methods. The symposium was convened in Los Angeles by a branch of the National Bureau of Standards. It served as the public unveiling of the work that had been going on in the closed world of the weapons labs. Von Neumann’s contribution was titled “Various Techniques Used in Connection With Random Digits.”

Von Neumann makes clear that the middle-square algorithm was not a spur-of-the-moment, off-the-cuff choice; he considered serveral alternatives. Hardware was one option. “We . . . could build a physical instrument to feed random digits directly into a high-speed computing machine,” he wrote. “The real objection to this procedure is the practical need for checking computations. If we suspect that a calculation is wrong, almost any reasonable check involves repeating something done before. . . . I think that the direct use of a physical supply of random digits is absolutely inacceptable for this reason and for this reason alone.”

The use of precomputed tables of random numbers would have solved the replication problem, but, as noted above, von Neumann dismissed it as too slow if the numbers had to be supplied by punch card. Such a scheme, he said, would rely on “the weakest portion of presently designed machines—the reading organ.”

Von Neumann also mentions one other purely arithmetic procedure, originally suggested by Stanislaw Ulam: iterating the function \(f(x) = 4x(1 – x)\), with the initial value \(x_0\) lying between 0 and 1. This formula is a special case of a function called the logistic map, which 25 years later would become the jumping off point for the field of study known as chaos theory. The connection with chaos suggests that the function might well be a good source of randomness, and indeed if \(x_0\) is an irrational real number, the successive \(x_i\) values are guaranteed to be uniformly distributed on the interval (0, 1), and will never repeat. But that guarantee applies only if the computations are performed with infinite precision.

Two other talks at the Los Angeles symposium also touched on sources of randomness. Preston Hammer reported on an evaluation of the middle-square algorithm performed by a group at Los Alamos, using IBM punch-card equipment. Starting from the ten-digit (decimal) seed value 1111111111, they produced 10,000 random values, checking along the way for signs of cyclic repetition; they found none. (The sequence runs for 17,579 steps before descending into the fixed point at 0.) They also applied a few statistical tests to the first 3,284 numbers in the sequence. “These tests indicated nothing seriously amiss,” Hammer concluded—a rather lukewarm endorsement.

Another analysis was described by George Forsythe (the amanuensis for von Neumann’s paper). His group at the Bureau of Standards studied the four-digit version of the middle-square algorithm (the same as Program 1), tracing 16 trajectories. Twelve of the sequences wound up in the 6100-2100-4100-8100 loop. The run lengths ranged from 11 to 104, with an average of 52. All of these results are consistent with those shown in Figure 1. Presumably, the group stopped after investigating just 16 sequences because the computational process—done with punch cards and desk calculators—was too arduous for a larger sample. (With a modern laptop, it takes about 40 milliseconds to determine the fates of all 10,000 four-digit seeds.)

Forsythe and his colleagues also looked at a variant of the middle-square process. Instead of setting \(x_{i+1}\) to the middle digits of \(x_i^2\), the variant sets \(x_{i+2}\) to the middle digits of \(x_i \times x_{i+1}\). One might call this the middle-product method. In a variant of the variant, the two factors that form the product are of different sizes. Runs are somewhat longer for these routines than for the squaring version, but Forsythe was less than enthusiastic about the results of statistical tests of uniform distribution.

Von Neumann was undeterred by these ambivalent reports. Indeed, he argued that a common failure mode of the middle-square algorithm—“the appearance of self-perpetuating zeros at the ends of the numbers \(x_i\)”—was not a bug but a feature. Because this event was easy to detect, it might guard the Monte Carlo process against more insidious forms of corruption.

Amid all these criticisms of the middle-square method, I wish to insert a gripe of my own that no one else seems to mention. The middle-square algorithm mixes up numbers and numerals in a way that I find disagreeable. In part this is merely a matter of taste, but the conflating of categories does have consequences when you sit down to convert the algorithm into a computer program.

I first encountered the distinction between numbers and numerals at a tender age; it was the subject of chapter 1 in my seventh-grade “New Math” textbook. I learned that a number is an abstraction that counts the elements of a set or assigns a position within a sequence. A numeral is a written symbol denoting a number. The decimal numeral 6 and the binary numeral 110 both denote the same number; so does the Roman numeral VI and the Peano numeral S(S(S(S(S(S(0)))))). Mathematical operations apply to numbers, and they give the same results no matter how those numbers are instantiated as numerals. For example, if \(a + b = c\) is a true statement mathematically, then it will be true in any system of numerals: decimal 4 + 2 = 6, binary 100 + 10 = 110, Roman IV + II = VI.

In the middle-square process, squaring is an ordinary mathematical operation: \(n^2\) or \(n \times n\) should produce the same number no matter what kind of numeral you choose for representing \(n\). But “extract the middle digits” is a different kind of procedure. There is no mathematical operation for accessing the individual digits of a number, because numbers don’t have digits. Digits are components of numerals, and the meaning of “middle digits” depends on the nature of those numerals. Even if we set aside exotica such as Roman numerals and consider only ordered sequences of digits written in place-value notation, the outcome of the middle-digits process depends on the radix, or base. The decimal numeral 12345678 and the binary numeral 101111000110000101001110 denote the same number, but their respective middle digits 3456 and 000110000101 are *not* equal.

It appears that von Neumann was conscious of the number-numeral distinction. In a 1948 letter to Alston Householder he referred to the products of the middle-square procedure as “digital aggregates.” What he meant by this, I believe, is that we should not think of the \(x_i\) as magnitudes or positions along the number line but rather as mere collections of digits, like letters in a written word.

This nice philosophical distinction has practical consequences when we write a computer program to implement the middle-square method. What is the most suitable data type for the \(x_i\)? The squaring operation makes numbers an obvious choice. Multiplication and raising to integer powers are basic functions offered by almost any programming language—but they apply only to numbers. For extracting the middle digits, other data structures would be more convenient. If we collect a sequence of digits in an array or a character string—an embodiment of von Neumann’s “digital aggregate”—we can then easily select the middle elements of the sequence by their position, or index. But squaring a string or an array is not something computer systems know how to do.

The JavaScript code that drives Program 1 resolves this conflict by planting one foot in each world. An \(x\) value is initially represented as a string of four characters, drawn from the digits 0 through 9. When it comes time to calculate the square of \(x\), the program invokes a built-in JavaScript procedure, `parseInt`

, to convert the string into a number. Then \(x^2\) is converted back into a string (with the `toString`

method) and padded on the left with as many 0 characters as needed to make the length of the string eight digits. Extracting the middle digits of the string is easy, using a `substring`

method. The heart of the program looks like this:

```
function mid4(xStr) {
const xNum = parseInt(xStr, 10); // convert string to number
const sqrNum = xNum * xNum; // square the number
let sqrStr = sqrNum.toString(10); // back to decimal string
sqrStr = padLeftWithZeros(sqrStr, 8); // pad left to 8 digits
const midStr = sqrStr.substring(2, 6); // select middle 4 digits
return midStr;
}
```

The program continually shuttles back and forth between arithmetic operations and methods usually associated with textual data. In effect, we abduct a number from the world of mathematics, do some extraterrestrial hocus-pocus on it, and set it back where it came from.

I don’t mean to suggest it’s impossible to implement the middle-square rule with purely mathematical operations. Here is a version of the function written in the Julia programming language:

```
function mid4(x)
return (x^2 % 1000000) ÷ 100
end
```

In Julia the `%`

operator computes the remainder after integer division; for example, `12345678 % 1000000`

yields the result `345678`

. The `÷`

operator returns the quotient from integer division: `345678 ÷ 100`

is `3456`

. Thus we have extracted the middle digits just by doing some funky grade-school arithmetic.

We can even create a version of the procedure that works with numerals of any width and any radix, although those parameters have to be specified explicitly:

```
function midSqr(x, radix, width) # width must be even
modulus = radix^((3 * width) ÷ 2)
divisor = radix^(width ÷ 2)
return (x^2 % modulus) ÷ divisor
end
```

This program is short, efficient, and versatile—but hardly a model of clarity. If I encountered it out of context, I would have a hard time figuring out what it does.

There are still more ways to accomplish the task of plucking out the middle digits of a numeral. The mirror image of the all-arithmetic version is a program that eschews numbers entirely and performs the computation in the realm of digit sequences. To make that work, we need to write a multiplication procedure for indexable strings or arrays of digits.

The MANIAC program shown above takes yet another approach, using a single data type that can be treated as a number one moment (during multiplication) and as a sequence of ones and zeros the next (in the left-shift and right-shift operations).

Let us return to von Neumann’s talk at the 1949 Los Angeles symposium. Tucked into his remarks is the most famous of all pronouncements on pseudorandom number generators:

Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number—there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.

Today we live under a new dispensation. Faking randomness with deterministic algorithms no longer seems quite so naughty, and the concept of pseudorandomness has even shed its aura of paradox and contradiction. Now we view devices such as the middle-square algorithm not as generators of *de novo* randomness but as amplifiers of pre-existing randomness. All the genuine randomness resides in the seed you supply; the machinations of the algorithm merely stir up that nubbin of entropy and spread it out over a long sequence of numbers (or numerals).

When von Neumann and his colleagues were designing the first Monte Carlo programs, they were entirely on their own in the search for an algorithmic source of randomness. There was no body of work, either practical or theoretical, to guide them. But that soon changed. Just three months after the Los Angeles symposium, a larger meeting on computation in the sciences was held at Harvard.

After reviewing the drawbacks of the middle-square method, Lehmer proposed the following simple function for generating a stream of pseudorandom numbers, each of eight decimal digits:

\[x_{i+1} = 23x_i \bmod 10000001\]

For any seed value in the range \(0 \lt x_0 \lt 10000001\) this formula generates a cyclic pseudorandom sequence with a period of 5,882,352, more than 2,000 times the median length of eight-digit middle-square sequences. And there are advantages apart from the longer runs. The middle-square algorithm produces lollipop loops, with a stem traversed just once before the system enters a short loop or becomes stuck at a fixed point. Only the stem portion is useful for generating random numbers. Lehmer’s algorithm, in contrast, yields long, simple, stemless loops; the system follows the entire length of 5,882,352 numbers before returning to its starting point.

Simple loops also make it far easier to detect when the system has in fact begun to repeat itself. All you have to do is compare each \(x_i\) with the seed value, \(x_0\), since that initial value will always be the first to repeat. With the stem-and-loop topology, cycle-finding is tricky. The best-known technique, attributed to Robert W. Floyd, is the hare-and-tortoise algorithm. It requires \(3n\) evaluations of the number-generating function to identify a cycle with run length \(n\).

Lehmer’s algorithm is an example of a linear congruential generator, with the general formula:

\[x_{i+1} = (ax_i + b) \bmod m.\]

Linear congruential algorithms have the singular virtue of being governed by well-understood principles of number theory. Just by inspecting the constants in the formula, you can know the complete structure of cycles in the output of the generator. Lehmer’s specific recommendation (\(a = 23\), \(b = 0\), \(m = 10000001\)) is no longer considered best practice, but other members of the family are still in use. With a suitable choice of constants, the generator achieves the maximum possible period of \(m - 1\), so that the algorithm marches through the entire range of numbers in a single long cycle.

At this point we come face to face with the big question: If the middle-square method is so feeble and unreliable, why did some very smart scientists persist in using it, even after the flaws had been pointed out and alternatives were on offer? Part of the answer, surely, is habit and inertia. For the veterans of the early Monte Carlo studies, the middle-square method was a known quantity, a path of least resistance. For those at Los Alamos, and perhaps a few other places, the algorithm became an off-the-shelf component, pretested and optimized for efficiency. People tend to go with the flow; even today, most programmers who need random numbers accept whatever function their programming language provides. That’s certainly my own practice, and I can offer an excuse beyond mere laziness: The default generator may not be ideal, but it’s almost certainly better than what I would cobble together on my own.

Another part of the answer is that the middle-square method was probably good enough to meet the needs of the computations in which it was used. After all, the bombs they were building *did* blow up.

Many simulations work just fine even with a crude randomness generator. In 2005 Richard Procassini and Bret Beck of the Livermore lab tested this assertion on Monte Carlo problems of the same general type as the neutron-diffusion studies done on ENIAC in the 1940s. They didn’t include the middle-square algorithm in their tests; all of their generators were of the linear congruential type. But their main conclusion was that “periods as low as \(m = 2^{16} = 65{,}536\) are sufficient to produce unbiased results.”

Not all simulations are so easy-going, and it’s not always easy to tell which ones are finicky and which are tolerant. Simple tasks can have stringent requirements, as with the estimation of \(\pi\) mentioned above. And even with repetitions, subtle biases or correlations can skew the results of Monte Carlo studies in ways that are hard to foresee. Thirty years ago an incident that came to be known as the Ferrenberg affair revealed flaws in random number generators that were then considered the best of their class. So I’m not leading a movement to revive the middle-square method.

I would like to close with two oddities.

First, in discussing Figure 1, I mentioned the special status of the number 3792, which is an *isolated* fixed point of the middle-square function. Like any other fixed point, 3792 is its own successor: \(3792^2\) yields 14,379,264, whose four middle digits are 3792 again. Unlike other fixed points, 3792 is not the successor of any other number among the 10,000 possible seeds. The only way it will ever appear in the output of the process is if you supply it as the input—the seed value.

In 1954 Nicholas Metropolis encountered the same phenomenon when he plotted the fates of all 256 seeds in the eight-bit binary middle-square generator. Here is his graph of the trajectories:

At the lower left is the isolated fixed point 165. In binary notation this number is 10100101, and its square is 0110101001011001, so it is indeed a fixed point. Metropolis comments: “There is one number, \(x = 165\), which, when expressed in binary form and iterated, reproduces itself and is never reproduced by any other number in the series. Such a number is called ‘samoan.’” The diagram also includes this label.

The mystery is: Why “samoan”? Metropolis offers no explanation. I suppose the obvious hypothesis is that an isolated fixed point is like an island, but if that’s the intended sense, why not pick some lonely island off by itself in the middle of an ocean? St. Helena, maybe. Samoa is a *group* of islands, and not particularly isolated. I have scoured the internets for other uses of “samoan” in a mathematical context, and come up empty. I’ve looked into whether it might refer not to the island nation but to the Girl Scout cookies called Samoas (but they weren’t introduced until the 1970s). I’ve searched for some connection with Margaret Mead’s famous book *Coming of Age in Samoa*. No clues. Please share if you know anything!

On a related note, when I learned that both the eight-bit binary and the four-digit decimal systems each have a single “samoan,” I began to wonder if this was more than coincidence. I found that the two-digit decimal function also has a single isolated fixed point, namely \(x_0 = 50\). Perhaps *all* middle-square functions share this property? Wouldn’t that be a cute discovery? Alas, it’s not so. The six-digit decimal system is a counterexample.

The second oddity concerns a supposed challenge to von Neumann’s priority as the inventor of the middle-square method. In The Broken Dice, published in 1993, the French mathematician Ivar Ekeland tells the story of a 13th-century monk, “a certain Brother Edvin, from the Franciscan monastery of Tautra in Norway, of whom nothing else is known to us.” Ekeland reports that Brother Edvin came up with the following procedure for settling disputes or making decisions randomly, as an alternative to physical devices such as dice or coins:

The player chooses a four-digit number and squares it. He thereby obtains a seven- or eight-digit number of which he eliminates the two last and the first one or two digits, so as to once again obtain a four-digit number. If he starts with 8,653, he squares this number (74,874,409) and keeps the four middle digits (8,744). Repeating the operation, he successively obtains:

8,653 8,744 4,575 9,306 6,016

Thus we have a version of the middle-square method 700 years before von Neumann, and only 50 years after the introduction of Indo-Arabic numerals into Europe. The source of the tale, according to Ekeland, is a manuscript that has been lost to the world, but not before a copy was made in the Vatican archives at the direction of Jorge Luis Borges, who passed it on to Ekeland.

The story of Brother Edvin has not yet made its way into textbooks or the scholarly literature, as far as I know, but it has taken a couple of small steps in that direction. In 2010 Bill Gasarch retold the tale in a blog post titled “The First pseudorandom generator- probably.” And the Wikipedia entry for the middle-square method presents Ekeland’s account as factual—or at least without any disclaimer expressing doubts about its veracity.

To come along now and insist that Brother Edvin’s algorithm is a literary invention, a fable, seems rather unsporting. It’s like debunking the Tooth Fairy. So I won’t do that. You can make up your own mind. But I wonder if von Neumann might have welcomed this opportunity to shuck off responsibility for the algorithm.

]]>The paper is “Information Theory and the Game of Jotto,” issued in August of 1971 as Artificial Intelligence Memo No. 28 from the AI Lab at MIT. The author was Michael D. Beeler, known to me mainly as one of the three principal authors of HAKMEM (the others were Bill Gosper and Rich Schroeppel). Beeler later worked at Bolt, Baranek, and Newman, an MIT spinoff.

Wikipedia tells me that Jotto was invented in 1955 by Morton M. Rosenfeld as a game for two players. As in Wordle, you try to discover a secret word by submitting guess words and getting feedback about how close you have come to the target. The big difference is that JOTTO’s feedback offers only a crude measure of closeness. You learn the number of letters in your guess word that match one of the letters in the target word. You get no indication of *which* letters match, or whether they are in the correct positions.

The unit of measure for closeness is the jot. Beeler gives the example of playing GLASS against SMILE, which earns a closeness score of two jots, since there are matches for the letter L and for one S. Unlike the Wordle feedback rule, this scoring scheme is symmetric: The score remains the same if you switch the roles of guess and target word.

A defect of the game, in my view, is that you can max out the score at five jots and still not know the target word. For example, when a five-jot score tells you that the letters of the target are {A, E, G, L, R}, the word could be GLARE, LAGER, LARGE, or REGAL. Your only way to pin down the answer is to guess them in sequence.

Beeler’s main topic is not how the game proceeds between human players but how a computer can be programmed to take the role of a player. He reports that “A JOTTO program has existed for a couple of years at MIT’s A.I. Lab,” meaning it was created sometime in the late 1960s. He says nothing about who wrote this program. I’m going to make the wild surmise that Beeler himself might have been the author, particularly given his intimate knowledge of the program’s innards.

Here’s the crucial passage, lifted directly from the memo:

The strategy described here—maximizing the information gain from each guess—is exactly what’s recommended for Wordle. But where Wordle divvies up the potential target words into \(3^5 = 243\) subsets, the JOTTO scoring rule defines only six categories (0 through 5 jots). As a result, the maximum possible information gain is only about 2.6 bits in JOTTO, compared with almost 8 bits in Wordle.

Beeler also recognized a limitation of this “greedy” strategy. “It is conceivable that the test word with the highest expectation at the current point in the game has a good chance of getting us to a point where we will NOT have any particularly good test words available . . . I am indebted to Bill Gosper for pointing out this possibility; the computation required, however, is impractical, and besides, the program seems to do acceptably as is.”

The JOTTO program was written in the assembly language of the PDP-6 and PDP-10 family of machines from the Digital Equipment Corporation, which were much loved in that era at MIT. (Beeler praises the instruction set as “very symmetrical, complete, powerful and easy to think in.”) But however elegant the architecture, physical resources were cramped, with a maximum memory capacity of about one megabyte. Nevertheless, Beeler found room for the program itself, for a dictionary of about 7,000 words, and for tables of precomputed responses to the first two or three guesses.

Humbling.

]]>I hasten to add that my motivation in this project is not to cheat. Josh Wardle, the creator of Wordle, took all the fun out of cheating by making it way too easy. Anybody can post impressive results like this one:

That lonely row of green squares indicates that you’ve solved the puzzle with a single brilliant guess. The Wordle app calls that performance an act of “Genius,” but if it happens more than once every few years, your friends may suspect another explanation.

The software I’ve written will not solve today’s puzzle for you. You can’t run the program unless you already know the answer. Thus the program won’t tell you, in advance, how to play; but it may tell you, retrospectively, how you *should have* played.

The aim in Wordle is to identify a hidden five-letter word. When you make a guess, the Wordle app gives you feedback by coloring the letter tiles. A green tile indicates that you have put the right letter in the right place.

The sequence of grids above records the progress of a game I played several weeks ago. The initial state of the puzzle is a blank grid with space for six words of five letters each. My first guess, CRATE *(far left)*, revealed that the letter T is present in the target word, but not in the fourth position; the other letters, CRAE, are absent. I then played BUILT and learned where the T goes, and that the letters I and L are also present. The third guess, LIMIT, got three letters in their correct positions, which was enough information to rule out all but one possibility. The four green tiles in row four confirm that LIGHT is the target word.

The official Wordle app gives you six chances to guess the Wordle-of-the-day. If you fill the grid with six wrong answers, the game ends and the app reveals the word you failed to find.

It’s important to bear in mind that Wordle is played in a closed universe. *Times*‘s revisions.

Both of the word lists are downloaded to your web browser whenever you play the game, and you can examine or copy them by peeking at the JavaScript source code with your browser’s developer tools.

Conceptually, a Wordle-playing program conducts a dialogue between two computational agents: the Umpire and the Player. The Umpire knows the target word, and responds to submitted guess words in much the same way the Wordle app does. It marks each of the five letters in the guess as either *correct*, *present*, or *absent*, corresponding to the Wordle color codes of green, gold, and gray.

The Player component of the program does *not* know the target word, but it has access to the lists of common and arcane words. On each turn, the Player selects a word from one of these lists, submits it to the Umpire, and receives feedback classifying each of the letters as *correct*, *present*, or *absent*. This information can then be used to help choose the next guess word. The guessing process continues until the Player identifies the target word or exhausts its six turns.

Here’s a version of such a program. Type in a start word and a target word, then press *Go*. Briefly displayed in gray letters are the words the program evaluates as possible next guesses. The most promising candidate is then submitted to the Umpire and displayed with the green-gold-gray color code.

Feel free to play with the buttons below the grid. The “`?`

” button will pop up a brief explanation. If you’d like to peek behind the curtain, the source code for the project is available on GitHub. Also, a standalone version of the program may be more convenient if you’re reading on a small screen.

This program is surely not the best possible Wordler, but it’s pretty good. Most puzzle instances are solved in three or four guesses. Failures to finish within six guesses are uncommon if you choose a sensible starting word. The algorithms behind Program 1 will be the main subject of this essay, but before diving into them I would like to consider some simpler approaches to playing the game.

When I first set out to write a Wordler program, I chose the easiest scheme I could think of. At the start of a new game, the Player chooses a word at random from the list of 2,315 potential target words. (The arcana will not be needed in this program.) After submitting this randomly chosen word as a first guess, the Player takes the Umpire’s feedback report and uses it to sift through the list of potential targets, retaining those words that are still viable candidates and setting aside all those that conflict in some way with the feedback. The Player then selects another word at random from among the surviving candidates, and repeats the process. On each round, the list of candidates shrinks further.

The winnowing of the candidate list is the heart of the algorithm. Suppose the first guess word is COULD, and the Umpire’s evaluation is equivalent to this Wordle markup: . You can now go through the list of candidate words and discard all those that have a letter other than L in the fourth position. You can also eliminate all words that include any instance of the letters C, U, or D. The gold letter O rules out all words in two disjoint classes: those that *don’t* have an O anywhere in their spelling, and those that *do* have an O in the second position. After this winnowing process, only seven words remain as viable candidates.

One aspect of these rules often trips me up when I’m playing. Wordle is not Wheel of Fortune: The response to a guess word might reveal that the target word has an L, but it doesn’t necessarily show you *all* the Ls. If you play COULD and get the feedback displayed above, you should not assume that the green L in the fourth position is the *only* L. The target word could be ATOLL, HELLO, KNOLL, or TROLL. (When both the guess and the target words have multiple copies of the same letter, the rules get even trickier. If you want to know the gory details, see Note 2.)

The list-winnowing process is highly effective. Even with a randomly chosen starter word, the first guess typically eliminates more than 90 percent of the target words, leaving only about 220 viable candidates, on average. In subsequent rounds the shrinkage is not as rapid, but it’s usually enough to identify a unique solution within the allotted six guesses.

I wrote the random Wordler as a kind of warmup exercise, and I didn’t expect much in the way of performance. Plucking words willy-nilly from the set of all viable candidates doesn’t sound like the shrewdest strategy. However, it works surprisingly well. The graph below shows the result of running the program on each of the 2,315 target words, with the experiment repeated 10,000 times to reduce statistical noise.

Guess pool: remaining target words

Start word: randomly chosen

The average number of guesses needed to find the solution is 4.11, and only 2 percent of the trials end in failure (requiring seven or more guesses).

Incidentally, this result is quite robust, in the sense that the outcome doesn’t depend at all on the composition of the words on the target list. If you replace the official Wordle list with 2,315 strings of random letters, the graph looks the same.

The surprising strength of a random player might be taken as a sign that Wordle isn’t really a very hard game. If you are guided by a single, simple rule—play any word that hasn’t already been excluded by feedback from prior guesses—you will win most games. From another point of view the news is not so cheering: If a totally mindless strategy can often solve the puzzle in four moves, you may have to work really hard to do substantially better.

Still another lesson might be drawn from the success of the random player: The computer’s complete ignorance of English word patterns is not a major handicap and might even be an advantage. A human player’s judgment is biased by ingrained knowledge of differences in word frequency. If I see the partial solution BE _ _ _, I’m likely to think first of common words such as BEGIN and BENCH, rather than the rarer BELLE and BEZEL. In the Wordle list of potential targets, however, each of these words occurs with exactly the same frequency, namely 1/2315. A policy favoring common words over rare ones is not helpful in this game.

But a case can be made for a slightly different strategy, based on a preference not for common words but for common letters. A guess word made up of high-frequency letters should elicit more information than a guess consisting of rare letters. By definition, the high-frequency letters are more likely to be present in the target word. Even when they are absent, that fact becomes valuable knowledge. If you make JIFFY your first guess and thereby learn that the target word does not contain a J, you eliminate 27 candidates out of 2,315. Playing EDICT and learning that the target has no E rules out 1,056 words.

The table below records the number of occurrences of each letter from A to Z in all the Wordle target words, broken down according to position within the word. For example, A is the first letter of 141 target words, and Z is the last letter of four words.

The same information is conveyed in the heatmap below, where lighter colors indicate more common letters.

These data form the basis of another Wordle-playing algorithm. Given a list of candidate words, we compute the sum of each candidate’s letter frequencies, and select the word with the highest score. For example, the word SKUNK gets 366 points for the initial S, then 10 points for the K in the second position, and so on through the rest of the letters for an aggregate score of 366 + 10 + 165 + 182 + 113 = 836. SKUNK would win over KOALA, which has a score of 832, but lose to PIGGY (851).

Testing the letter-frequency algorithm against all 2,315 target words yields this distribution of results:

Guess pool: remaining target words

Start word: determined by algorithm (SLATE)

The mean number of guesses is 3.83, noticeably better than the random-choice average of 4.11. The failure rate—words that aren’t guessed within six turns—is pushed down to 1.1 percent (28 out of 2,315).

There’s surely room for improvement in the letter-frequency algorithm. One weakness is that it treats the five positions in each word as independent variables, whereas in reality there are strong correlations between each letter and its neighbors. For example, the groups CH, SH, TH, and WH are common in English words, and Q is constantly clinging to U. Other pairs such as FH and LH are almost never seen. These attractions and repulsions between letters play a major role in human approaches to solving the Wordle puzzle (or at least they do in *my* approach). The letter-frequency program could be revised to take advantage of such knowledge, perhaps by tabulating the frequencies of bigrams (two-letter combinations). I have not attempted anything along these lines. Other ideas hijacked my attention.

Like other guessing games, such as Bulls and Cows, Mastermind, and Twenty Questions, Wordle is all about acquiring the information needed to unmask a hidden answer. Thus if you’re looking for guidance on playing the game, an obvious place to turn is the body of knowledge known as information theory, formulated 75 years ago by Claude Shannon.

Shannon’s most important contribution (in my opinion) was to establish that information is a quantifiable, measurable substance. The unit of measure is the *bit*, defined as the amount of information needed to distinguish between two equally likely outcomes. If you flip a fair coin and learn that it came up heads, you have acquired one bit of information.

This scheme is easy to apply to the game of Twenty Questions, where the questions are required to have just two possible answers—yes or no. If each answer conveys one bit of information, 20 questions yield 20 bits, which is enough to distinguish one item among \(2^{20} \approx 1\) million equally likely possibilties. In general, if a collection of things has \(N\) members, then the quantity of information needed to distinguish one of its members is the base-2 logarithm of \(N\), written \(\log_2 N\).

In Wordle we ask no more than six questions, but answers are not just yes or no. Each guess word is a query submitted to the Umpire, who answers by coloring the letter tiles. There are more than two possible answers; in fact, there are 243 of them. As a result, a Wordle guess has the potential to yield much more than one bit of information. But there’s also the possibility of getting *less* than one bit.

Where does that curious number 243 come from? In the Umpire’s response to a guess, each letter is assigned one of three colors, and there are five letters in all. Hence the set of all possible responses has \(3^5 = 243\) members. Here they are, in all their polychrome glory:

These color codes represent every feedback message you could ever possibly receive when playing Wordle. (And then some! The five codes outlined in pink can never occur. Note 2 explains why.) Each color pattern can be represented as a five-digit numeral in ternary (base-3) notation, with the digit 0 signifying *gray* or *absent*, 1 indicating *gold* or *present*, and 2 corresponding to *green* or *correct*. Because these are five-digit numbers, I’ve taken to calling them Zip codes. The all-gray 00000 code at the upper left appears in the Wordle grid when your guess has no letters in common with the target word. A solid-green 22222, at the bottom right, marks the successful conclusion of a game. In the middle of the grid is the all-gold 11111, which is reserved for “deranged anagrams”:

When you play a guess word in Wordle, you can’t know in advance which of the 243 Zip codes you’ll receive as feedback, but you can know the spectrum of possibilities for that particular word. Program 2,

Each of the slender bars that grow from the baseline of the chart represents a single Zip-code, from all-gray at the far left, through all-gold in the middle, and on to all-green at the extreme right.

The pattern of short and tall bars in Program 2 is something like an atomic spectrum for Wordle queries. Each guess word generates a unique pattern. A word like MOMMY or JIFFY yields a sparse distribution, with a few very tall bars and wide empty spaces between them. The sparsest of all is QAJAQ (a word on the arcane list that seems to be a variant spelling of KAYAK): All but 18 of the 243 Zip codes are empty. Words such as SLATE, CRANE, TRACE, and RAISE, on the other hand, produce a dense array of shorter bars, like a lush carpet of grass, where most of the categories are occupied by at least one target word.

Can these spectra help us solve Wordle puzzles? Indeed they can! The guiding principle is simple: Choose a guess word that has a “flat” spectrum, distributing the target words uniformly across the 243 Zip codes. The ideal is to divide the set of candidate words into 243 equal-size subsets. Then, when the Umpire’s feedback singles out one of these categories for further attention, the selected Zip code is sure to have relatively few occupants, making it easier to determine which of those occupants is the target word. In this way we maximize the expected information gain from each guess.

This advice to spread the target words broadly and thinly across the space of Zip codes may seem obvious and in no need of further justification. If you feel that way, read on. Those who need persuading (as I did) should consult Note 3.

The ideal of distributing words with perfect uniformity across the Wordle spectrum is, lamentably, unattainable. None of the 12,972 available query words achieves this goal. The spectra all have lumps and bumps and bare spots. The best we can do is choose the guess that comes closest to the ideal. But how are we to gauge this kind of closeness? What mathematical criterion should we adopt to compare those thousands of spiky spectra?

Information theory again waves its hand to volunteer an answer, promising to measure the number of bits of information one can expect to gain from any given Wordle spectrum. The tool for this task is an equation that appeared in Shannon’s 1948 paper “A Mathematical Theory of Communication.” I have taken some liberties with the form of the equation, adapting it to the peculiarities of the Wordle problem and also trying to make its meaning more transparent. For the mathematical details see Note 4.

Here’s the equation in my preferred format:

\[H_w = \sum_{\substack{i = 1\\n_i \ne 0}}^{242} \frac{n_i}{m} (\log_2 m - \log_2 n_i) .\]

\(H_w\), the quantity we are computing, is the amount of information we can expect to acquire by playing word \(w\) as our next guess. *entropy* at the suggestion of John von Neumann, who pointed out that “no one knows what entropy really is, so in a debate you will always have the advantage.” The letter \(H\) (which might be a Greek Eta) was introduced by Ludwig Boltzmann.*entropy*, in analogy with the measure of disorder in physics. A Wordle spectrum has higher entropy when the target words are more broadly dispersed among the Zip codes.

On the righthand side of the equation we sum contributions from all the occupied Zip codes in \(w\)’s Wordle spectrum. The index \(i\) on the summation symbol \(\Sigma\) runs from 0 to 242 (which is 00000 to 22222 in ternary notation), enumerating all the Zip codes in the spectrum. The variable \(n_i\) is the number of target words assigned to Zip code \(i\), and \(m\) is the total number of target words included in the spectrum. At the start of the game, \(m = 2{,}315\), but after each guess it gets smaller.

Now let’s turn to the expression in parentheses. Here \(\log_2 m\) is the amount of information—the number of bits—needed to distinguish the target word among all \(m\) candidates. Likewise \(\log_2 n_i\) is the number of bits needed to pick out a single word from among the \(n_i\) words in Zip code \(i\). The difference between these quantities, \(\log_2 m - \log_2 n_i\), is the amount of information we gain if Zip code \(i\) turns out to be the correct choice—if it harbors the target word and is therefore selected by the Umpire.

Perhaps a numerical example will help make all this clearer. If \(m = 64\), we have \(\log_2 m = 6\): It would take 6 bits of information to single out the target among all 64 candidates. If the target is found in Zip code \(i\) and \(n_i = 4\), we still need \(\log_2 4 = 2\) bits of information to pin down its identity. Thus in going from \(m = 64\) to \(n_i = 4\) we have gained \(6 - 2 = 4\) bits of information.

So much for what we stand to gain if the target turns out to live in Zip code \(i\). We also have to consider the probability of that event. Intuitively, if there are more words allocated to a Zip code, the target word is more likely to be among them. The probability is simply \(n_i / m\), the fraction of all \(m\) words that land in code \(i\). Hence \(n_i / m\) and \(\log_2 m - \log_2 n_i\) act in opposition. Piling more words into Zip code \(i\) increases the probability of finding the target there, but it also increases the difficulty of isolating the target among the \(n_i\) candidates.

Each Zip code makes a contribution to the total expected information gain; the contribution of Zip code \(i\) is equal to the product of \(n_i / m\) and \(\log_2 m - \log_2 n_i\). Summing the contributions coming from all 243 categories yields \(H_w\), the amount of information we can expect to gain from playing word \(w\).

In case you find programming code more digestible than either equations or words, here is a JavaScript function for computing \(H_w\). The argument `spectrum`

is an array of 243 numbers representing counts of words assigned to all the Zip codes.

```
function entropy(spectrum) {
const nzspectrum = spectrum.filter(x => x > 0);
const m = sum(spectrum);
let H = 0;
for (let n of spectrum) {
H += n/m * (log2(m) - log2(n));
}
return H;
}
```

In Program 2, this code calculates the entropy value labeled \(H\) in the Statistics panel. The same function is invoked in Program 1 when the “Maximize entropy” algorithm is selected.

Before leaving this topic behind, it’s worth pausing to consider a few special cases. If \(n_i = 1\) (*i.e.*, there’s just a single word in Zip code \(i\)), then \(\log_2 n_i = 0\), and the information gain is simply \(\log_2 m\); we have acquired all the information we need to solve the puzzle. This result makes sense: If we have narrowed the choice down to a single word, it has to be the target. Now suppose \(n_i = m\), meaning that all the candidate words have congregated in a single Zip code. Then we have \(\log_2 m - \log_2 n_i = 0\), and we gain nothing by choosing word \(w\). We had \(m\) candidates before the guess, and we still have \(m\) candidates afterward.

What if \(n_i = 0\)—that is, Zip code \(i\) is empty? `spectrum.filter(x => x > 0)`

does the same thing. This exclusion does no harm because if \(n_i\) is zero, then \(n_i / m\) is also zero, meaning there’s no chance that category \(i\) holds the winner. (If a Zip code has no words at all, it can’t have the winning word.)

\(H_w\) attains its maximum value when all the \(n_i\) are equal. As mentioned above, there’s no word \(w\) that achieves this ideal, but we can certainly calculate how many bits such a word would produce if it did exist. In the case of a first guess, each \(n_i\) must equal \(2315 / 243 \approx 9.5\), and the total information gain is \(\log_2 2315 - \log_2 9.5 \approx 7.9\) bits. This is an upper bound for a Wordle first guess; it’s the most we could possibly get out of the process, even if we were allowed to play any arbitrary string of five letters as a guess. As we’ll see below, no real guess gains as much as six bits.

These mathematical tools of information theory suggest a simple recipe for a Wordle-playing computer program. At any stage of the game the word to play is the one that elicits the most information about the target. We can estimate the information yield of each potential guess by computing its Wordle spectrum and applying the \(H_w\) equation to the resulting sequence of numbers. That’s what the JavaScript code below does.

```
function maxentropy(guesswordlist, targetwordlist) {
maxH = 0
bestguessword = ""
for (g in guesswordlist) { // outer loop
spectrum = []
for t in targetwordlist { // inner loop
zipcode = score(g, t)
spectrum(zipcode) += 1
}
H = entropy(spectrum)
if (H > maxH) {
maxH = H
bestguessword = g
}
}
return bestguessword
}
```

In this function the outer loop visits all the guess words; then for each of these words the inner loop sifts through all the potential target words, constructing a spectrum. When the spectrum for guess word `g`

is complete, the `entropy`

procedure computes the information `H`

to be gained by playing word `w`

. The `maxH`

and `bestguessword`

variables keep track of the best results seen so far, and ultimately `bestguessword`

is returned as the result of the function.

This procedure can be applied at any stage of the Wordling process, from the first guess to the last. When choosing the first guess—the word to be entered in the blank grid at the start of the game—all 2,315 common words are equally likely potential targets. We also have 12,972 guess words to consider, drawn from both the common and arcane lists. Calculating the information gain for each such starter word reveals that the highest score goes to SOARE, at 5.886 bits. (SOARE is apparently either a variant spelling of SORREL, a reddish-brown color, or an obsolete British term for a young hawk.) ROATE, a variant of ROTE, and RAISE are not far behind. At the bottom of the list is the notorious QAJAQ, providing just 1.892 bits of information.

Adopting SOARE as a starter word, we can then play complete games of Wordle with this algorithm, recording the number of guesses required to find all 2,315 target words. The result is a substantial improvement over the random and the letter-frequency methods.

Guess pool: common and arcane

Start word: SOARE

The average number of guesses per game is down to about 3.5, and almost 96 percent of all games are won in either three or four guesses. Only one target word requires six guesses, and there are no lost games, requiring seven or more guesses. As a device for Wordling, Shannon’s information theory is quite a success!

Shannon’s entropy equation was explicitly designed for the function it performs here—finding the distribution with maximum entropy. But if we consider the task more broadly as looking for the most widely dispersed and most nearly uniform distribution, other approaches come to mind. For example, a statistician might suggest variance or standard deviation as an alternative to Shannon entropy. The standard deviation is defined as:

\[\sigma = \sqrt{\frac{\Sigma_i (n_i - \mu)^2}{N}},\]

where \(\mu\) is the average of the \(n_i\) values and \(N\) is the number of values. In other words, we are measuring how far the individual elements of the spectrum differ from the average value. If the target words were distributed uniformly across all the Zip codes, every \(n_i\) would be equal to \(\mu\), and the standard deviation \(\sigma\) would be zero. A large \(\sigma\) indicates that many \(n_i\) differ greatly from the mean; some Zip codes must be densely populated and others empty or nearly so. Our goal is to find the guess word whose spectrum has the smallest possible standard deviation.

In the Wordling program, it’s an easy matter to substitute minimum standard deviation for maximum entropy as the criterion for choosing guess words. The JavaScript code looks like this:*don’t* exclude empty Zip codes. Doing so would badly skew the results. A spectrum with all words crowded into a single Zip code would have \(\sigma = 0\), making it seem the most—rather than the least—desirable configuration.

```
function stddev(spectrum) {
const mu = sum(spectrum) / 243;
const diffs = spectrum.map(x => x - mu);
const variance = sum(diffs.map(x => x * x)) / 243;
return Math.sqrt(variance);
}
```

In Program 1, you can see the standard deviation algorithm in action by selecting the button labeled “Minimum std dev.” In Program 2, standard deviation values are labeled \(\sigma\) in the Statistics panel.

Testing the algorithm with all possible combinations of a starting word and a target word reveals that the smallest standard deviation is 22.02, and this figure is attained only by the spectrum of the word ROATE. Not far behind are RAISE, RAILE, and SOARE. At the bottom of the list, the worst choice by this criterion is IMMIX, with \(\sigma = 95.5\).

Using ROATE as the steady starter word, I ran a full set of complete games, surveying the standard-deviation program’s performance across all 2,315 target words. I was surprised at the outcome. Although the chart looks somewhat different—more 4s, fewer 3s—the average number of guesses came within a whisker of equalling the max-entropy result: 3.54 vs. 3.53. Of particular note, there are fewer words requiring five guesses, and a few more are solved with just two guesses.

Guess pool: common and arcane

Start word: ROATE

Why was I surprised by the strength of the standard-deviation algorithm? Well, as I said, Shannon’s \(H\) equation is a tool designed specifically for this job. Its mathematical underpinnings assert that no other rule can extract information with greater efficiency. That property seems like it ought to promise superior performance in Wordle. Standard deviation, on the other hand, is adapted to problems that take a different form. In particular, it is meant to measure dispersion in distributions with a normal or Gaussian shape. There’s no obvious reason to expect Wordle spectra to follow the normal law. Nevertheless, the \(\sigma\) rule is just as successful in choosing winners.

Following up on this hint that the max-entropy algorithm is not the only Wordle wiz, I was inspired to try a rule even simpler than standard deviation. In this algorithm, which I call “max scatter,” we choose the guess word whose spectrum has the largest number of occupied Zip codes. In other words, we count the bars that sprout up in Program 2, but we ignore their height. In the Statistics panel of Program 2, the max-scatter results are designated by the letter \(\chi\) (Greek chi), which I chose by analogy with the indicator function of set theory. In Program 1, choose “Maximize scatter” to Wordle this way.

If we adopt the \(\chi\) standard, the best first guess in Wordle is TRACE, which scatters target words over 150 of the 243 Zip codes. Other strong choices are CRATE and SALET (148 codes each) and SLATE and REAST (147). The bottom of the heap is good ole QAJAQ, with 18.

Using TRACE as the start word and averaging over all target words, the performance of the scatter algorithm actually exceeds that of the max-entropy program. The mean number of guesses is 3.49. There are no failures requiring 7+ guesses, and only one target word requires six guesses.

Guess pool: common and arcane

Start word: TRACE

What I find most noteworthy about these results is how closely the programs are matched, as measured by the average number of guesses needed to finish a game. It really looks as if the three criteria are all equally good, and it’s a matter of indifference which one you choose. This (tentative) conclusion is supported by another series of experiments. Instead of starting every game with the word that appears best for each algorithm, I tried generating random pairs of start words and target words, and measured each program’s performance for 10,000 such pairings. Having traded good start words for randomly selected ones, it’s no surprise that performance is somewhat weaker across the board, with the average number of guesses hovering near 3.7 instead of 3.5. But all three algorithms continue to be closely aligned, with figures for average outcome within 1 percent. And in this case it’s not just the averages that line up; the three graphs look very similar, with a tall peak at four guesses per game.

As I looked at these results, it occurred to me that the programs might be so nearly interchangeable for a trivial reason: because they are playing identical games most of the time. Perhaps the three criteria are similar enough that they often settle on the same sequence of guess words to solve a given puzzle. This *does* happen: There are word pairs that elicit exactly the same response from all three programs. It’s not a rare occurrence. But there are also plenty of examples like the trio of game grids in Figure 11, where each program discovered a unique pathway from FALSE to VOWEL.

A few further experiments show that the three programs arrive at three distinct solutions in about 35 percent of random games; in the other 65 percent, at least two of the three solutions are identical. In 20 percent of the cases all three are alike.

I puzzled over these observations for some time. If the algorithms discover wholly different paths through the maze of words, why are those paths so often the same length? I now have a clue to an answer, but it may not be whole story, and I would welcome alternative explanations.

My mental model of what goes on inside a Wordling program runs like this: The program computes the spectrum of each word to be considered as a potential guess, then computes some function—\(H\), \(\sigma\), or \(\chi\)—that reduces the spectrum to a single number. The number estimates the expected quality or efficiency of the word if it is taken as the next guess in the game. Finally we choose the word that either maximizes or minimizes this figure of merit (the word with largest value of \(H\) or \(\chi\), or the smallest value of \(\sigma\)).

So far so good, but there’s an unstated assumption in that scheme: I take it for granted that evaluating the spectra will always yield a single, unique extreme value of \(H\), \(\sigma\), or \(\chi\). What happens if two words are tied for first place? One could argue, of course, that if the words earn the same score, they must be equally good choices, and we should pick one of them at random or by some arbitrary rule. Even if there are three or four or a dozen co-leaders, the same reasoning should apply. But when there are two thousand words all tied for first place, I’m not so sure.

Can such massive traffic jams at the finish line actually happen in a real game of Wordle? At first I didn’t even consider the possibility. When I rated all possible starting words for the Shannon max-entropy algorithm, the ranking turned out to be a total order: In the list of 12,792 words there were no ties—no pairs of words whose spectra have the same \(H\) value. Hence there’s no ambiguity about which word is best, as measured by this criterion.

But this analysis applies only to the opening play—the choice of a first guess, when all the common words are equally likely to be the target. In the endgame the situation is totally different. Suppose the list of 2,315 common words has been whittled down to just five viable candidates for Wordle-of-the-day. When the five target words are sorted into Zip-code categories and the entropy of these patterns is calculated (excluding empty Zip codes), there are only seven possible outcomes, as shown in Figure 13. The patterns correspond to the seven ways of partitioning the number 5 (namely 5, 4+1, 3+2, 3+1+1, 2+2+1, 2+1+1+1, and 1+1+1+1+1).

In this circumstance, the 12,972 guess words cannot all have unique values of \(H\). On the contrary, with only seven distinct values to go around, there must be thousands of tie scores in the ranking of the words. Thus the max-entropy algorithm cannot pick a unique winner; instead it assembles a large class of words that, from the program’s point of view, are all equally good choices. Which of those words ultimately becomes the next guess is more or less arbitrary. For the most part, my programs pick the one that comes first in alphabetical order.

The same arguments apply to the minimum-standard-deviation algorithm. As for the max-scatter function, that has numerous tied scores even when the number of candidates is large. Because the variable \(\chi\) takes on integer values in the range from 1 to 243 (and in practice never strays outside the narrower range 18 to 150), there’s no avoiding an abundance of ties.

The presence of all these co-leaders offers an innocent explanation of how the three algorithms might arrive at solutions that are different in detail but generally equal in quality. Although the programs pick different words, those words come from the same equivalence class, and so they yield equally good outcomes.

But a doubt persists. Can it really be true that hundreds or thousands of words are all equally good guesses in some Wordle positions? Can you choose any one of those words and expect to finish the game in the same number of plays?

Let’s go back to Figure 11, where the maximum-entropy program follows the opening play of FALSE with DETER and then BINGO. Some digging through the entrails of the program reveals that BINGO is not the uniquely favored guess at this point in the progress of the game; VENOM, VINYL, and VIXEN are tied with BINGO at \(H = 3.122\) bits. The program chooses BINGO simply because it comes first alphabetically. As it happens, the choice is an unfortunate one. Any of the other three words would have concluded the game more quickly, in four guesses rather than five.

Does that mean we now have unequivocal evidence that the program is suboptimal? Not really. At the stage of the game where BINGO was chosen, there were still 10 viable possibilities for the target word. The true target might have been LIBEL, for example, in which case BINGO would have been superior to VENOM or VIXEN.

Human players see Wordle as a word game. What else could it be? To solve the puzzle you scour the dusty corners of your vocabulary, searching for words that satisfy certain constraints or fit a given template. You look for patterns of vowels and consonants and other curious aspects of English orthography.

For the algorithmic player, on the other hand, Wordle is a numbers game. What counts is maximizing or minimizing some mathematical function, such as entropy or standard deviation. The words and letters all but disappear.

The link between words and numbers is the Umpire’s scoring rule, with the spectrum of Zip codes that comes out of it. Every possible combination of a guess word and a target word gets reduced to a five-digit ternary number. Instead of computing each of these numbers as the need arises, we can precompute the entire set of Zip codes and store it in a matrix. The content of the matrix is determined by the letters of the words we began with, but once all the numbers have been filled in, we can dispense with the words themselves. Operations on the matrix depend only on the numeric indices of the columns and rows. We can retrieve any Zip code by a simple table lookup, without having to think about the coloring of letter tiles.

In exploring this matrix, let’s set aside the arcana for the time being and work only with common words. With 2,315 possible guesses and the same number of possible targets, we have \(2{,}315^2 \approx 5.4\) million pairings of a guess and a target. Each such pairing gets a Zip code, which can be represented by a decimal integer in the range from 0 to 242. Because these small numbers fit in a single byte of computer memory, the full matrix occupies about five-and-a-half megabytes.

Figure 14 is a graphic representation of the matrix. Each column corresponds to a guess word, and each row to a target word. The words are arranged in alphabetical order from left to right and top to bottom. The matrix element where a column intersects a row holds the Zip code for that combination. The color scheme is the same as in Program 2. The main diagonal (upper left to lower right) is a bright green stripe because each word when played against itself gets a score of 22222, equivalent to five green tiles. The blocks of lighter green all along the diagonal mark regions where guess words and target words share the same first letter and hence have scores with at least one green tile. The largest of these blocks is for words beginning with the letter S. A few other blocks are quite tiny, reflecting the scarcity of words beginning with J, K, Q, Y, and Z. A box for X is lacking entirely; the Wordle common list has no words beginning with X.

Gazing deeply into the tweedy texture of Figure 14 reveals other curious structures, but some features are misleading. The matrix appears to be symmetric across the main diagonal: The point at (*a, b*) always has the same color as the point at (*b, a*). But the symmetry is an artifact of the graphic presentation, where colors are assigned based only on the *number* of gray, gold, and green tiles, ignoring their positions within a word. The underlying mathematical matrix is not symmetric.

I first built this matrix as a mere optimization, a way to avoid continually recomputing the same Zip codes in the course of a long series of Wordle games. But I soon realized that the matrix is more than a technical speed boost. It encapsulates pretty much everything one might want to know about the entire game of Wordle.

Figure 15 presents a step-by-step account of how matrix methods lead to a Wordle solution. In the first stage *(upper left)* the player submits an initial guess of SLATE, which singles out column 1778 in the matrix (1778 being the position of SLATE in the alphabetized list of common words). The second stage *(upper right)* reveal’s the Umpire’s feedback, coloring the tiles as follows: . To the human solver this message would mean: “Look for words that have an S (but not in front) and a T (but not in fourth position).” To the computer program the message says: “Look for rows in the matrix whose 1778th entry is equal to 10010 (base 3) or 84 (base 10). It turns out there are 24 rows meeting this condition, which are highlighted in blue. (Some lines are too closely spaced to be distinguished.) I have labeled the highlighted rows with the words they represent, but the labels are for human consumption only; the program has no need of them.

In the third panel *(lower left)*, the program has chosen a second guess word, TROUT, which earns the feedback . I should mention that this is not a word I would have considered playing at this point in the game. The O and U make sense, because the first guess eliminated A and E and left us in need of a vowel; but the presence of two Ts seems wasteful. We already know the target word has a T, and pinning down where it is seems less important than identifying the target word’s other constituent letters. I might have tried PROUD here. Yet TROUT is a brilliant move! It eliminates 7 words that start with T, 12 words that don’t have an R, 17 words that have either an O or a U or both, and 4 words that don’t end in T. In the end only one word remains in contention: FIRST.

All this analysis of letter positions goes on only in my mind, not in the algorithm. The computational process is simpler. The guess TROUT designates column 2118 of the matrix. Among the 24 rows identified by the first guess, SLATE, only row 743 has the Zip code value 01002 (or 29 decimal) at its intersection with column 2118. Row 743 is the row for FIRST. Because it is the only remaining viable candidate, it must be the target word, and this is confirmed when it is played as the third guess *(lower right)*.

There’s something disconcerting—maybe even uncanny—about a program that so featly wins a word game while ignoring the words. Indeed, you can replace all the words in Wordle with utter gibberish—with random five-letter strings like SCVMI, AHKZB, and BOZPA—and the algorithm will go on working just the same. It has to work a little harder—the average number of guesses is near 4—but a human solver would find the task almost impossible.

The square matrix of Figure 14 includes only the common words that serve as both guesses and targets in Wordle. Including the arcane words expands the matrix to 12,972 columns, with a little more than 30 million elements. At low resolution this wide matrix looks like this:

If you’d like to see all 30 million pixels up close and personal, download the 90-megabyte TIFF image.

Commentators on Wordle have given much attention to the choice of a first guess: the start word, the sequence of letters you type when facing a blank grid. At this point in the game you have no clues of any kind about the composition of the target word; all you know is that it’s one of 2,315 candidates. Because this list of candidates is the same in every game, it seems there might be one universal best starter word—an opening play that will lead to the smallest average number of guesses, when the average is taken over all the target words. Furthermore, because 2,315 isn’t a terrifyingly large number, a brute-force search for that word might be within the capacity of a less-than-super computer.

Start-word suggestions from players and pundits are a mixed lot. A website called Polygon offers some whimsical ideas, such as FARTS and OUIJA. GameRant also has oddball options, including JUMBO and ZAXES. Tyler Glaiel offers better advice, reaching deep into the arcana to come up with SOARE and ROATE. Grant Sanderson skillfully deduces that CRANE is the best starter, then takes it all back.

For each of the algorithms I have mentioned above (except the random one) the algorithm itself can be pressed into service to suggest a starter. For example, if we apply the max-entropy algorithm to the entire set of potential guess words, it picks out the one whose spectrum has the highest \(H\) value. As I’ve already noted, that word is SOARE, with \(H = 5.886\). Table 1 gives the 15 highest-scoring choices, based on this kind of analysis, for the max-entropy, min–standard deviation and max-scatter algorithms.

Max Entropy | |
---|---|

SOARE | 5.886 |

ROATE | 5.883 |

RAISE | 5.878 |

RAILE | 5.866 |

REAST | 5.865 |

SLATE | 5.858 |

CRATE | 5.835 |

SALET | 5.835 |

IRATE | 5.831 |

TRACE | 5.831 |

ARISE | 5.821 |

ORATE | 5.817 |

STARE | 5.807 |

CARTE | 5.795 |

RAINE | 5.787 |

Min Std Dev | |
---|---|

ROATE | 22.02 |

RAISE | 22.14 |

RAILE | 22.22 |

SOARE | 22.42 |

ARISE | 22.72 |

IRATE | 22.73 |

ORATE | 22.76 |

ARIEL | 23.05 |

AROSE | 23.20 |

RAINE | 23.41 |

ARTEL | 23.50 |

TALER | 23.55 |

RATEL | 23.97 |

AESIR | 23.98 |

ARLES | 23.98 |

Max Scatter | |
---|---|

TRACE | 150 |

CRATE | 148 |

SALET | 148 |

SLATE | 147 |

REAST | 147 |

PARSE | 146 |

CARTE | 146 |

CARET | 145 |

PEART | 145 |

CARLE | 144 |

CRANE | 142 |

STALE | 142 |

EARST | 142 |

HEART | 141 |

REIST | 141 |

Perusing these lists reveals that they all differ in detail, but many of the same words appear on two or more lists, and certains patterns turn up everywhere. A disproportionate share of the words have the letters A, E, R, S, and T; and half the alphabet never appears in any of the lists.

Figure 17 looks at the distribution of starter quality ratings across the entire range of potential guess words, both common and arcane. The 12,972 starters have been sorted from best to worst, as measured by the max-entropy algorithm. The plot gives the number of bits of information gained by playing each of the words as the initial guess, in each case averaging over all 2,315 target words.

The peculiar shape of the curve tells us something important about Wordling strategies. A small subset of words in the upper left corner of the graph make exceptionally good starters. In the first round of play they elicit almost six bits of information, which is half of what’s needed to finish the game. At the other end of the curve, a slightly larger cohort of words produce really awful results as starters, plunging down to the 1.8 bits of QAJAQ. In between these extremes, the slope of the curve is gradual, and there are roughly 10,000 words that don’t differ greatly in their performance.

These results might be taken as the last word on first words, but I would be a cautious about that. The methodology makes an implicit “greedy” assumption: that the strongest first move will always lead to the best outcome in the full game. It’s rather like assuming that the tennis player with the best serve will always win the match. Experience suggests otherwise. Although a strong start is usually an advantage, it’s no guarantee of victory.

We can test the greedy assumption in a straightforward if somewhat laborious way: For each proposed starter word, we run a complete set of 2,315 full games—one for each of the target words—and we keep track of the average number of guesses needed to complete a game. Playing 2,315 games takes a few minutes for each algorithm; doing that for 12,792 starter words exceeds the limits of my patience. But I have compiled full-game results for 30 starter words, all of them drawn from near the top of the first-round rankings.

Table 2 gives the top-15 results for each of the algorithms. Comparing these lists with those of Table 1 reveals that first-round supremacy is not in fact a good predictor of full-game excellence. None of the words leading the pack in the first-move results remain at the head of the list when we measure the outcomes of complete games.

Max Entropy | |
---|---|

REAST | 3.487 |

SALET | 3.492 |

TRACE | 3.494 |

SLATE | 3.495 |

CRANE | 3.495 |

CRATE | 3.499 |

SLANE | 3.499 |

CARTE | 3.502 |

CARLE | 3.503 |

STARE | 3.504 |

CARET | 3.505 |

EARST | 3.511 |

SNARE | 3.511 |

STALE | 3.512 |

TASER | 3.514 |

Min Std Dev | |
---|---|

TRACE | 3.498 |

CRATE | 3.503 |

REAST | 3.503 |

SALET | 3.508 |

SLATE | 3.508 |

SLANE | 3.511 |

CARTE | 3.512 |

CARLE | 3.516 |

CRANE | 3.517 |

CARSE | 3.523 |

STALE | 3.524 |

CARET | 3.524 |

STARE | 3.525 |

EARST | 3.527 |

SOREL | 3.528 |

Max Scatter | |
---|---|

SALET | 3.428 |

REAST | 3.434 |

CRANE | 3.434 |

SLATE | 3.434 |

CRATE | 3.435 |

TRACE | 3.435 |

CARLE | 3.437 |

SLANE | 3.438 |

CARTE | 3.444 |

STALE | 3.450 |

TASER | 3.450 |

CARET | 3.451 |

EARST | 3.452 |

CARSE | 3.453 |

STARE | 3.454 |

What’s the lesson here? Do I recommend that when you get up in the morning to face your daily Wordle, you always start the game with REAST or TRACE or SALET or one of the other words near the top of these lists? That’s not *bad* advice, but I’m not sure it’s the *best* advice. One problem is that each of these algorithms has its own favored list of starting words. Your own personal Wordling algorithm—whatever it may be—might respond best to some other, idiosyncratic, set of starters.

Moreover, my spouse, who Wordles and Quordles and Octordles and Absurdles, reminds me gently that it’s all a game, meant to be fun, and some people may find that playing the same word day after day gets boring.

Figure 18 presents the various algorithms at their shiny best, each one using the starter word that brings out its best performance. For comparison I’ve also included my own Wordling record, based on 126 games I’ve played since January. I’m proud to say that I Wordle better than a random number generator.

I turn now from the opening move to the endgame, which I find the most interesting part of Wordle—but also, often, the most frustrating. It seems reasonable to say that the endgame begins when the list of viable targets has been narrowed down to a handful, or when most of the letters in the target are known, and only one or two letters remain to be filled in.

Occasonally you might find yourself entering the endgame on the very first move. There’s the happy day when your first guess comes up all green—an event I have yet to experience. Or you might have a close call, such as playing the starter word EIGHT and getting this feedback: . Having scored four green letters, it looks like you’ve got an easy win. With high hopes, you enter a second guess of FIGHT, but again you get the same four greens and one gray. So you type out LIGHT next, and then MIGHT. After two more attempts, you are left with this disappointing game board:

What seemed like a sure win has turned into a wretched loss. You have used up your six turns, but you’ve not found the Wordle-of-the-day. Indeed, there are still three candidate targets yet to be tried: SIGHT, TIGHT, and WIGHT.

The _ IGHT words are by no means the only troublemakers. Similar families of words match the patterns _ ATCH, _ OUGH, _ OUND, _ ASTE and _ AUNT. You can get into even deeper endgame woes when an initial guess yields three green tiles. For example, 12 words share the template S_ _ ER, 25 match _ O_ ER, and 29 are consistent with _ A_ ER.

These sets of words are challenging not only for the human solver but also, under some circumstances, for Wordling algorithms. In Program 1, choose the word list “Candidates only” and then try solving the puzzle for target words such as FOYER, WASTE, or VAUNT. Depending on your starter word, you are likely to see the program struggle through several guesses, and it may fail to find the answer within six tries.

The “Candidates only” setting requires the program to choose each guess from the set of words that have not yet been excluded from consideration as the possible target. For example, if feedback from an earlier guess has revealed that the target ends in T and has an I, then every subsequent guess must end in T and have an I. (Restricting guesses to candidates only is similar to the Wordle app’s “hard mode” setting, but a little stricter.)

Compelling the player to choose guesses that might be winners doesn’t seem like much of a hardship or handicap. However, trying to score a goal with every play is seldom the best policy. Other guesses, although they can’t possibly be winners, may yield more information.

An experienced and wordly-wise human Wordler, on seeing the feedback , would know better than to play for an immediate win. The prudent strategy is play a word that promises to reveal the identity of the one missing letter. Here’s an example of how that works.

At left the second-guess FLOWN detects the presence of a W, which means the target can only be WIGHT. In the middle, FLOWN reveals only absences, but the further guess MARSH finds an S, which implies the target must be SIGHT. At right, the second and third guesses have managed to eliminate F, L, W, N, M, R, and S as initial letters, and all that’s left is the T of TIGHT.

This virtuoso performance is not the work of some International Grandmaster of Wordling. It is produced by the max-entropy algorithm, when the program is allowed a wider choice of potential guesses. The standard-deviation and max-scatter algorithms yield identical results. There is no special logic built into any of these programs for deciding when to play to win and when to hunker down and gather more information. It all comes out of the Zip code spectrum: FLOWN and MARSH are the words that maximize \(H\) and \(\chi\), and that minimize \(\sigma\). And yet, when you watch the game unfold, it looks mysterious or magical.

The cautious strategy of accumulating intelligence before committing to a line of play yields better results on average, but it comes with a price. All of the best algorithms achieve their strong scores by reducing the number of games that linger for five or six rounds of guessing. However, those algorithms also reduce the number of two-guess games, an effect that has to be counted as collateral damage. Two-guess triumphs make up less than 3 percent of the games played by the Zip code–based programs. Contrast that with the letter-frequency algorithm: In most respects it is quite mediocre, but it wins more than 6 percent of its games in two guesses. And among my personal games, 7 percent are two-guess victories. (I don’t say this to brag; what it suggests is that my style of play is a tad reckless.)

The Zip code–based programs would have even fewer two-guess wins without a heuristic that improves performance in a small fraction of cases. Whenever the list of candidate target words has dwindled down to a length of two, there’s no point in seeking more information to distinguish between the two remaining words. Suppose that after the first guess you’re left with the words SWEAT and SWEPT as the only candidates. For your second guess you could play a word such as ADEPT, where either the A or the P would light up to indicate which candidate is the target. You would then have a guaranteed win in three rounds. But if you simply played one of the candidate words as your second guess, you would have a 50 percent chance of winning in two rounds, and otherwise would finish in three, for an average score of 2.5.

This heuristic is already implemented in the programs discussed above. It makes a noticeable difference. Removing it from the max-entropy program drops the number of two-guess games from 51 down to 35 (out of 2,315).

Can we go further with this idea? Suppose there are three candidates left as you’re preparing for the second round of guessing. The cautious, information-gathering strategy would bring consistent victory on the third guess. Playing one of the candidates leads to a game that lasts for two, three, or four turns, each with probability one-third, so the average is again three guesses. The choice appears to be neutral. In practice, playing one of the candidates brings a tiny improvement in average score—too tiny to be worth the bother.

Another optimization says you should always pick one of the remaining candidates for your sixth guess. Gathering additional information is pointless in this circumstance, because you’ll never have a chance to use it. However, the better algorithms almost never reach the sixth guess, so this measure has no payoff in practice.

Apart from minor tricks and tweaks like these, is there any prospect of building significantly better Wordling programs? I have no doubt that improvement is possible, even though all my own attempts to get better results have failed miserably.

Getting to an average performance of 3.5 guesses per game seems fairly easy; getting much beyond that level may require new ideas. My impression is that existing methods work well for choosing the first guess and perhaps the second, but are less effective in closing out the endgame. When the number of candidates is small, the Zip code–based algorithms cannot identify a single best next guess; they merely divide the possibilities into a few large classes of better and worse guesses. We need finer means of discrimination. We need tiebreakers.

I’ll briefly mention two of my failed experiments. I thought I would try going beyond the Zip code analysis and computing for each combination of a guess word and a potential target word how much the choice would shrink the list of candidates. After all, the point of the game is to shrink that list down to a single word. But the plague of multitudinous ties afflicts this algorithm too. Besides, it’s computationally costly.

Another idea was to bias the ranking of the Zip code spectra, favoring codes that have more gold and green letter tiles, on the hypothesis that we learn more when a letter is present or correct. The hypothesis is disproved! Even tiny amounts of bias are detrimental.

My focus has been on reducing the average number of guesses, but maybe there are other goals worth pursuing. For example, can we devise an algorithm that will solve every Wordle with no more than four guesses? It’s not such a distant prospect. Already there are algorithms that exceed four guesses only in about 2 percent of the cases.

Perhaps progress will come from another quarter. I’ve been expecting someone to put one of the big machine-learning systems to work on Wordle. All I’ve seen so far is a small-scale study, based on the technique of reinforcement learning, done by Benton J. Anderson and Jesse G. Meyer. Their results are not impressive, and I am led to wonder if there’s something about the problem that thwarts learning techniques just as it does other algorithmic approaches.

Wordle falls into the class of combinatorial word games. All you need to know is how letters go together to make a word; meaning is beside the point. Most games of this kind are highly susceptible to computational *force majeure*. A few lines of code and a big word list will find exhaustive solutions in milliseconds. For example, another New York *Times* game called Spelling Bee asks you to make words out of seven given letters, with one special letter required to appear in every word. I’m not very good at Spelling Bee, but my computer is an ace. The same code would solve the Jumble puzzles on the back pages of the newspaper. With a little more effort the program could handle Lewis Carroll’s Word Links (better known today as Don Knuth’s Word Ladders). And it’s a spiffy tool for cheating at Scrabble.

In this respect Wordle is different. One can easily write a program that plays a competent game. Even a program that chooses words at random can turn in a respectable score. But this level of proficiency is nothing like the situation with Spelling Bee or Jumble, where the program utterly annihilates all competition, leaving no shred of the game where the human player could cling to claims of supremacy. In Wordle, every now and then I beat my own program. How can that happen, in this day and age?

The answer might be as simple and boring as computational complexity. If I want my program to win, I’ll have to invest more CPU cycles. Or there might be a super-clever Wordle-wrangling algorithm, and I’ve just been too dumb to find it. Then again, there might be something about Wordle that sets it apart from other combinatorial word games. That would be interesting.

The charming story of Wordle’s creation was told last January in the New York *Times*. “Wordle Is a Love Story” read the headline. Josh Wardle, a software developer formerly at Reddit, created the game as a gift to his partner, Palak Shah, a fan of word games. The *Times* story, written by Daniel Victor, marvelled at the noncommercial purity of the website: “There are no ads or flashing banners; no windows pop up or ask for money.” Three weeks later the *Times* bought the game, commenting in its own pages, “The purchase . . . reflects the growing importance of games, like crosswords and Spelling Bee, in the company’s quest to increase digital subscriptions to 10 million by 2025.” So much for love stories.

As far as I can tell, the new owners have not fiddled with the rules of the game, but there have been a few revisions to the word lists. Here’s a summary based on changes observed between February and May.

Six words were removed from the list of common words (a.k.a. target words), later added to the list of arcane words, then later still removed from that list as well, so that they are no longer valid as either a guess or a target:

AGORA, PUPAL, LYNCH, FIBRE, SLAVE, WENCH

Twenty-two words were moved from various positions in the common words list to the end of that list (effectively delaying their appearance as Wordle-of-the-day until sometime in 2027):

BOBBY, ECLAT, FELLA, GAILY, HARRY, HASTY, HYDRO,

LIEGE, OCTAL, OMBRE, PAYER, SOOTH, UNSET, UNLIT,

VOMIT, FANNY, FETUS, BUTCH, STALK, FLACK, WIDOW,

AUGUR

Two words were moved forward from near the end of the common list to a higher position where they replaced FETUS and BUTCH:

SHINE, GECKO

Two words were removed from the arcane list and not replaced:

KORAN, QURAN

Almost all the changes to the common list affect words that would have been played at some point in 2022 if they had been left in place. I expect further purges when the editors get around to vetting the rest of the list.

When I first started playing Wordle, I had a simple notion of what the tile colors meant. If a tile was colored green, then that letter of the guess word appeared in the same position in the target word. A gold tile meant the letter would be found elsewhere in the target word. A gray tile indicated the letter was entirely absent from the target word. For my first Wordler program I wrote a procedure implementing this scheme, although its output consisted of numbers rather than colors: green = 2, gold = 1, gray = 0.

```
function scoreGuess(guess, target)
score = [0, 0, 0, 0, 0]
for i in 1:5
if guess[i] == target[i]
score[i] = 2
elseif occursin(guess[i], target)
score[i] = 1
end
end
return score
end
```

This rule and its computer implementation work correctly as long as we never apply them to words with repeated letters. But suppose the target word is MODEM and you play the guess MUDDY. Following the rule above, the Umpire would offer this feedback: . Note the gold coloring of the second D. It’s the correct marking according to the stated rule, because the target word has a D not in the fourth position. But that D in MODEM is already “spoken for”; it is matched with the green D in the middle of MUDDY. The gold coloring of the second D could be misleading, suggesting that there’s another D in the target word.

The Wordle app would color the second D gray, not gold: . The rule, apparently, is that each letter in the target word can be matched with only one letter in the guess word. Green matches take precedence.

There remains some uncertainty about which letter gets a gold score when there are multiple options. If MUDDY is the target word and ADDED is the guess word, we know that middle D will be colored green, but which of the other two Ds in ADDED gets a gold tile? I have not been able to verify how the Wordle app handles this situation, but my program assigns priority from left to right: .

This minor amendment to the rules brings a considerable cost in complexity to the code. We need an extra data structure (the array `tagged`

) and an extra loop. The first loop (with index `i`

) finds and marks all the green matches; the second loop (indices `j`

and `k`

) identifies gold tiles that have not been preempted by earlier green or gold matches.

```
function scoreGuess(guess, target)
score = [0, 0, 0, 0, 0]
tagged = [0, 0, 0, 0, 0]
for i in 1:5
if guess[i] == target[i]
score[i] = 2
tagged[i] = 2
end
end
for j in 1:5
for k in 1:5
if guess[j] == target[k] && score[j] == 0 && tagged[k] == 0
score[j] = 1
tagged[k] = 1
end
end
end
return score
end
```

It is this element of the scoring rules that forbids Zip codes such as and its permutations in Figure 6. Under the naive rules, the guess STALL played against the target STALE would have been scored . The refined rule renders it as .

Let’s think about a game that’s simpler than Wordle, though doubtless less fun to play. You are given a deck of 64 index cards, each with a single word written on it; one of the words, known only to the Umpire, is the target. Your instructions are to “cut” the deck, dividing it into two heaps. Then the all-seeing Umpire will tell you which heap includes the target word. For the next round of play, you set aside the losing heap, and divide the winning heap into two parts; again the Umpire indicates which pile holds the winning word. The process continues until the heap approved by the Umpire consists of a single card, which must necessarily be the target. The question is: How many rounds of play will it take to reach this decisive state?

If you always divide the remaining stack of cards into two *equal* heaps, this question is easy to answer. On each turn, the number of cards and words remaining in contention is cut in half: from 64 to 32, then on down to 16, 8, 4, 2, 1. It takes six halvings to resolve all uncertainty about the identity of the target. This bisection algorithm is a central element of information theory, where the fundamental unit of measure, the *bit*, is defined as the amount of information needed to choose between two equally likely alternatives. Here the choices are the two heaps of equal size, which have the same probability of holding the target word. When the Umpire points to one heap or the other, that signal provides exactly one bit of information. To go all the way from 64 equally likely candidates to one identified target takes six rounds of guessing, and six bits of information.

Bisection is known to be an optimal strategy for many problem-solving tasks, but the source of its strength is sometimes misunderstood. What matters most is not that we split the deck into *two* parts but that we divide it into *equal* subsets. Suppose we cut a pack of 64 cards into four equal heaps rather than two. When the Umpire points to the pile that includes the target, we get twice as much information. The search space has been reduced by a factor of four, from 64 cards to 16. We still need to acquire a total of six bits to solve the problem, but because we are getting two bits per round, we can do it in three splittings rather than six. In other words, we have a tradeoff between more simple decisions and fewer complex decisions.

Figure 21 illustrates the nature of this tradeoff by viewing a decision process as traversing a tree from the root node (at the top) to one of 16 leaf nodes (at the bottom). For the binary tree on the left, finding your way from the root to a leaf requires making four decisions, in each case choosing one of two paths. In the quaternary tree on the right, only two decisions are needed, but there are four options at each level. Assuming a four-way choice “costs” twice as much as a two-way choice, the information content is the same in both cases: four bits.

But what if we decide to split the deck unevenly, producing a larger and a smaller heap? For example, a one-quarter/three-quarter division would yield a pile of 16 cards on the left and 48 cards on the right. If the target happens to lie in the smaller heap, we are better off than we would be with an even split: We’ve gained two bits of information instead of one, since the search space has shrunk from 64 cards to 16. However, the probability of this outcome is only 1/4 rather than 1/2, and so our expected gain is only one bit. When the target card is in the larger pile, we acquire less than one bit of information, since the search space has fallen from 64 cards only as far as 48, and the probability of this event is 3/4. Averaging across the two possible outcomes, the loss outweighs the gain.

Figure 22 shows the information budget for every possible cut point in a deck of 64 cards. The red curves labeled *left* and *right* show the number of bits obtained from each of the two piles as their size varies. (The *x* axis labels the size of the left pile; wherever the left pile has \(n\) cards, the right pile has \(64 - n\).) The overarching green curve is the sum of the left and right values. Note that the green curve has its peak in the center, where the deck has been split into two equal subsets of 32 cards each. This is the optimum strategy.

I have made this game go smoothly by choosing 64 as the number of cards in the deck. Because 64 is a power of 2, you can keep dividing by 2 and never hit an odd number until you come to 1. With decks of other sizes the situation gets messy; you can’t always split the deck into two equal parts. Nevertheless, the same principles apply, with some compromises and approximations. Suppose we have a deck of 2,315 cards, each inscribed with one of the Wordle common words. (The deck would be about two feet high.) We repeatedly split it as close to the middle as possible, allowing an extra card in one pile or the other when necessary. Eleven bisections would be enough to distinguish one card out of \(2^{11} = 2{,}048\), which falls a little short of the goal. With 12 bisections we could find the target among \(2^{12} = 4{,}096\) cards, which is more than we need. There’s a mathematical function that interpolates between these fenceposts: the base-2 logarithm. Specifically, \(\log_2 2{,}315\) is about 11.18, which means the amount of information we need to solve a Wordle puzzle is 11.18 bits. (Note that \(2^{11.18} \approx 2315\). Of the 2,315 positions where the target card might lie within the deck, 1,783 are resolved in 11 bisections, and 532 require a 12th cut.)

The foundational document of information theory, Claude Shannon’s 1948 paper “A Mathematical Theory of Communication,” gives the entropy equation is this form:

\[H = -\sum_{i} p_i \log_2 p_i .\]

The equation’s pedigree goes back further, to Josiah Willard Gibbs and Ludwig Boltzmann, who introduced it in a different context, the branch of physics called statistical mechanics.

Over the years I’ve run into this equation many times, but I do not count it among my close friends. Whenever I come upon it, I have to stop and think carefully about what the equation means and how it works. The variable \(p_i\) represents a probability. We are asked to multiply each \(p_i\) by the logarithm of \(p_i\), then sum up the products and negate the result. Why would one want to do such a thing? What does it accomplish? How do these operations reveal something about entropy and information?

When I look at the righthand side of the equation, I see an expression of the general form \(n \log n\), which is a familiar trope in several areas of mathematics and computer science. It generally denotes a function that grows faster than linear but slower than quadratic. For example, sorting \(n\) items might require \(n \log n\) comparison operations. (For any \(n\) greater than 2, \(n \log_2 n\) lies between \(n\) and \(n^2\).)

That’s *not* what’s going on here. Because \(p\) represents a probability, it must lie in the range \(0 \le p \le 1\). The logarithm of a number in this range is negative or at most zero. This gives \(p \log p\) a different complexion. It also draws attention to that weird minus sign hanging out in front of the summation symbol.

I believe another form of the equation is easier to understand, even though the transformation introduces more bristly mathematical notation. Let’s begin by moving the minus sign from outside to inside the summation, attaching it to the logarithmic factor:

\[H = \sum_{i} p_i (-\log_2 p_i) .\]

The rules for logarithms and exponents tell us that \(-\log x = \log x^{-1}\), and that \(x^{-1} = 1/x\). We can therefore rewrite the equation again as

\[H = \sum_{i} p_i \log_2 \frac{1}{p_i} .\]

Now the nature of the beast becomes a little clearer.

We can now bring in some of the particulars of the Wordle problem. The probability \(p_i\) is in fact the probability that the target word is found in Zip code \(i\).

\[H = \sum_{i} \frac{n_i}{m} \log_2 \frac{m}{n_i} .\]

And now for the final transformation. The laws of logarithms state that \(\log x/y\) is equal to \(\log x - \log y\), so we can rewrite the Shannon equation as

\[H = \sum_{i} \frac{n_i}{m} (\log_2 m - \log_2 n_i) .\]

In this form the equation is easy to relate to the problem of solving a Wordle puzzle. The quantity \(\log_2 m\) is the total amount of information (measured in bits) that we need to acquire in order to identify the target word; \(\log_2 n_i\) is the amount of information we will still need to ferret out if Zip code \(i\) turns out the hold the target word. The difference of these two quantities is what we gain if the target is in Zip code \(i\). The coefficient \(n_i / m\) is the probability of this event.

One further emendation is needed. The logarithm function is undefined at 0, as \(\log x\) diverges toward negative infinity as \(x\) approaches 0 from above. Thus we have to exclude from the summation all terms with \(n_i = 0\). We can also fill in the limits of the index \(i\).

\[H = \sum_{\substack{i = 1\\n_i \ne 0}}^{242} \frac{n_i}{m} (\log_2 m - \log_2 n_i) \]

Shannon’s discussion of the entropy equation is oddly diffident. He introduces it by listing a few assumptions that the \(H\) function should satisfy, and then asserting as a theorem that \(H = -\Sigma p \log p\) is the only equation meeting these requirements. In an appendix to the 1948 paper he proves the theorem. But he also goes on to write, “This theorem, and the assumptions required for its proof, are in no way necessary for the present theory [*i.e.*, information theory]. It is given chiefly to lend a certain plausibility to some of our later definitions.” I’m not sure what to make of Shannon’s cavalier dismissal of his own theorem, but in the case of Wordle analysis he seems to be right. Other measures of dispersion work just as well as the entropy function.

Babylonian accountants and land surveyors did their arithmetic in base 60, presumably because sexagesimal numbers help with wrangling fractions. When you organize things in groups of 60, you can divide them into halves, thirds, fourths, fifths, sixths, tenths, twelfths, fifteenths, twentieths, thirtieths, and sixtieths. No smaller number has as many divisors, a fact that puts 60 in an elite class of “highly composite numbers.” (The term and the definition were introduced by Srinivasan Ramanujan in 1915.)

There’s something else about 60 that I never noticed until a few weeks ago—although the Babylonians might well have known it, and Ramanujan surely did. The number 60, with its extravagant wealth of divisors, sits wedged between two other numbers that have no divisors at all except for 1 and themselves. Both 59 and 61 are primes. Such pairs of primes, separated by a single intervening integer, are known as twin primes. Other examples are (5, 7), (29, 31), and (1949, 1951). Over the years twin primes have gotten a great deal of attention from number theorists. Less has been said about the number in the middle—the interloper that keeps the twins apart. At the risk of sounding slightly twee, I’m going to call such middle numbers *twin tweens*, or more briefly just *tweens*.

Is it just a fluke that a number lying between two primes is outstandingly *un*prime? Is 60 unusual in this respect, or is there a pattern here, common to all twin primes and their twin tweens? One can imagine some sort of fairness principle at work: If \(n\) is flanked by divisor-poor neighbors, it must have lots of divisors to compensate, to balance things out. Perhaps every pair of twin primes forms a chicken salad sandwich, with two solid slabs of prime bread surrounding a squishy filling that gets chopped up into many small pieces.

As a quick check on this hypothesis, let’s plot the number of divisors \(d(n)\) for each integer in the range from \(n = 1\) to \(n = 75\):

In Figure 1 twin primes are marked by light blue dots, and their associated tweens are dark blue. Highly composite numbers—those \(n\) that have more divisors than any smaller \(n\)—are distinguished by golden haloes.

The interval 1–75 is a very small sample of the natural numbers, and an unusual one, since twin primes are abundant among small integers but become quite rare farther out along the number line. For a broader view of the matter, consider the number of divisors for all positive integers up to \(n = 10^8\).

Figure 2 plots \(d(n)\) over the same range, breaking up the sequence of 100 million numbers into 500 blocks of size 200,000, and taking the average value of \(d(n)\) within each block.

A glance at the graph leaves no doubt that numbers living next door to primes have many more divisors, on average, than numbers without prime neighbors. It’s as if the primes were heaving all their divisors over the fence into the neighbor’s yard. Or maybe it’s the twin tweens who are the offenders here, vampirishly sucking all the divisors out of nearby primes.

Allow me to suggest a less-fanciful explanation. All primes (with one notorious exception) are odd numbers, which means that all nearest neighbors of primes (again with one exception) are even. In other words, the neighbors of primes have 2 as a divisor, which gives them an immediate head start in the race to accumulate divisors. Twin tweens have a further advantage: All of them (with one exception) are divisible by 3 as well as by 2. Why? Among any three consecutive integers, one of them must be a multiple of 3, and it can’t be either of the primes, so it must be the tween.

Being divisible by both 2 and 3, a twin tween is also divisible by 6. Any other prime factors of the tween combine with 2 and 3 to produce still more divisors. For example, a number divisible by 2, 3, and 5 is also divisible by 10, 15, and 30.

However, a closer look at Figure 3 gives reason for caution. In the graph the mean \(d(n)\) for numbers divisible by 6 is about 43, but we already know that for tweens—for the subset of numbers divisible by 6 that happen to live between twin primes—\(d(n)\) is greater than 51. That further enhancement argues that nearby primes do, after all, have some influence on divisor abundance.

Further evidence comes from another plot of \(d(n)\) for numbers with and without prime neighbors, but this time limited to integers divisible by 6. Thus all members of the sample population have the same “head start.” Figure 4 presents the results of this experiment.

If prime neighbors had no effect (other than ensuring divisibility by 6), the blue, green, and red curves would all trace the same trajectory, but they do not. Although the tweens’ lead in the divisor race is narrowed somewhat, it is certainly not abolished. Numbers with two prime neighbors have about 20 percent more divisors than the overall average for numbers divisible by 6. Numbers with one prime neighbor are also slightly above average. Thus factors of 2 and 3 can’t be the whole story.

Here’s a hand-wavy attempt to explain what might be going on. In the same way that any three consecutive integers must include one that’s a multiple of 3, any five consecutive integers must include a multiple of 5. If you choose an integer \(n\) at random, you can be certain that exactly one member of the set \(\{n - 2, n - 1, n, n + 1, n + 2\}\) is divisible by 5. Since \(n\) was chosen randomly, all members of the set are equally likely to assume this role, and so you can plausibly say that 5 divides \(n\) with probability 1/5.

But suppose \(n\) is a twin tween. Then \(n - 1\) and \(n + 1\) are known to be prime, and neither of them can be a multiple of 5. You must therefore redistribute the probability over the remaining three members of the set. Now it seems that \(n\) is divisible by 5 with probability 1/3. You can make a similar argument about divisibility by 7, or 11, or any other prime. In each case the probability is enhanced by the presence of nearby primes.

The same argument works just as well if you turn it upside down. Knowing that \(n\) is even tells you that \(n - 1\) and \(n + 1\) are odd. If \(n\) is also divisible by 3, you know that \(n - 1\) and \(n + 1\) do not have 3 as a factor. Likewise with 5, 7, and so on. Thus finding an abundance of divisors in \(n\) raises the probability that \(n\)’s neighbors are prime.

Does this scheme make sense? There’s room for more than a sliver of doubt. Probability has nothing to do with the distribution of divisors among the integers. The process that determines divisibility is as simple as counting, and there’s nothing random about it. Imagine you are dealing cards to players seated at a very long table, their chairs numbered in sequence from 1 to infinity. First you deal a 1 card to every player. Then, starting with player 2, you deal a 2 card to every second player. Then a 3 card to player 3 and to every third player thereafter, and so on. When you finish (*if* you finish!), each player holds cards for all the divisors of his or her chair number, and no other cards.

This card-dealing routine sounds to me like a reasonable description of how integers are constructed. Adding a probabilistic element modifies the algorithm in a crucial way. As you are dealing out the divisors, every now and then a player refuses to accept a card, saying “Sorry, I’m prime; please give it to one of my neighbors.” You then select a recipient at random from the set of neighbors who lie within a suitable range.

Building a number system by randomly handing out divisors like raffle tickets might make for an amusing exercise, but it will not produce the numbers we all know and love. Primes do not wave off the dealer of divisors; on the contrary, a prime is prime because none of the cards between 1 and \(n\) land at its position. And integers divisible by 5 are not scattered along the number line according to some local probability distribution; they occur with absolute regularity at every fifth position. Introducing probability in this context seems misleading and unhelpful.

And yet… And yet…! It works.

Figure 5 shows the proportion of all \(n \le 10^8\) that are divisible by 5, classified according to the number of primes adjacent to \(n\). The overall average is 1/5, as it must be. But among twin tweens, with two prime neighbors, the proportion is close to 1/3, as the probabilistic model predicts. And about 1/4 of the numbers with a single prime neighbor are multiples of 5, which again is in line with predictions of the probabilistic model. And note that values of \(n\) with no prime neighbors have a below-average fraction of multiples of 5. In one sense this fact is unsurprising and indeed inescapable: If the average is fixed and one subgroup has an excess, then the complement of that subgroup must have a deficit. Nevertheless, it seems strange. How can an absence of nearby primes depress the density of multiples of 5 below the global average? After all, we know that multiples of 5 invariably come along at every fifth integer.

In presenting these notes on the quirks of tweens, I don’t mean to suggest there is some deep flaw or paradox in the theory of numbers. The foundations of arithmetic will not crumble because I’ve encountered more 5s than I expected in prime-rich segments of the number line. No numbers have wandered away from their proper places in the sequence of integers; we don’t have to track them down and put them back where they belong. What needs adjustment is my understanding of their distribution. In other words, the question is not so much “What’s going on?” but “What’s the right way to think about this?”

I know several wrong ways. The idea that twin primes repel divisors and tweens attract them is a just-so story, like the one about the crocodile tugging on the elephant’s nose. It might be acceptable as a metaphor, but not as a mechanism. There are no force fields acting between integers. Numbers cannot sense the properties of their neighbors. Nor do they have personalities; they are not acquisitive or abstemious, gregarious or reclusive.

The probabilistic formulation seems better than the just-so story in that it avoids explicit mention of causal links between numbers. But that idea still lurks beneath the surface. What does it mean to say “The presence of prime neighbors increases a number’s chance of being divisible by 5”? Surely not that the prime is somehow influencing the outcome of a coin flip or the spin of a roulette wheel. The statement makes sense only as an empirical, statistical observation: In a survey of many integers, more of those near primes are found to have 5 as a divisor than those far from primes. This is a true statement, but it doesn’t tell us why it’s true. (And there’s no assurance the observation holds for *all* numbers.)

Probabilistic reasoning is not a novelty in number theory. In 1936 Harald Cramér wrote:

With respect to the ordinary prime numbers, it is well known that, roughly speaking, we may say that the chance that a given integer \(n\) should be a prime is approximately \(1 / \log n\).

Cramér went on to build a whole probabalistic model of the primes, ignoring all questions of divisibility and simply declaring each integer to be prime or composite based on the flip of a coin (biased according to the \(1 / \log n\) probability). In some respects the model works amazingly well. As Figure 6 shows, it not only matches the overall trend in the distribution of primes, but it also gives a pretty good estimate of the prevalence of twin primes.

However, much else is missing from Cramér’s random model. In particular, it totally misses the unusual properties of twin tweens. In the region of the number line shown in Figure 6, from 1 to \(10^7\), true tweens have an average of 44 divisors. The Cramér tweens are just a random sampling of ordinary integers, with about 16 divisors on average.

Fundamentally, the distribution of twin tweens is inextricably tangled up with the distribution of twin primes. You can’t have one without the other. And that’s not a helpful development, because the distribution of primes (twin and otherwise) is one of the deepest enigmas in modern mathematics.

Throughout these musings I have characterized the distinctive properties of tweens by counting their divisors. There are other ways of approaching the problem that yield similar results. For example, you can calculate the sum of the divisors, \(\sigma(n)\) rather than the count \(d(n)\), a technique that leads to the classical notions of abundant, perfect, and deficient numbers. A number \(n\) is abundant if \(\sigma(n) > 2n\), perfect if \(\sigma(n) = 2n\) and deficient if \(\sigma(n) \lt 2n\). When I wrote a program to sort tweens into these three categories, I was surprised to discover that apart from 4 and 6, every twin tween is an abundant number. Such a definitive result seemed remarkable and important. But then I learned that *every* number divisible by 6, other than 6 itself, is abundant.

Another approach is to count the prime factors of \(n\), rather than the divisors. These two quantities are highly correlated, although the number of divisors is not simply a function of the number of factors; it also depends on the diversity of the factors.

As Figure 7 shows, counting primes tells a story that’s much the same as counting divisors. A typical integer in the range up to \(10^8\) has about four factors, whereas a typical tween in the same range has more than six.

We could also look at the size of \(n\)’s largest prime factor, \(f_{\max}(n)\), which is connected to the concept of a smooth number. A number is smooth if all of its prime factors are smaller than some stated bound, which might be a fixed constant or a function of \(n\), such as \(\sqrt n\). One measure of smoothness is \(\log n\, / \log f_{\max}(n)\). Computations show that by this definition tweens are smoother than average: The ratio of logs is about 2.0 for the tweens and about 1.7 for all numbers.

One more miscellaneous fact: No tween except 4 is a perfect square. Proof: Suppose \(n = m^2\) is a tween. Then \(n - 1 = m^2 - 1\), which has the factors \(m - 1\) and \(m + 1\), and so it cannot be prime. An extension of this argument rules out cubes and all higher perfect powers as twin tweens.

When I first began to ponder the tweens, I went looking to see what other people might have said on the subject. I didn’t find much. Although the literature on twin primes is immense, it focuses on the primes themselves, and especially on the question of whether there are infinitely many twins—a conjecture that’s been pending for 170 years. The numbers sandwiched between the primes are seldom mentioned.

The many varieties of highly composite numbers also have an enthusiastic fan club, but I have found little discussion of their frequent occurrence as neighbors of primes.

Could it be that I’m the first person ever to notice the curious properties of twin tweens? No. I am past the age of entertaining such droll thoughts, even transiently. If I have not found any references, it’s doubtless because I’m not looking in the right places. (Pointers welcome.)

I did eventually find a handful of interesting articles and letters. The key to tracking them down, unsurprisingly, was the Online Encyclopedia of Integer Sequences, which more and more seems to function as the Master Index to Mathematics. I had turned there first, but the entry on the tween sequence, titled “average of twin prime pairs,” has only one reference, and it was not exactly a gold mine of enlightenment. It took me to a 1967 volume of *Eureka*, the journal of the Cambridge Archimedians. All I found there (on page 16) was a very brief problem, asking for the continuation of the sequence 4, 6, 12, 18, 30, 42,…

There the matter rested for a few weeks, but eventually I came back to the OEIS to follow cross-links to some related sequences. Under “highly composite numbers” I found a reference to “Prime Numbers Generated From Highly Composite Numbers,” by Benny Lim (*Parabola*, 2018). Lim looked at the neighbors of the first 1,000 highly composite numbers. At the upper end of this range, the numbers are very large \((10^{76})\) and primes are very rare—but they are not so rare among the nearest neighbors of highly composite numbers, Lim found.

Another cross reference took me off to sequence A002822, labeled “Numbers \(m\) such that \(6m-1\), \(6m+1\) are twin primes.” In other words, this is the set of numbers that, when multiplied by 6, yield the twin tweens. The first few terms are: 1, 2, 3, 5, 7, 10, 12, 17, 18, 23, 25, 30, 32, 33, 38. The OEIS entry includes a link to a 2011 paper by Francesca Balestrieri, which introduces an intriguing idea I have not yet fully absorbed. Balestrieri shows that \(6m + 1\) is composite if \(m\) can be expressed as \(6xy + x - y\) for some integers \(x\) and \(y\), and otherwise is prime. There’s a similar but slightly more complicated rule for \(6m - 1\). She then proceeds to prove the following theorem:

The Twin Prime Conjecture is true if, and only if, there exist infinitely many \(m \in N\) such that \(m \ne 6xy + x - y\) and \(m \ne 6xy + x + y\) and \(m \ne 6xy - x - y\), for all \(x, y \in N\).

Other citations took me to three papers by Antonie Dinculescu, dated 2012 to 2018, which explore related themes. But the most impressive documents were two letters written to Neil J. A. Sloane, the founder and prime mover of the OEIS. In 1984 the redoubtable Solomon W. Golomb wrote to point out several publications from the 1950s and 60s that mention the connection between \(6xy \pm x \pm y\) and twin primes. The earliest of these appearances was a problem in the *American Mathematical Monthly*, proposed and solved by Golomb himself. He was 17 when he made the discovery, and this was his first mathematical publication. To support his claim of priority, he offered a $100 reward to anyone who could find an earlier reference.

The second letter, from Matthew A. Myers of Spruce Pine, North Carolina, supplies two such prior references. One is a well-known history of number theory by L. E. Dickson, published in 1919. The other is an “Essai sur les nombres premiers” by Wolfgang Ludwig Krafft, a colleague of Euler’s at the St. Petersburg Academy of Sciences. The essay was read to the academy in 1798 and published in Volume 12 of the *Nova Acta Academiae Scientiarum Imperialis Petropolitanae*. It deals at length with the \(6xy \pm x \pm y\) matter. This was 50 years before the concept of twin primes, and their conjectured infinite supply, was introduced by Alphonse de Polignac.

Myers reported these archeological findings in 2018. Sadly, Golomb had died two years earlier.

]]>Peaks and troughs, lumps and slumps, wave after wave of surge and retreat: I have been following the ups and downs of this curve, day by day, for a year and a half. The graph records the number of newly reported cases of Covid-19 in the United States for each day from 21 January 2020 through 20 July 2021. That’s 547 days, and also exactly 18 months. The faint, slender vertical bars in the background give the raw daily numbers; the bold blue line is a seven-day trailing average. (In other words, the case count for each day is averaged with the counts for the six preceding days.)

I struggle to understand the large-scale undulations of that graph. If you had asked me a few years ago what a major epidemic might look like, I would have mumbled something about exponential growth and decay, and I might have sketched a curve like this one:

My imaginary epidemic is so much simpler than the real thing! The number of daily infections goes up, and then it comes down again. It doesn’t bounce around like a nervous stock market. It doesn’t have seasonal booms and busts.

The graph tracing the actual incidence of the disease makes at least a dozen reversals of direction, along with various small-scale twitches and glitches. The big mountain in the middle has foothills on both sides, as well as some high alpine valleys between craggy peaks. I’m puzzled by all this structural embellishment. Is it mere noise—a product of random fluctuations—or is there some driving mechanism we ought to know about, some switch or dial that’s turning the infection process on and off every few months?

I have a few ideas about possible explanations, but I’m not so keen on any of them that I would try to persuade you they’re correct. However, I *do* hope to persuade you there’s something here that needs explaining.

Before going further, I want to acknowledge my sources. The data files I’m working with are curated by *The New York Times*, based on information collected from state and local health departments. Compiling the data is a big job; the *Times* lists more than 150 workers on the project. They need to reconcile the differing and continually shifting policies of the reporting agencies, and then figure out what to do when the incoming numbers look fishy. (Back in June, Florida had a day with –40,000 new cases.) The entire data archive, now about 2.3 gigabytes, is freely available on GitHub. Figure 1 in this article is modeled on a graph updated daily in the *Times*.

I must also make a few disclaimers. In noodling around with this data set I am not trying to forecast the course of the epidemic, or even to retrocast it—to develop a model accurate enough to reproduce details of timing and magnitude observed over the past year and a half. I’m certainly not offering medical or public-health advice. I’m just a puzzled person looking for simple mechanisms that might explain the overall shape of the incidence curve, and in particular the roller coaster pattern of recurring hills and valleys.

So far, four main waves of infection have washed over the U.S., with a fifth wave now beginning to look like a tsunami. Although the waves differ greatly in height, they seem to be coming at us with some regularity. Eyeballing Figure 1, I get the impression that the period from peak to peak is pretty consistent, at roughly four months.

Periodic oscillations in epidemic diseases have been noticed many times before. The classical example is measles in Great Britain, for which there are weekly records going back to the early 18th century. In 1917 John Brownlee studied the measles data with a form of Fourier analysis called the periodogram. He found that the strongest peak in the frequency spectrum came at a period of 97 weeks, reasonably consistent with the widespread observation that the disease reappears every second year. But Brownlee’s periodograms bristle with many lesser peaks, indicating that the measles rhythm is not a simple, uniform drumbeat.

The mechanism behind the oscillatory pattern in measles is easy to understand. The disease strikes children in the early school years, and the virus is so contagious that it can run through an entire community in a few weeks. Afterwards, another outbreak can’t take hold until a new cohort of children has reached the appropriate age. No such age dependence exists in Covid-19, and the much shorter period of recurrence suggests that some other mechanism must be driving the oscillations. Nevertheless, it seems worthwhile to try applying Fourier methods to the data in Figure 1.

The Fourier transform decomposes any curve representing a function of time into a sum of simple sine and cosine waves of various frequencies. In textbook examples, the algorithm works like magic. Take a wiggly curve like this one:

Feed it into the Fourier transformer, turn the crank, and out comes a graph that looks like this, showing the coefficients of various frequency components:

It would be illuminating to have such a succinct encoding of the Covid curve—a couple of numbers that explain its comings and goings. Alas, that’s not so easy. When I poured the Covid data into the Fourier machine, this is what came out:

More than a dozen coefficients have significant magnitude; some are positive and some are negative; no obvious pattern leaps off the page. This spectrum, like the simpler one in Figure 4, holds all the information needed to reconstruct its input. I confirmed that fact with a quick computational experiment. But looking at the jumble of coefficients doesn’t help me to understand the structure of the Covid curve. The Fourier-transformed version is even more baffling than the original.

One lesson to be drawn from this exercise is that the Fourier transform is indeed magic: If you want to make it work, you need to master the dark arts. I am no wizard in this department; as a matter of fact, most of my encounters with Fourier analysis have ended in tears and trauma. No doubt someone with higher skills could tease more insight from the numbers than I can. But I doubt that any degree of Fourier finesse will lead to some clear and concise description of the Covid curve. Even with \(200\) years of measles records, Brownlee wasn’t able to isolate a clear signal; with just a year and a half of Covid data, success is unlikely.

Yet my foray into the Fourier realm was not a complete waste of time. Applying the inverse Fourier transform to the first 13 coefficients (for wavenumbers 0 through 6) yields this set of curves:

It looks a mess, but the sum of these 13 sinusoidal waves yields quite a handsome, smoothed version of the Covid curve. In Figure 7 below, the pink area in the background shows the *Times* data, smoothed with the seven-day rolling average. The blue curve, much smoother still, is the waveform reconstructed from the 13 Fourier coefficients.

The reconstruction traces the outlines of all the large-scale features of the Covid curve, with serious errors only at the end points (which are always problematic in Fourier analysis). The Fourier curve also fails to reproduce the spiky triple peak atop the big surge from last winter, but I’m not sure that’s a defect.

Let’s take a closer look at that triple peak. The graph below is an expanded view of the two-month interval from 20 November 2020 through 20 January 2021. The light-colored bars in the background are raw data on new cases for each day; the dark blue line is the seven-day rolling average computed by the *Times*.

The peaks and valleys in this view are just as high and low as those in Figure 1; they look less dramatic only because the horizontal axis has been stretched ninefold. My focus is not on the peaks but on the troughs between them. (After all, there wouldn’t be three peaks if there weren’t two troughs to separate them.) Three data points marked by pink bars have case counts far lower than the surrounding days. Note the dates of those events. November 26 was Thanksgiving Day in the U.S. in 2020; December 25 is Christmas Day, and January 1 is New Years Day. It looks like the virus went on holiday, but of course it was actually the medical workers and public health officials who took a day off, so that many cases did not get recorded on those days.

There may be more to this story. Although the holidays show up on the chart as low points in the progress of the epidemic, they were very likely occasions of higher-than-normal contagion, because of family gatherings, religious services, revelry, and so on. (I commented on the Thanksgiving risk last fall.) Those “extra” infections would not show up in the statistics until several days later, along with the cases that went undiagnosed or unreported on the holidays themselves. Thus each dip appears deeper because it is followed by a surge.

All in all, it seems likely that the troughs creating the triple peak are a reporting anomaly, rather than a reflection of genuine changes in the viral transmission rate. Thus a curve that smooths them away may give a better account of what’s really going on in the population.

There’s another transformation—quite different from Fourier analysis—that might tell us something about the data. The time derivative of the Covid curve gives the rate of change in the infection rate—positive when the epidemic is surging, negative when it’s retreating. Because we’re working with a series of discrete values, computing the derivative is trivially easy: It’s just the series of differences between successive values.

The derivative of the raw data *(blue)* looks like a seismograph recording from a jumpy day along the San Andreas. The three big holiday anomalies—where case counts change by 100,000 per day—produce dramatic excursions. The smaller jagged waves that extend over most of the 18-month interval are probably connected with the seven-day cycle of data collection, which typically show case counts increasing through the work week and then falling off on the weekend.

The seven-day trailing average is designed to suppress that weekly cycle, and it also smooths over some larger fluctuations. The resulting curve *(red)* is not only less jittery but also has much lower amplitude. (I have expanded the vertical scale by a factor of two for clarity.)

Finally, the reconstituted curve built by summing 13 Fourier components yields a derivative curve *(green)* whose oscillations are mere ripples, even when stretched vertically by a factor of four.

The points where the derivative curves cross the zero line—going from positive to negative or vice versa—correspond to peaks or troughs in the underlying case-count curve. Each zero crossing marks a moment when the epidemic’s trend reversed direction, when a growing daily case load began to decline, or a falling one turned around and started gaining again. The blue raw-data curve has 255 zero crossings, and the red averaged curve has 122. Even the lesser figure implies that the infection trend is reversing course every four or five days, which is not plausible; most of those sign changes must result from noise in the data.

The silky smooth green curve has nine zero crossings, most of which seem to signal real changes in the course of the epidemic. I would like to understand what’s causing those events.

You catch a virus. (Sorry about that.) Some days later you infect a few others, who after a similar delay pass the gift on to still more people. This is the mechanism of exponential (or geometric) growth. With each link in the chain of transmission the number of new cases is multiplied by a factor of \(R\), which is the natural growth ratio of the epidemic—the average number of cases spawned by each infected individual. Starting with a single case at time \(t = 0\), the number of new infections at any later time \(t\) is \(R^t\). If \(R\) is greater than 1, even very slightly, the number of cases increases without limit; if \(R\) is less than 1, the epidemic fades away.

The average delay between when you become infected and when you infect others is known as the serial passage time, which I am going to abbreviate T_{SP} and take as the basic unit for measuring the duration of events in the epidemic. For Covid-19, one T_{SP} is probably about five days.

Exponential growth is famously unmanageable. If \(R = 2\), the case count doubles with every iteration: \(1, 2, 4, 8, 16, 32\dots\). It increases roughly a thousandfold after 10 T_{SP}, and a millionfold after 20 T_{SP}. The rate of increase becomes so steep that I can’t even graph it except on a logarithmic scale, where an exponential trajectory becomes a straight line.

What is the value of \(R\) for the SARS-CoV-2 virus? No one knows for sure. The number is difficult to measure, and it varies with time and place. Another number, \(R_0\), is often regarded as an intrinsic property of the virus itself, an indicator of how easily it passes from person to person. The Centers for Disease Control and Prevention (CDC) suggests that \(R_0\) for SARS-CoV-2 probably lies between 2.0 and 4.0, with a best guess of 2.5. That would make it catchier than influenza but less so than measles. However, the CDC has also published a report arguing that \(R_0\) is “easily misrepresented, misinterpreted, and misapplied.” I’ve certainly been confused by much of what I’ve read on the subject.

Whatever numerical value we assign to \(R\), if it’s greater than 1, it cannot possibly describe the complete course of an epidemic. As \(t\) increases, \(R^t\) will grow at an accelerating pace, and before you know it the predicted number of cases will exceed the global human population. For \(R = 2\), this absurdity arrives after about 33 T_{SP}, which is less than six months.

What we need is a mathematical model with a built-in limit to growth. As it happens, the best-known model in epidemiology features just such a mechanism. Introduced almost 100 years ago by W. O. Kermack and A. G. McKendrick of the Royal College of Physicians in Edinburgh, *removed*, acknowledging that recovery is not the only way an infection can end. But I don’t want to be grim today. Also note that I’m using a calligraphic font for \(\mathcal{S},\mathcal{I}\), and \(\mathcal{R}\) to avoid confusion between the growth rate \(R\) and the recovered group \(\mathcal{R}\).*susceptible* \((\mathcal{S})\), *infective* \((\mathcal{I})\), and *recovered* \((\mathcal{R})\). Initially (before a pathogen enters the population), everyone is of type \(\mathcal{S}\). Susceptibles who contract the virus become infectives—capable of transmitting the disease to other susceptibles. Then, after each infective’s illness has run its course, that person joins the recovered class. Having acquired immunity through infection, the recovereds will never be susceptible again.

A SIR epidemic can’t keep growing indefinitely for the same reason that a forest fire can’t keep burning after all the trees are reduced to ashes. At the beginning of an epidemic, when the entire population is susceptible, the case count can grow exponentially. But growth slows later, when each infective has a harder time finding susceptibles to infect. Kermack and McKendrick made the interesting discovery that the epidemic dies out before it has reached the entire population. That is, the last infective recovers before the last susceptible is infected, leaving a residual \(\mathcal{S}\) population that has never experienced the disease.

The SIR model itself has gone viral in the past few years. There are tutorials everywhere on the web, as well as scholarly articles and books. (I recommend *Epidemic Modelling: An Introduction*, by Daryl J. Daley and Joseph Gani. Or try *Mathematical Modelling of Zombies* if you’re feeling brave.) Most accounts of the SIR model, including the original by Kermack and McKendrick, are presented in terms of differential equations. I’m instead going to give a version with discrete time steps—\(\Delta t\) rather than \(dt\)—because I find it easier to explain and because it translates line for line into computer code. In the equations that follow, \(\mathcal{S}\), \(\mathcal{I}\), and \(\mathcal{R}\) are real numbers in the range \([0, 1]\), representing proportions of some fixed-size population.

\[\begin{align}

\Delta\mathcal{I} & = \beta \mathcal{I}\mathcal{S}\\[0.8ex]

\Delta\mathcal{R} & = \gamma \mathcal{I}\\[1.5ex]

\mathcal{S}_{t+\Delta t} & = \mathcal{S}_{t} - \Delta\mathcal{I}\\[0.8ex]

\mathcal{I}_{t+\Delta t} & = \mathcal{I}_{t} + \Delta\mathcal{I} - \Delta\mathcal{R}\\[0.8ex]

\mathcal{R}_{t+\Delta t} & = \mathcal{R}_{t} + \Delta\mathcal{R}\\[0.8ex]

\end{align}\]

The first equation, with \(\Delta\mathcal{I}\) on the left hand side, describes the actual contagion process—the recruitment of new infectives from the susceptible population. The number of new cases is proportional to the product of \(\mathcal{I}\) and \(\mathcal{S}\), since the only way to propagate the disease is to bring together someone who already has it with someone who can catch it. The constant of proportionality, \(\beta\), is a basic parameter of the model. It measures how often (per T_{SP}) an infective person encounters others closely enough to communicate the virus.

The second equation, for \(\Delta\mathcal{R}\), similarly describes recovery. For epidemiological purposes, you don’t have to be feeling tiptop again to be done with the disease; recovery is defined as the moment when you are no long capable of infecting other people. The model takes a simple approach to this idea, withdrawing a fixed fraction of the infectives in every time step. The fraction is given by the parameter \(\gamma\).

After the first two equations calculate the number of people who are changing their status in a given time step, the last three equations update the population segments accordingly. The susceptibles lose \(\Delta\mathcal{I}\) members; the infectives gain \(\Delta\mathcal{I}\) and lose \(\Delta\mathcal{R}\); the recovereds gain \(\Delta\mathcal{R}\). The total population \(\mathcal{S} + \mathcal{I} + \mathcal{R}\) remains constant throughout.

In

Here’s what happens when you put the model in motion. For this run I set \(\beta = 0.6\) and \(\gamma = 0.2\), which implies that \(\rho = 3.0\). Another number that needs to be specified is the initial proportion of infectives; I chose \(10^{-6}\), or in other words one in a million. The model ran for \(100\) T_{SP}, with a time step of \(\Delta t = 0.1\) T_{SP}; thus there were \(1{,}000\) iterations overall.

Let me call your attention to a few features of this graph. At the outset, nothing seems to happen for weeks and weeks, and then all of a sudden a huge blue wave rises up out of the calm waters. Starting from one case in a population of a million, it takes \(18\) T_{SP} to reach one case in a thousand, but just \(12\) more T_{SP} to reach one in \(10\).

Note that the population of infectives reaches a peak near where the susceptible and recovered curves cross—that is, where \(\mathcal{S} = \mathcal{R}\). This relationship holds true over a wide range of parameter values. That’s not surprising, because the whole epidemic process acts as a mechanism for converting susceptibles into recovereds, via a brief transit through the infective stage. But, as Kermack and McKendrick predicted, the conversion doesn’t quite go to completion. At the end, about \(6\) percent of the population remains in the susceptible category, and there are no infectives left to convert them. This is the condition called herd immunity, where the population of susceptibles is so diluted that most infectives recover before they can find someone to infect. It’s the end of the epidemic, though it comes only after \(90+\) percent of the people have gotten sick. (That’s not what I would call a victory over the virus.)

The \(\mathcal{I}\) class in the SIR model can be taken as a proxy for the tally of new cases tracked in the *Times* data. The two variables are not quite the same—infectives remain in the class \(\mathcal{I}\) until they recover, whereas new cases are counted only on the day they are reported—but they are very similar and roughly proportional to one another. And that brings me to the main point I want to make about the SIR model: In Figure 11 the blue curve for infectives looks nothing like the corresponding new-case tally in Figure 1. In the SIR model, the number of infectives starts near zero, rises steeply to a peak, and thereafter tapers gradually back to zero, never to rise again. It’s a one-hump camel. The roller coaster Covid curve is utterly different.

The detailed geometry of the \(\mathcal{I}\) curve depends on the values assigned to the parameters \(\beta\) and \(\gamma\). Changing those variables can make the curve longer or shorter, taller or flatter. But no choice of parameters will give the curve multiple ups and downs. There are no oscillatory solutions to these equations.

The SIR model strikes me as so plausible that it—or some variant of it—really *must* be a correct description of the natural course of an epidemic. But that doesn’t mean it can explain what’s going on right now with Covid-19. A key element of the model is saturation: the spread of the disease stalls when there are too few susceptibles left to catch it. That can’t be what caused the steep downturn in Covid incidence that began in January of this year, or the earlier slumps that began in April and July of 2020. We were nowhere near saturation during any of those events, and we still aren’t now. (For the moment I’m ignoring the effects of vaccination. I’ll take up that subject below.)

In Figure 11 there comes a dramatic triple point where each of the three categories constitutes about a third of the total population. If we projected that situation onto the U.S., we would have (in very round numbers) \(100\) million active infections, another \(100\) million people who have recovered from an earlier bout with the virus, and a third \(100\) million who have so far escaped (but most of whom will catch it in the coming weeks). That’s orders of magnitude beyond anything seen so far. The cumulative case count, which combines the \(\mathcal{I}\) and \(\mathcal{R}\) categories, is approaching \(37\) million, or \(11\) percent of the U.S. population. *y* axis spans the full U.S. population of 330 million, you get the flatline graph at right.)

If we are still on the early part of the curve, in the regime of rampant exponential growth, it’s easy to understand the surging accelerations we’ve seen in the worst moments of the epidemic. The hard part is explaining the repeated slowdowns in viral transmission that punctuate the Covid curve. In the SIR model, the turnaround comes when the virus begins to run out of victims, but that’s a one-time phenomenon, and we haven’t gotten there yet. What can account for the deep valleys in the *Times* Covid curve?

Among the ideas that immediately come to mind, one strong contender is feedback. We all have access to real-time information on the status of the epidemic. It comes from governmental agencies, from the news media, from idiots on Facebook, and from the personal communications of family, friends, and neighbors. Most of us, I think, respond appropriately to those messages, modulating our anti-contagion precautions according to the perceived severity of the threat. When it’s scary outside, we hunker down and mask up. When the risk abates, it’s party time again! I can easily imagine a scenario where such on-again, off-again measures would trigger oscillations in the incidence of the disease.

If this hypothesis turns out to be true, it is cause for both hope and frustration. Hope because the interludes of viral retreat suggest that our tools for fighting the epidemic must be reasonably effective. Frustration because the rebounds indicate we’re not deploying those tools as well as we could. Look again at the Covid curve of Figure 1, specifically at the steep downturn following the winter peak. In early February, the new case rate was dropping by 30,000 per week. Over the first three weeks of that month, the rate was cut in half. Whatever we were doing then, it was working brilliantly. If we had just continued on the same trajectory, the case count would have hit zero in early March. Instead, the downward slope flattened, and then turned upward again.

We had another chance in June. All through April and May, new cases had been falling steadily, from 65,000 to 17,000, a pace of about –800 cases a day. If we’d been able to sustain that rate for just three more weeks, we’d have crossed the finish line in late June. But again the trend reversed course, and by now we’re back up well above 100,000 cases a day.

Are these pointless ups and downs truly caused by feedback effects? I don’t know. I am particularly unsure about the “if only” part of the story—the idea that if only we’d kept the clamps on for just a few more weeks, the virus would have been eradicated, or nearly so. But it’s an idea to keep in mind.

Perhaps we could learn more by creating a feedback loop in the SIR model, and looking for oscillatory dynamics. Negative feedback is anything that acts to slow the infection rate when that rate is high, and to boost it when it’s low. Such a contrarian mechanism could be added to the model in several ways. Perhaps the simplest is a lockdown threshold: Whenever the number of infectives rises above some fixed limit, everyone goes into isolation; when the \(\mathcal{I}\) level falls below the threshold again, all cautions and restrictions are lifted. It’s an all-or-nothing rule, which makes it simple to implement. We need a constant to represent the threshold level, and a new factor (which I am naming \(\varphi\), for fear) in the equation for \(\Delta \mathcal{I}\):

\[\Delta\mathcal{I} = \beta \varphi \mathcal{I} \mathcal{S}\]

The \(\varphi\) factor is 1 whenever \(\mathcal{I}\) is below the threshold, and \(0\) when it rises above. The effect is to shut down all new infections as soon as the threshold is reached, and start them up again when the rate falls.

Does this scheme produce oscillations in the \(\mathcal{I}\) curve? Strictly speaking, the answer is yes, but you’d never guess it by looking at the graph.

The feedback loop serves as a control system, like a thermostat that switches the furnace off and on to maintain a set temperature. In this case, the feedback loop holds the infective population steady at the threshold level, which is set at 0.05. On close examination, it turns out that \(\mathcal{I}\) is oscillating around the threshold level, but with such a short period and tiny amplitude that the waves are invisible. The value bounces back and forth between 0.049 and 0.051.

To get macroscopic oscillations, we need more than feedback. The SIR output shown below comes from a model that combines feedback with a delay between measuring the state of the epidemic and acting on that information. Introducing such a delay is not the only way to make the model swing, but it’s certainly a plausible one. As a matter of fact, a model *without* any delay, in which a society responds instantly to every tiny twitch in the case count, seems wholly unrealistic.

The model of Figure 14 adopts the same parameters, \(\beta = 0.6\) and \(\gamma = 0.2\), as the version of Figure 13, as well as the same lockdown threshold \((0.05)\). It differs only in the timing of events. If the infective count climbs above the threshold at time \(t\), control measures do not take effect until \(t + 3\); in the meantime, infections continue to spread through the population. The delay and overshoot on the way up are matched by a delay and undershoot at the other end of the cycle, when lockdown continues for three T_{SP} after the threshold is crossed on the way down.

Given these specific parameters and settings, the model produces four cycles of diminishing amplitude and increasing wavelength. (No further cycles are possible because \(\mathcal{I}\) remains below the threshold.) Admittedly, those four spiky sawtooth peaks don’t look much like the humps in the Covid curve. If we’re going to seriously consider the feedback hypothesis, we’ll need stronger evidence than this. But the model is very crude; it could be refined and improved.

The fact is, I really want to believe that feedback could be a major component in the oscillatory dynamics of Covid-19. It would be comforting to know that our measures to combat the epidemic have had a powerful effect, and that we therefore have some degree of control over our fate. But I’m having a hard time keeping the faith. For one thing, I would note that our countermeasures have not always been on target. In the epidemic’s first wave, when the characteristics of the virus were largely unknown, the use of facemasks was discouraged (except by medical personnel), and there was a lot of emphasis on hand washing, gloves, and sanitizing surfaces. Not to mention drinking bleach. Those measures were probably not very effective in stopping the virus, but the wave receded anyway.

Another source of doubt is that wavelike fluctuations are not unique to Covid-19.

**Update:** Alina Glaubitz and Feng Fu of Dartmouth have applied a game-theoretical approach to generating oscillatory dynamics in a SIR model. Their work was published last November but I have only just learned about it from an article by Lina Sorg in *SIAM News*.

One detail of the SIR model troubles me. As formulated by Kermack and McKendrick, the model treats infection and recovery as symmetrical, mirror-image processes, both of them described by exponential functions. The exponential rule for infections makes biological sense. You can only get the virus via transmission from someone who already has it, so the number of new infections is proportional to the number of existing infections. But recovery is different; it’s not contagious. Although the duration of the illness may vary to some extent, there’s no reason to suppose it would depend on the number of other people who are sick at the same time.

In the model, a fixed fraction of the infectives, \(\gamma \mathcal{I}\), recover at every time step. _{SP}, more than \(10\) percent of the original group remain infectious.

The long tail of this distribution corresponds to illnesses that persist for many weeks. Such cases exist, but they are rare. According to the CDC, most Covid patients have no detectable “replication-competent virus” \(10\) days after the onset of symptoms. Even in the most severe cases, with immunocompromised patients, \(20\) days of infectivity seems to be the outer limit. *Theoretical Population Biology* Vol. 60, No. 1, pp. 59–71).

Modifying the model for a fixed period of infectivity is not difficult. We can keep track of the infectives with a data structure called a queue. Each new batch of newly recruited infectives goes into the tail of the queue, then advances one place with each time step. After \(m\) steps (where \(m\) is the duration of the illness), the batch reaches the head of the queue and joins the company of the recovered. Here is what happens when \(m = 3\) T_{SP}:

I chose \(3\) T_{SP} for this example because it is close to the median duration in the exponential distribution in Figure 11, and therefore ought to resemble the earlier result. And so it does, approximately. As in Figure 11, the peak in the infectives curve lies near where the susceptible and recovered curves cross. But the peak never grows quite as tall; and, for obvious reasons, it decays much faster. As a result, the epidemic ends with many more susceptibles untouched by the disease—more than 25 percent.

A disease duration of \(3\) T_{SP}, or about \(15\) days, is still well over the CDC estimates of the typical length. Shortening the queue to \(2\) T_{SP}, or about \(10\) days, transforms the outcome even more dramatically. Now the susceptible and recovered curves never cross, and almost \(70\) percent of the susceptible population remains uninfected when the epidemic peters out.

Figure 18 comes a little closer to describing the current Covid situation in the U.S. than the other models considered above. It’s not that the curves’ shape resembles that of the data, but the overall magnitude or intensity of the epidemic is closer to observed levels. Of the models presented so far, this is the first that reaches a natural limit without burning through most of the population. Maybe we’re on to something.

On the other hand, there are a couple of reasons for caution. First, with these parameters, the initial growth of the epidemic is extremely slow; it takes \(40\) or \(50\) T_{SP} before infections have a noticeable effect on the population. That’s well over six months. Second, we’re still dealing with a one-hump camel. Even though most of the population is untouched, the epidemic has run its course, and there will not be a second wave. Something important is still missing.

Before leaving this topic behind, I want to point out that the finite time span of a viral infection gives us a special point of leverage for controlling the spread of the disease. The viruses that proliferate in your body must find a new host within a week or two, or else they face extinction. Therefore, if we could completely isolate every individual in the country for just two or three weeks, the epidemic would be over. Admittedly, putting each and every one of us into solitary confinement is not feasible (or morally acceptable), but we could strive to come as close as possible, strongly discouraging all forms of person-to-person contact. Testing, tracing, and quarantines would deal with straggler cases. My point is that a very strict but brief lockdown could be both more effective and less disruptive than a loose one that goes on for months. Where other strategies aim to flatten the curve, this one attempts to break the chain.

When Covid emerged late in 2019, it was soon labeled a *pandemic*, signifying that it’s bigger than a mere epidemic, that it’s everywhere. But it’s not everywhere at once. Flareups have skipped around from region to region and country to country. Perhaps we should view the pandemic not as a single global event but as an ensemble of more localized outbreaks.

Suppose small clusters of infections erupt at random times, then run their course and subside. By chance, several geographically isolated clusters might be active over the same range of dates and add up to a big bump in the national case tally. Random fluctuations could also produce interludes of widespread calm, which would cause a dip in the national curve.

We can test this notion with a simple computational experiment, modeling a population divided into \(N\) clusters or communities. For each cluster a SIR model generates a curve giving the proportion of infectives as a function of time. The initiation time for each of these mini-epidemics is chosen randomly and independently. Summing the \(N\) curves gives the total case count for the country as a whole, again as a function of time.

Before scrolling down to look at the graphs generated by this process, you might make a guess about how the experiment will turn out. In particular, how will the shape of the national curve change as the number of local clusters increases?

If there’s just one cluster, then the national curve is obviously identical to the trajectory of the disease in that one place. With two clusters, there’s a good chance they will not overlap much, and so the national curve will probably have two humps, with a deep valley between them. With \(N = 3\) or \(4\), overlap becomes more of an issue, but the sum curve still seems likely to have \(N\) humps, perhaps with shallower depressions separating them. Before I saw the results, I made the following guess about the behavior of the sum as \(N\) continues increasing: The sum curve will always have approximately \(N\) peaks, I thought, but the height difference between peaks and troughs should get steadily smaller. Thus at large \(N\) the sum curve would have many tiny ripples, small enough that the overall curve would appear to be one broad, flat-topped hummock.

So much for my intuition. Here are two examples of sum curves generated by clusters of \(N\) mini-epidemics, one curve for \(N = 6\) and one for \(N = 50\). The histories for individual clusters are traced by fine red lines; the sums are blue. All the curves have been scaled so that the highest peak of the sum curve touches \(1.0\).

My *not* increase in proportion to \(N\). As a matter of fact, both of the sum curves in Figure 19 have four distinct peaks (possibly five in the example at right), even though the number of component curves contributing to the sum is only six in one case and is \(50\) in the other.

I have to confess that the two examples in Figure 19 were not chosen at random. I picked them because they looked good, and because they illustrated a point I wanted to make—namely that the number of peaks in the sum curve remains nearly constant, regardless of the value of \(N\). Figure 20 assembles a more representative sample, selected without deliberate bias but again showing that the number of peaks is not sensitive to \(N\), although the valleys separating those peaks get shallower as \(N\) grows.

The

The question is: Why \(4 \pm 1\)? Why do we keep seeing those particular numbers? And if \(N\), the number of components being summed, has little influence on this property of the sum curve, then what *does* govern it? I puzzled over these questions for some time before a helpful analogy occurred to me.

Suppose you have a bunch of sine waves, all at the same frequency \(f\) but with randomly assigned phases; that is, the waves all have the same shape, but they are shifted left or right along the \(x\) axis by random amounts. What would the sum of those waves look like? The answer is: another sine wave of frequency \(f\). This is a little fact that’s been known for ages (at least since Euler) and is not hard to prove, but it still comes as a bit of a shock every time I run into it. I believe the same kind of argument can explain the behavior of a sum of SIR curves, even though those curves are not sinusoidal. The component SIR curves have a period of \(20\) to \(30\) T_{SP}. In a model run that spans \(100\) T_{SP}, these curves can be considered to have a frequency of between three and five cycles per epidemic period. Their sum should be a wave with the same frequency—something like the Covid curve, with its four (or four and a half) prominent humps. In support of this thesis, when I let the model run to \(200\) T_{SP}, I get a sum curve with seven or eight peaks.

I am intrigued by the idea that an epidemic might arrive in cyclic waves not because of anything special about viral or human behavior but because of a mathematical process akin to wave interference. It’s such a cute idea, dressing up an obscure bit of counterintuitive mathematics and bringing it to bear on a matter of great importance to all of us. And yet, alas, a closer look at the Covid data suggests that nature doesn’t share my fondness for summing waves with random phases.

Figure 22, again based on data extracted from the *Times* archive, plots \(49\) curves, representing the time course of case counts in the Lower \(48\) states and the District of Columbia. I have separated them by region, and in each group I’ve labeled the trace with the highest peak. We already know that these curves yield a sum with four tall peaks; that’s where this whole investigation began. But the \(49\) curves do not support the notion that those peaks might be produced by summing randomly timed mini-epidemics. The oscillations in the \(49\) curves are *not* randomly timed; there are strong correlations between them. And many of the curves have multiple humps, which isn’t possible if each mini-epidemic is supposed to act like a SIR model that runs to completion.

Although these curves spoil a hypothesis I had found alluring, they also reveal some interesting facts about the Covid epidemic. I knew that the first wave was concentrated in New York City and surrounding areas, but I had not realized how much the second wave, in the summer of 2020, was confined to the country’s southern flank, from Florida all the way to California. The summer wave this year is also most intense in Florida and along the Gulf Coast. Coincidence? When I showed the graphs to a friend, she responded: “Air conditioning.”

Searching for the key to Covid, I’ve tried out three slightly whimsical notions: the possibility of a periodic signal, like the sunspot cycle, bringing us waves of infection on a regular schedule; feedback loops producing yo-yo dynamics in the case count; and randomly timed mini-epidemics that add up to a predictable, slow variation in the infection rate. In retrospect they still seem like ideas worth looking into, but none of them does a convincing job of explaining the data.

In my mind the big questions remain unanswered. In November of 2020 the daily tally of new Covid cases was above \(100{,}000\) and rising at a fearful rate. Three months later the infection rate was falling just as steeply. What changed between those dates? What action or circumstance or accident of fate blunted the momentum of the onrushing epidemic and forced it into retreat? And now, just a few months after the case count bottomed out, we are again above \(100{,}000\) cases per day and still climbing. What has changed again to bring the epidemic roaring back?

There are a couple of obvious answers to these questions. As a matter of fact, those answers are sitting in the back of the room, frantically waving their hands, begging me to call on them. First is the vaccination campaign, which has now reached half the U.S. population. The incredibly swift development, manufacture, and distribution of those vaccines is a wonder. In the coming months and years they are what will save us, if anything can. But it’s not so clear that vaccination is what stopped the big wave last winter. The sharp downturn in infection rates began in the first week of January, when vaccination was just getting under way in the U.S. On January 9 (the date when the decline began) only about \(2\) percent of the population had received even one dose. The vaccination effort reached a peak in April, when more than three million doses a day were being administered. By then, however, the dropoff in case numbers had stopped and reversed. If you want to argue that the vaccine ended the big winter surge, it’s hard to align causation with chronology.

On the other hand, the level of vaccination that has now been achieved should exert a powerful damping effect on any future waves. Removing half the people from the susceptible list may not be enough to reach herd immunity and eliminate the virus from the population, but it ought to be enough to turn a growing epidemic into a wilting one.

The SIR model of Figure 23 has the same parameters as the model of Figure 3 \((\beta = 0.6, \gamma = 0.2,\) implying \(\rho = 3.0)\), but \(50\) percent of the people are vaccinated at the start of the simulation. With this diluted population of susceptibles, essentially nothing happens for almost a year. The epidemic is dormant, if not quite defunct.

That’s the world we should be living in right now, according to the SIR model. Instead, today’s new case count is \(141{,}365\); almost \(81{,}556\) people are hospitalized with Covid infections; and 704 people have died. What gives? How can this be happening?

At this point I must acknowledge the other hand waving in the back of the room: the Delta variant, otherwise known as B.1.617.2. Half a dozen mutations in the viral spike protein, which binds to a human cell-surface receptor, have apparently made this new strain at least twice as contagious as the original one.

In Figure 24 contagiousness is doubled by increasing \(\rho\) from \(3.0\) to \(6.0\). That boost brings the epidemic back to life, although there is still quite a long delay before the virus becomes widespread in the unvaccinated half of the population.

The situation is may well be worse than the model suggests. All the models I have reported on here pretend that the human population is homogeneous, or thoroughly mixed. If an infected person is about to spread the virus, everyone in the country has the same probability of being the recipient. This assumption greatly simplifies the construction of the model, but of course it’s far from the truth. In daily life you most often cross paths with people like yourself—people from your own neighborhood, your own age group, your own workplace or school. Those frequent contacts are also people who share your vaccination status. If you are unvaccinated, you are not only more vulnerable to the virus but also more likely to meet people who carry it. This somewhat subtle birds-of-a-feather effect is what allows us to have “a pandemic of the unvaccinated.”

Recent reports have brought still more unsettling and unwelcome news, with evidence that even fully vaccinated people may sometimes spread the virus. I’m waiting for confirmation of that before I panic. (But I’m waiting with my face mask on.)

Having demonstrated that I understand nothing about the history of the epidemic in the U.S.—why it went up and down and up and down and up and down and up and down—I can hardly expect to understand the present upward trend. About the future I have no clue at all. Will this new wave tower over all the previous ones, or is it Covid’s last gasp? I can believe anything.

But let us not despair. This is not the zombie apocalypse. The survival of humanity is not in question. It’s been a difficult ordeal for the past 18 months, and it’s not over yet, but we can get through this. Perhaps, at some point in the not-too-distant future, we’ll even understand what’s going on.

Today *The New York Times* has published two articles raising questions similar to those asked here. David Leonhardt and Ashley Wu write a “morning newsletter” titled “Has Delta Peaked?” Apoorva Mandavilli, Benjamin Mueller, and Shalini Venugopal Bhagat ask “When Will the Delta Surge End?” I think it’s fair to say that the questions in the headlines are not answered in the articles, but that’s not a complaint. I certainly haven’t answered them either.

I’m going to take this opportunity to update two of the figures to include data through the end of August.

In *Figure 1r* the surge in case numbers that was just beginning back in late July has become a formidable sugarloaf peak. The open question is what happens next. Leonhardt and Wu make the optimistic observation that “The number of new daily U.S. cases has risen less over the past week than at any point since June.” In other words, we can celebrate a negative second derivative: The number of cases is still high and is still growing, but it’s growing slower than it was. And they cite the periodicity observed in past U.S. peaks and in those elsewhere as a further reason for hope that we may be near the turning point.

*Figure 22r* tracks where the new cases are coming from. As in earlier peaks, California, Texas, and Florida stand out.

*The New York Times* data archive for Covid-19 cases and deaths in the United States is available in this GitHub repository. The version I used in preparing this article, cloned on 21 July 2021, is identified as “commit c3ab8c1beba1f4728d284c7b1e58d7074254aff8″. You should be able to access the identical set of files through this link.

Source code for the SIR models and for generating the illustrations in this article is also available on GitHub. The code is written in the Julia programming language and organized in Pluto notebooks.

Bartlett, M. S. 1956. Deterministic and stochastic models for recurrent epidemics. In *Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Contributions to Biology and Problems of Health*, pp. 81–109. Berkeley: University of California Press.

Bartlett, M. S. 1957. Measles periodicity and community size. *Journal of the Royal Statistical Society, Series A (General)* 120(1):48–70.

Brownlee, John. 1917. An investigation into the periodicity of measles epidemics in London from 1703 to the present day by the method 0f the periodogram. *Philosophical Transactions of the Royal Society of London, Series B*, 208:225–250.

Daley, D. J., and J. Gani. 1999. *Epidemic Modelling: An Introduction*. Cambridge: Cambridge University Press.

Glaubitz, Alina, and Feng Fu. 2020. Oscillatory dynamics in the

dilemma of social distancing. *Proceedings of the Royal Society A* 476:20200686.

Kendall, David G. 1956. Deterministic and stochastic epidemics in closed populations. In *Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Contributions to Biology and Problems of Health*, pp. 149–165. Berkeley: University of California Press.

Kermack, W. O., and A. G. McKendrick. 1927. A contribution to the mathematical theory of epidemics. *Proceedings of the Royal Society of London, Series A* 115:700–721.

Lloyd, Alun L. 2001. Realistic distributions of infectious periods in epidemic models: changing patterns of persistence and dynamics. *Theoretical Population Biology* 60:59–71.

Smith?, Robert (editor). 2014. *Mathematical Modelling of Zombies*. Ottawa: University of Ottawa Press.

Over the years I’ve had several opportunities to play with Ising models and Monte Carlo methods, and I thought I had a pretty good grasp of the basic principles. But, you know, the more you know, the more you know you don’t know.

In 2019 I wrote a brief article on Glauber dynamics, a technique for analyzing the Ising model introduced by Roy J. Glauber, a Harvard physicist. In my article I presented an Ising simulation written in JavaScript, and I explained the algorithm behind it. Then, this past March, I learned that I had made a serious blunder. The program I’d offered as an illustration of Glauber dynamics actually implemented a different procedure, known as the Metropolis algorithm. Oops. (The mistake was brought to my attention by a comment signed “L. Y.,” with no other identifying information. Whoever you are, L. Y., thank you!)

A few days after L. Y.’s comment appeared, I tracked down the source of my error: I had reused some old code and neglected to modify it for its new setting. I corrected the program—only one line needed changing—and I was about to publish an update when I paused for thought. Maybe I could dismiss my goof as mere carelessness, but I realized there were other aspects of the Ising model and the Monte Carlo method where my understanding was vague or superficial. For example, I was not entirely sure where to draw the line between the Glauber and Metropolis procedures. (I’m even less sure now.) I didn’t know which features of the two algorithms are most essential to their nature, or how those features affect the outcome of a simulation. I had homework to do.

Since then, Monte Carlo Ising models have consumed most of my waking hours (and some of the sleeping ones). Sifting through the literature, I’ve found sources I never looked at before, and I’ve reread some familiar works with new understanding and appreciation. I’ve written a bunch of computer programs to clarify just which details matter most. I’ve dug into the early history of the field, trying to figure out what the inventors of these techniques had in mind when they made their design choices. Three months later, there are still soft spots in my knowledge, but it’s time to tell the story as best I can.

This is a long article—nearly 6,000 words. If you can’t read it all, I recommend playing with the simulation programs. There are five of them: 1, 2, 3, 4, 5. On the other hand, if you just can’t get enough of this stuff, you might want to have a look at the source code for those programs on GitHub. The repo also includes data and scripts for the graphs in this article.

Let’s jump right in with an Ising simulation. Below this paragraph is a grid of randomly colored squares, and beneath that a control panel. Feel free to play. Press the *Run* button, adjust the temperature slider, and click the radio buttons to switch back and forth between the Metropolis and the Glauber algorithms. The *Step* button slows down the action, showing one frame at a time. Above the grid are numerical readouts labeled *Magnetization* and *Local Correlation*; I’ll explain below what those instruments are monitoring.

The model consists of 10,000 sites, arranged in a \(100 \times 100\) square lattice, and colored either dark or light, indigo or mauve. In the initial condition (or after pressing the *Reset* button) the cells are assigned colors at random. Once the model is running, more organized patterns emerge. Adjacent cells “want” to have the same color, but thermal agitation disrupts their efforts to reach accord.

The lattice is constructed with “wraparound” boundaries: *periodic* boundary conditions. Imagine infinitely many copies of the lattice laid down like square tiles on an infinite plane.

When the model is running, changing the temperature can have a dramatic effect. At the upper end of the scale, the grid seethes with activity, like a boiling cauldron, and no feature survives for more than a few milliseconds. In the middle of the temperature range, large clusters of like-colored cells begin to appear, and their lifetimes are somewhat longer. When the system is cooled still further, the clusters evolve into blobby islands and isthmuses, coves and straits, all of them bounded by strangely writhing coastlines. Often, the land masses eventually erode away, or else the seas evaporate, leaving a featureless monochromatic expanse. In other cases broad stripes span the width or height of the array.

Whereas nudging the temperature control utterly transforms the appearance of the grid, the effect of switching between the two algorithms is subtler.

- At high temperature (5.0, say), both programs exhibit frenetic activity, but the turmoil in Metropolis mode is more intense.
- At temperatures near 3.0, I perceive something curious in the Metropolis program: Blobs of color seem to migrate across the grid. If I stare at the screen for a while, I see dense flocks of crows rippling upward or leftward; sometimes there are groups going both ways at once, with wings flapping. In the Glauber algorithm, blobs of color wiggle and jiggle like agitated amoebas, but they don’t go anywhere.
- At still lower temperatures (below about 1.5), the Ising world calms down. Both programs converge to the same monochrome or striped patterns, but Metropolis gets there faster.

I have been noticing these visual curiosities—the fluttering wings, the pulsating amoebas—for some time, but I have never seen them mentioned in the literature. Perhaps that’s because graphic approaches to the Ising model are of more interest to amateurs like me than to serious students of the underlying physics and mathematics. Nevertheless, I would like to understand where the patterns come from. (Some partial answers will emerge toward the end of this article.)

For those who want numbers rather than pictures, I offer the magnetization and local-correlation meters at the top of the program display. Magnetization is a global measure of the extent to which one color or the other dominates the lattice. Specifically, it is the number of dark cells minus the number of light cells, divided by the total number of cells:

\[M = \frac{\blacksquare - \square }{\blacksquare + \square}.\]

\(M\) ranges from \(-1\) (all light cells) through \(0\) (equal numbers of light and dark cells) to \(+1\) (all dark).

Local correlation examines all pairs of nearest-neighbor cells and tallies the number of like pairs minus the number of unlike pairs, divided by the total number of pairs:

\[R = \frac{(\square\square + \blacksquare\blacksquare) - (\square\blacksquare + \blacksquare\square) }{\square\square + \square\blacksquare + \blacksquare\square + \blacksquare\blacksquare}.\]

Again the range is from \(-1\) to \(+1\). These two quantities are both measures of order in the Ising system, but they focus on different spatial scales, global *vs.* local. All three of the patterns in Figure 1 have magnetization \(M = 0\), but they have very different values of local correlation \(R\).

The Ising model was invented 100 years ago by Wilhelm Lenz of the University of Hamburg, who suggested it as a thesis project for his student Ernst Ising. It was introduced as a model of a permanent magnet.

A real ferromagnet is a quantum-mechanical device. Inside, electrons in neighboring atoms come so close together that their wave functions overlap. Under these circumstances the electrons can reduce their energy slightly by aligning their spin vectors. According to the rules of quantum mechanics, an electron’s spin must point in one of two directions; by convention, the directions are labeled *up* and *down*. The ferromagnetic interaction favors pairings with both spins up or both down. Each spin generates a small magnetic dipole moment. Zillions of them acting together hold your grocery list to the refrigerator door.

In the Ising version of this structure, the basic elements are still called spins, but there is nothing twirly about them, and nothing quantum mechanical either. They are just abstract variables constrained to take on exactly two values. It really doesn’t matter whether we name the values up and down, mauve and indigo, or plus and minus. (Within the computer programs, the two values are \(+1\) and \(-1\), which means that flipping a spin is just a matter of multiplying by \(-1\).) In this article I’m going to refer to up/down spins and dark/light cells interchangeably, adopting whichever term is more convenient at the moment.

As in a ferromagnet, nearby Ising spins want to line up in parallel; they reduce their energy when they do so. This urge to match spin directions (or cell colors) extends only to nearest neighbors; more distant sites in the lattice have no influence on one another. In the two-dimensional square lattice—the setting for all my simulations—each spin’s four nearest neighbors are the lattice sites to the north, east, south, and west (including “wraparound” neighbors for cells on the boundary lines).

If neighboring spins want to point the same way, why don’t they just go ahead and do so? The whole system could immediately collapse into the lowest-energy configuration, with all spins up or all down. That does happen, but there are complicating factors and countervailing forces. Neighborhood conflicts are the principal complication: Flipping your spin to please one neighbor may alienate another. The countervailing influence is heat. Thermal fluctuations can flip a spin even when the change is energetically unfavorable.

The behavior of the Ising model is easiest to understand at the two extremities of the temperature scale. As the temperature \(T\) climbs toward infinity, thermal agitation completely overwhelms the cooperative tendencies of adjacent spins, and all possible states of the system are on an equal footing. The lattice becomes a random array of up and down spins, each of which is rapidly changing its orientation. At the opposite end of the scale, where \(T\) approaches zero, the system freezes. As thermal fluctuations subside, the spins sooner or later sink into the orderly, low-energy, fully magnetized state—although “sooner or later” can stretch out toward the age of the universe.

Things get more complicated between these extremes. Experiments with real magnets show that the transition from a hot random state to a cold magnetized state is not gradual. As the material is cooled, spontaneous magnetization appears suddenly at a critical temperature called the Curie point (about 840 K in iron). Lenz and Ising wondered whether this abrupt onset of magnetization could be seen in their simple, abstract model. Ising was able to analyze only a one-dimensional version of the system—a line or ring of spins—and he was disappointed to see no sharp phase transition. He thought this result would hold in higher dimensions as well, but on that point he was later proved wrong.

The idealized phase diagram in Figure 2 (borrowed with amendments from my 2019 article) outlines the course of events for a two-dimensional model. To the right, above the critical temperature \(T_c\), there is just one phase, in which up and down spins are equally abundant on average, although they may form transient clusters of various sizes. Below the critical point the diagram has two branches, leading to all-up and all-down states at zero temperature. As the system is cooled through \(T_c\) it must follow one branch or the other, but which one is chosen is a matter of chance.

The immediate vicinity of \(T_c\) is the most interesting region of the phase diagram. If you scroll back up to Program 1 and set the temperature near 2.27, you’ll see filigreed patterns of all possible sizes, from single pixels up to the diameter of the lattice. The time scale of fluctuations also spans orders of magnitude, with some structures winking in and out of existence in milliseconds and others lasting long enough to test your patience.

All of this complexity comes from a remarkably simple mechanism. The model makes no attempt to capture all the details of ferromagnet physics. But with minimal resources—binary variables on a plain grid with short-range interactions—we see the spontaneous emergence of cooperative, collective phenomena, as self-organizing patterns spread across the lattice. The model is not just a toy. Ten years ago Barry M. McCoy and Jean-Marie Maillard wrote:

It may be rightly said that the two dimensional Ising model… is one of the most important systems studied in theoretical physics. It is the first statistical mechanical system which can be exactly solved which exhibits a phase transition.

As I see it, the main question raised by the Ising model is this: At a specified temperature \(T\), what does the lattice of spins look like? Of course “look like” is a vague notion; even if you know the answer, you’ll have a hard time communicating it except by showing pictures. But the question can be reformulated in more concrete ways. We might ask: Which configurations of the spins are most likely to be seen at temperature \(T\)? Or, conversely: Given a spin configuration \(S\), what is the probability that \(S\) will turn up when the lattice is at temperature \(T\)?

Intuition offers some guidance on these points. Low-energy configurations should always be more likely than high-energy ones, at any finite temperature. Differences in energy should have a stronger influence at low temperature; as the system gets warmer, thermal fluctuations can mask the tendency of spins to align. These rules of thumb are embodied in a little fragment of mathematics at the very heart of the Ising model:

\[W_B = \exp\left(\frac{-E}{k_B T}\right).\]

Here \(E\) is the energy of a given spin configuration, found by scanning through the entire lattice and tallying the number of nearest-neighbor pairs that have parallel *vs.* antiparallel spins. In the denominator, \(T\) is the absolute temperature and \(k_B\) is Boltzmann’s constant, named for Ludwig Boltzmann, the Austrian maestro of statistical mechanics. The entire expression is known as the Boltzmann weight, and it determines the probability of observing any given configuration.

In standard physical units the constant \(k_B\) is about \(10^{-23}\) joules per kelvin, but the Ising model doesn’t really live in the world of joules and kelvins. It’s a mathematical abstraction, and we can measure its energy and temperature in any units we choose. The convention among theorists is to set \(k_B = 1\), and thereby eliminate it from the formula altogether. Then we can treat both energy and temperature as if they were pure numbers, without units.

Figure 3 confirms that the equation for the Boltzmann weight yields curves with an appropriate general shape. Lower energies correspond to higher weights, and lower temperatures yield steeper slopes. These features make the curves plausible candidates for describing a physical system such as a ferromagnet. Proving that they are not only good candidates but the unique, true description of a ferromagnet is a mathematical and philosophical challenge that I decline to take on. Fortunately, I don’t have to. The model, unlike the magnet, is a human invention, and we can make it obey whatever laws we choose. In this case let’s simply decree that the Boltzmann distribution gives the correct relation between energy, temperature, and probability.

Note that the Boltzmann weight is said to *determine* a probability, not that it *is* a probability. It can’t be. \(W_B\) can range from zero to infinity, but a probability must lie between zero and one. To get the probability of a given configuration, we need to calculate its Boltzmann weight and then divide by \(Z\), the sum of the weights of all possible configurations—a process called normalization. For a model with \(10{,}000\) spins there are \(2^{10{,}000}\) configurations, so normalization is not a task to be attempted by direct, brute-force arithmetic.

It’s a tribute to the ingenuity of mathematicians that the impossible-sounding problem of calculating \(Z\) has in fact been conquered. In 1944 Lars Onsager published a complete solution of the two-dimensional Ising model—complete in the sense that it allows you to calculate the magnetization, the energy per spin, and a variety of other properties, all as a function of temperature. I would like to say more about Onsager’s solution, but I can’t. I’ve tried more than once to work my way through his paper, but it defeats me every time. I would understand nothing at all about this result if it weren’t for a little help from my friends. Barry Cipra, in a 1987 article, and Cristopher Moore and Stephan Mertens, in their magisterial tome *The Nature of Computation*, rederive the solution by other means. They relate the Ising model to more tractable problems in graph theory, where I am able to follow most of the steps in the argument. Even in these lucid expositions, however, I find the ultimate result unilluminating. I’ll cite just one fact emerging from Onsager’s difficult algebraic exercise. The exact location of the critical temperature, separating the magnetic from the nonmagnetic phases, is:

\[\frac{2}{\log{(1 + \sqrt{2})}} \approx 2.269185314.\]

For those intimidated by the icy crags of Mt. Onsager, I can recommend the warm blue waters of Monte Carlo. The math is easier. There’s a clear, mechanistic connection between microscopic events and macroscopic properties. And there are the visualizations—that lively dance of the mauve and the indigo—which offer revealing glimpses of what’s going on behind the mathematical curtains. All that’s missing is exactness. Monte Carlo studies can pin down \(T_C\) to several decinal places, but they will never give the algebraic expression found by Onsager.

The Monte Carlo method was devised in the years immediately after World War II by mathematicians and physicists working at the Los Alamos Laboratory in New Mexico.

The second application was the design of nuclear weapons. The problem at hand was to understand the diffusion of neutrons through uranium and other materials. When a wandering neutron collided with an atomic nucleus, the neutron might be scattered in a new direction, or it might be absorbed by the nucleus and effectively disappear, or it might induce fission in the nucleus and thereby give rise to several more neutrons. Experiments had provided reasonable estimates of the probability of each of these events, but it was still hard to answer the crucial question: In a lump of fissionable matter with a particular shape, size, and composition, would the nuclear chain reaction fizzle or boom? The Monte Carlo method offered an answer by simulating the paths of thousands of neutrons, using random numbers to generate events with the appropriate probabilities. The first such calculations were done with the ENIAC, the vacuum-tube computer built at the University of Pennsylvania. Later the work shifted to the MANIAC, built at Los Alamos.

This early version of the Monte Carlo method is now sometimes called simple or naive Monte Carlo; I have also seen the term hit-or-miss Monte Carlo. The scheme served well enough for card games and for weapons of mass destruction, but the Los Alamos group never attempted to apply it to a problem anything like the Ising model. It would not have worked if they had tried. I know that because textbooks say so, but I had never seen any discussion of exactly *how* the model would fail. So I decided to try it for myself.

My plan was indeed simple and naive and hit-or-miss. First I generated a random sample of \(10{,}000\) spin configurations, drawn independently with uniform probability from the set of all possible states of the lattice. This was easy to do: I constructed the samples by the computational equivalent of tossing a fair coin to assign a value to each spin. Then I calculated the energy of each configuration and, assuming some definite temperature \(T\), assigned a Boltzmann weight. I still couldn’t convert the Boltzmann weights into true probabilities without knowing the sum of all \(2^{10{,}000}\) weights, but I could sum up the weights of the \(10{,}000\) configurations in the sample. Dividing each weight by this sum yields a *relative* probability: It estimates how frequently (at temperature \(T\)) we can expect to see a member of the sample relative to all the other members.

At extremely high temperatures—say \(T \gt 1{,}000\)—this procedure works pretty well. That’s because all configurations are nearly equivalent at those temperatures; they all have about the same relative probability. On cooling the system, I hoped to see a gradual skewing of the relative probabilities, as configurations with lower energy are given greater weight. What happens, however, is not a gradual skewing but a spectacular collapse. At \(T = 2\) the lowest-energy state in my sample had a relative probability of \(0.9999999979388337\), leaving just \(0.00000000206117\) to be shared among the other \(9{,}999\) members of the set.

The fundamental problem is that a small sample of randomly generated lattice configurations will almost never include any states that are commonly seen at low temperature. The histograms of Figure 4 show Boltzmann distributions at various temperatures *(blue)* compared with the distribution of randomly generated states *(red)*. The random distribution is a slender peak centered at zero energy. There is slight overlap with the Boltzmann distribution at \(T = 50\), but none whatever for lower temperatures.

There’s actually some good news in this fiasco. The failure of random sampling indicates that the interesting states of the Ising system—those which give the model its distinctive behavior—form a tiny subset buried within the enormous space of \(2^{10{,}000}\) configurations. If we can find a way to focus on that subset and ignore the rest, the job will be much easier.

The means to focus more narrowly came with a second wave of Monte Carlo methods, also emanating from Los Alamos. The foundational document was a paper titled “Equation of State Calculations by Fast Computing Machines,” published in 1953. Among the five authors, Nicholas Metropolis was listed first (presumably for alphabetical reasons), and his name remains firmly attached to the algorithm presented in the paper.

With admirable clarity, Metropolis *et al.* explain the distinction between the old and the new Monte Carlo: “[I]nstead of choosing configurations randomly, then weighting them with \(\exp(-E/kT)\), we choose configurations with a probability \(\exp(-E/kT)\) and weight them evenly.” Starting from an arbitrary initial state, the scheme makes small, random modifications, with a bias favoring configurations with a lower energy (and thus higher Boltzmann weight), but not altogether excluding moves to higher-energy states. After many moves of this kind, the system is almost certain to be meandering through a neighborhood that includes the most probable configurations. Methods based on this principle have come to be known as MCMC, for Markov chain Monte Carlo. The Metropolis algorithm and Glauber dynamics are the best-known exemplars.

Roy Glauber also had Los Alamos connections. He worked there during World War II, in the same theory division that was home to Ulam, John von Neumann, Hans Bethe, Richard Feynman, and many other notables of physics and mathematics. But Glauber was a very junior member of the group; he was 18 when he arrived, and a sophomore at Harvard. His one paper on the Ising model was published two decades later, in 1963, and makes no mention of his former Los Alamos colleagues. It also makes no mention of Monte Carlo methods; nevertheless, Glauber dynamics has been taken up enthusiastically by the Monte Carlo community.

When applied to the Ising model, both the Metropolis algorithm and Glauber dynamics work by focusing attention on a single spin at each step, and either flipping the selected spin or leaving it unchanged. Thus the system passes through a sequence of states that differ by at most one spin flip. Statistically speaking, this procedure sounds a little dodgy. Unlike the naive Monte Carlo approach, where successive states are completely independent, MCMC generates configurations they are closely correlated. It’s a biased sample. To overcome the bias, the MCMC process has to run long enough for the correlations to fade away. With a lattice of \(N\) sites, a common protocol retains only every \(N\)th sample, discarding all those in between.

The mathematical justification for the use of correlated samples is the theory of Markov chains, devised by the Russian mathematician A. A. Markov circa 1900. It is a tool for calculating probabilities when each event depends on the previous event. And, in the Monte Carlo method, it allows one to work with those probabilities without getting bogged down in the morass of normalization.

The Metropolis and the Glauber algorithms are built on the same armature. They both rely on two main components: a visitation sequence and an acceptance function. The visitation sequence determines which lattice site to visit next; in effect, it shines a spotlight on one selected spin, proposing to flip it to the opposite orientation. The acceptance function determines whether to accept this proposal (and flip the spin) or reject it (and leave the existing spin direction unchanged). Each iteration of this two-phase process constitutes one “microstep” of the Monte Carlo procedure. Repeating the procedure \(N\) times constitutes a “macrostep.” Thus one macrostep amounts to one microstep per spin.

In the Metropolis algorithm, the visitation order is deterministic. The program sweeps through the lattice methodically, repeating the same sequence of visits in every macrostep. The original 1953 presentation of the algorithm did not prescribe any specific sequence, but the procedure was clearly designed to visit each site exactly once during a sweep. The version of the Metropolis algorithm in Program 1 adopts the most obvious deterministic option: “typewriter order.” The program chugs through the first row of the lattice from left to right, then goes through the second row in the same way, and so on down to the bottom.

Glauber dynamics takes a different approach: At each microstep the algorithm selects a single spin at random, with uniform probability, from the entire set of \(N\) spins. In other words, every spin has a \(1 / N\) chance of being chosen at each microstep, whether or not it has been chosen before. A macrostep lasts for \(N\) microsteps, but the procedure does *not* guarantee that every spin will get a turn during every sweep. Some sites will be passed over, while others are visited more than once. Still, as the number of steps goes to infinity, all the sites eventually get equal attention.

So much for the visitation sequence; now on to the acceptance function. It has three parts:

- Calculate \(\Delta E\), the change in energy that would result from flipping the selected spin \(s\). To determine this value, we need to examine \(s\) itself and its four nearest neighbors.
- Based on \(\Delta E\) and the temperature \(T\), calculate the probability \(p\) of flipping spin \(s\).
- Generate a random number \(r\) in the interval \([0, 1)\). If \(r \lt p\), flip the selected spin; otherwise leave it as is.

Part 2 of the acceptance rule calls for a mathematical function that maps values of \(\Delta E\) and \(T\) to a probability \(p\). To be a valid probability, \(p\) must be confined to the interval \([0, 1]\). To make sense in the context of the Monte Carlo method, the function should assign a higher probability to spin flips that reduce the system’s energy, without totally excluding those that bring an energy increase. And this preference for negative \(\Delta E\) should grow sharper as T gets lower. The specific functions chosen by the Metropolis and the Glauber algorithms satisfy both of these criteria.

Let’s begin with the Glauber acceptance function, which I’m going to call the *G*-rule:

\[p = \frac{e^{-\Delta E/T}}{1 + e^{-\Delta E/T}}.\]

Parts of this equation should look familiar. The expression for the Boltzmann weight, \(e^{-\Delta E/T}\), appears twice, except that the configuration energy \(E\) is replaced by \(\Delta E\), the change in energy when a specific spin is flipped. *G*-rule stays within the bounds of \(0\) to \(1\). The curve at right shows the probability distribution for \(T = 2.7\), near the critical point for the onset of magnetization. To get a qualitative understanding of the form of this curve, consider what happens when \(\Delta E\) grows without bound toward positive infinity: The numerator of the fraction goes to \(0\) while the denominator goes to \(1\), leaving a quotient that approaches \(0.0\). At the other end of the curve, as \(\Delta E\) goes to negative infinity, both numerator and denominator increase without limit, and the probability approaches (but never quite reaches) \(1.0\). Between these extremes, the curve is symmetrical and smooth. It looks like it would make a pleasant ski run.

The Metropolis acceptance criterion also includes the expression \(e^{-\Delta E/T}\), but the function and the curve are quite different. The acceptance probability is defined in a piecewise fashion:

\[p = \left\{\begin{array}{cl}

1 & \text { if } \quad \Delta E \leq 0 \\

e^{-\Delta E/T} & \text { if } \quad \Delta E>0

\end{array}\right.\]

In words, the rule says: If flipping a spin would reduce the energy of the system or leave it unchanged, always do it; otherwise, flip the spin with probability \(e^{-\Delta E/T}\). The probability curve *(left)* has a steep escarpment; if this one is a ski slope, it rates a black diamond. Unlike the smooth and symmetrical Glauber curve, this one has a sharp corner, as well as a strong bias. Consider a spin with \(\Delta E = 0\). Glauber flips such a spin with probability \(1/2\), but Metropolis *always* flips it.

The graphs in Figure 7 compare the two acceptance functions over a range of temperatures. The curves differ most at the highest temperatures, and they become almost indistinguishable at the lowest temperatures, where both curves approximate a step function. Although both functions are defined over the entire real number line, the two-dimensional Ising model allows \(\Delta E\) to take on only five distinct values: \(–8, –4, 0, +4,\) and \(+8\). Thus the Ising probability functions are never evaluated anywhere other than the positions marked by colored dots.

Here are JavaScript functions that implement a macrostep in each of the algorithms, with their differences in both visitation sequence and acceptance function:

```
function runMetro() {
for (let y = 0; y < gridSize; y++) {
for (let x = 0; x < gridSize; x++) {
let deltaE = calcDeltaE(x, y);
let boltzmann = Math.exp(-deltaE/temperature);
if ((deltaE <= 0) || (Math.random() < boltzmann)) {
lattice[x][y] *= -1;
}
}
}
drawLattice();
}
function runGlauber() {
for (let i = 0; i < N; i++) {
let x = Math.floor(Math.random() * gridSize);
let y = Math.floor(Math.random() * gridSize);
let deltaE = calcDeltaE(x, y);
let boltzmann = Math.exp(-deltaE/temperature);
if (Math.random() < (boltzmann / (1 + boltzmann))) {
lattice[x][y] *= -1;
}
}
drawLattice();
}
```

(As I mentioned above, the rest of the source code for the simuations is available on GitHub.)

We’ve seen that the Metropolis and the Glauber algorithms differ in their choice of both visitation sequence and acceptance function. They also produce different patterns or textures when you watch them in action on the computer screen. But what about the numbers? Do they predict different properties for the Ising ferromagnet?

A theorem mentioned throughout the MCMC literature says that these two algorithms (and others like them) should give identical results when properties of the model are measured at thermal equilibrium. I have encountered this statement many times in my reading, but until a few weeks ago I had never tested it for myself. Here are some magnetization data that look fairly convincing:

T = 1.0 | T = 2.0 | T = 2.7 | T = 3.0 | T = 5.0 | T = 10.0 | |
---|---|---|---|---|---|---|

Metropolis | 0.9993 | 0.9114 | 0.0409 | 0.0269 | 0.0134 | 0.0099 |

Glauber | 0.9993 | 0.9118 | 0.0378 | 0.0274 | 0.0136 | 0.0100 |

The table records the absolute value of the magnetization in Metropolis and Glauber simulations at various temperatures. Five of the six measurements differ by less than \(0.001\); the exception cames at \(T = 2.7\), near the critical point, where the difference rises to about \(0.003\). Note that the results are consistent with the presence of a phase transition: Magnetization remains close to \(0\) down to the critical point and then approaches \(1\) at lower temperatures. (By reporting the magnitude, or absolute value, of the magnetization, we treat all-up and all-down states as equivalent.)

I made the measurements by first setting the temperature and then letting the simulation run for at least 1,000 macrosteps in order to reach an equilibrium condition.

When I first looked at these results and saw the close match between Metropolis and Glauber, I felt a twinge of paradoxical surprise. I call it paradoxical because I knew before I started what I would see, and that’s exactly what I *did* see, so obviously I should not have been surprised at all. But some part of my mind didn’t get that memo, and as I watched the two algorithms converge to the same values all across the temperature scale, it seemed remarkable.

The theory behind this convergence was apparently understood by the pioneers of MCMC in the 1950s. The theorem states that any MCMC algorithm will produce the same distribution of states at equilibrium, as long as the algorithm satisfies two conditions, called ergodicity and detailed balance.

*ergodic* was coined by Boltzmann, and is usually said to have the Greek roots εργον οδος, meaning something like “energy path.” Giovanni Gallavotti disputes this etymology, suggesting a derivation from εργον ειδoς, which he translates as “monode with a given energy.” Take your pick.*cul de sac* states you might wander into and never be able to escape, or border walls that divide the space into isolated regions. The Metropolis and Glauber algorithms satisfy this condition because every transition between states has a nonzero probability. (In both algorithms the acceptance probability comes arbitrarily close to zero but never touches it.) In the specific case of the \(100 \times 100\) lattice I’ve been playing with, any two states are connected by a path of no more than \(10{,}000\) steps.

Both algorithms also exhibit detailed balance, which is essentially a requirement of reversibility. Suppose that while watching a model run, you observe a transition from state \(A\) to state \(B\). Detailed balance says that if you continue observing long enough, you will see the inverse transition \(B \rightarrow A\) with the same frequency as \(A \rightarrow B\). Given the shapes of the acceptance curves, this assertion may seem implausible. If \(A \rightarrow B\) is energetically favorable, then \(B \rightarrow A\) must be unfavorable, and it will have a lower probability. But there’s another factor at work here. Remember we are assuming the system is in equilibrium, which implies that the occupancy of each state—or the amount of time the system spends in that state—is proportional to the state’s Boltzmann weight. Because the system is more often found in state \(B\), the transition \(B \rightarrow A\) has more chances to be chosen, counterbalancing the lower intrinsic probability.

The claim that Metropolis and Glauber yield identical results applies only when the Ising system is in equilibrium—poised at the eternal noon where the sun stands still and nothing ever changes. For Metropolis and his colleagues at Los Alamos in the early 1950s, understanding the equilibrium behavior of a computational model was challenge enough. They were coaxing answers from a computer with about four kilobytes of memory. Ten years later, however, Glauber wanted to look beyond equilibrium. For example, he wanted to know what happens when the temperature suddenly changes. How do the spins reorganize themselves during the transient period between one equilibrium state and another? He designed his version of the Ising model specifically to deal with such dynamic situations. He wrote in his 1963 paper:

If the mathematical problems of equilibrium statistical mechanics are great, they are at least relatively well-defined. The situation is quite otherwise in dealing with systems which undergo large-scale changes with time…. We have attempted, therefore, to devise a form of the Ising model whose behavior can be followed exactly, in statistical terms, as a function of time.

Clearly, in this dynamic situation, the algorithms are not identical or interchangeable. The Metropolis program adapts more quickly to the cooler environment; Glauber produces a slower but steadier rise in magnetization. The curves differ in shape, with Metropolis exhibiting a distinctive “knee” where the slope flattens. I want to know what causes these differences, but before digging into that question it seems important to understand why *both* algorithms are so agonizingly slow. At the right edge of the graph the blue Metropolis curve is approaching the equilibrium value of magnetization (which is about 0.91), but it has taken 7,500 Monte Carlo macrosteps (or 75 million microsteps) to get there. The red Glauber curve will require many more. What’s the holdup?

To put this sluggishness in perspective, let’s look at the behavior of local spin correlations measured under the same circumstances. Graphing the average nearest-neighbor correlation following a sudden temperature drop produces these hockey-stick curves:

The response is dramatically faster; both algorithms reach quite high levels of local correlation within just a few macrosteps.

For a hint of why local correlations grow so much faster than global magnetization, it’s enough to spend a few minutes watching the Ising simulation evolve on the computer screen. When the temperature plunges from warm \(T = 5\) to frigid \(T = 2\), nearby spins have a strong incentive to line up in parallel, but magnetization does not spread uniformly across the entire lattice. Small clusters of aligned spins start expanding, and they merge with other clusters of the same polarity, thereby growing larger still. It doesn’t take long, however, before clusters of opposite polarity run into one another, blocking further growth for both. From then on, magetization is a zero-sum game: The up team can win only if the down team loses.

Figure 10 shows the first few Monte Carlo macrosteps following a flash freeze. The initial configuration at the upper left reflects the high-temperature state, with a nearly random, salt-and-pepper mix of up and down spins. The rest of the snapshots (reading left to right and top to bottom) show the emergence of large-scale order. Prominent clusters appear after the very first macrostep, and by the second or third step some of these blobs have grown to include hundreds of lattice sites. But the rate of change becomes sluggish thereafter. The balance of power may tilt one way and then the other, but it’s hard for either side to gain a permanent advantage. The mottled, camouflage texture will persist for hundreds or thousands of steps.

If you choose a single spin at random from such a mottled lattice, you’ll almost surely find that it lives in an area where most of the neighbors have the same orientation. Hence the high levels of local correlation. But that fact does *not* imply that the entire array is approaching unanimity. On the contrary, the lattice can be evenly divided between up and down domains, leaving a net magnetization near zero. (Yes, it’s like political polarization, where homogeneous states add up to a deadlocked nation.)

The images in Figure 11 show three views of the same state of an Ising lattice. At left is the conventional representation, with sinuous, interlaced territories of nearly pure up and down spins. The middle panel shows the same configuration recolored according to the local level of spin correlation. The vast majority of sites *(lightest hue)* are surrounded by four neighbors of the same orientation; they correspond to *both* the mauve and the indigo regions of the leftmost image. Only along the boundaries between domains is there any substantial conflict, where darker colors mark cells whose neighbors include spins of the opposite orientation. The panel at right highlights a special category of sites—those with exactly two parallel and two antiparallel neighbors. They are special because they are tiny neutral territories wedged between the contending factions. Flipping such a spin does not alter its correlation status; both before and after it has two like and two unlike neighbors. Flipping a neutral spin also does not alter the total energy of the system. But it *can* shift the magnetization. Indeed, flipping such “neutral” spins is the main agent of evolution in the Ising system at low temperature.

The struggle to reach full magnetization in an Ising lattice looks like trench warfare. Contending armies, almost evenly matched, face off over the boundary lines between up and down territories. All the action is along the borders; nothing that happens behind the lines makes much difference. Even along the boundaries, some sections of the front are static. If a domain margin is a straight line parallel to the \(x\) or \(y\) axis, the sites on each side of the border have three friendly neighbors and only one enemy; they are unlikely to flip. The volatile neutral sites that make movement possible appear only at corners and along diagonals, where neighborhoods are evenly split.

*only* neutral spins. They have the pleasant property of conserving energy, which is not true of the Metropolis and Glauber algorithms.

From these observations and ruminations I feel I’ve acquired some intuition about why my Monte Carlo simulations bog down during the transition from a chaotic to an ordered state. But why is the Glauber algorithm even slower than the Metropolis?

Since the schemes differ in two features—the visitation sequence and the acceptance function—it makes sense to investigate which of those features has the greater effect on the convergence rate. That calls for another computational experiment.

The tableau below is a mix-and-match version of the MCMC Ising simulation. In the control panel you can choose the visitation order and the acceptance function independently. If you select a deterministic visitation order and the *M*-rule acceptance function, you have the classical Metropolis algorithm. Likewise random order and the *G*-rule correspond to Glauber dynamics. But you can also pair deterministic order with the *G*-rule or random order with the *M*-rule. (The latter mixed-breed choice is what I unthinkingly implemented in my 2019 program.)

I have also included an acceptance rule labeled *M**, which I’ll explain below.

Watching the screen while switching among these alternative components reveals that all the combinations yield different visual textures, at least at some temperatures. Also, it appears there’s something special about the pairing of deterministic visitation order with the *M*-rule acceptance function (*i.e.*, the standard Metropolis algorithm).

Try setting the temperature to 2.5 or 3.0. I find that the distinctive sensation of fluttery motion—bird flocks migrating across the screen—appears *only* with the deterministic/*M*-rule combination. With all other pairings, I see amoeba-like blobs that grow and shrink, fuse and divide, but there’s not much coordinated motion.

Now lower the temperature to about 1.5, and alternately click *Run* and *Reset* until you get a persistent bold stripe that crosses the entire grid either horizontally or vertically. *M*-rule combination is different from all the others. With this mode, the stripe appears to wiggle across the screen like a millipede, either right to left or bottom to top. Changing either the visitation order or the acceptance function suppresses this peristaltic motion; the stripe may still have pulsating bulges and constrictions, but they’re not going anywhere.

These observations suggest some curious interactions between the visitation order and the acceptance function, but they do not reveal which factor gives the Metropolis algorithm its speed advantage. Using the same program, however, we can gather some statistical data that might help answer the question.

These curves were a surprise to me. From my earlier experiments I already knew that the Metropolis algorithm—the combination of elements in the blue curve—would outperform the Glauber version, corresponding to the red curve. But I expected the acceptance function to account for most of the difference. The data do not support that supposition. On the contrary, they suggest that both elements matter, and the visitation sequence may even be the more important one. A deterministic visitation order beats a random order no matter which acceptance function it is paired with.

My expectations were based mainly on discussions of the “mixing time” for various Monte Carlo algorithms. Mixing time is the number of steps needed for a simulation to reach equilibrium from an arbitrary initial state, or in other words the time needed for the system to lose all memory of how it began. If you care only about equilibrium properties, then an algorithm that offers the shortest mixing time is likely to be preferred, since it also minimizes the number of CPU cycles you have to waste before you can start taking data. Discussions of mixing time tend to focus on the acceptance function, not the visitation sequence. In particular, the *M*-rule acceptance function of the Metropolis algorithm was explicitly designed to minimize mixing time.

What I am measuring in my experiments is not exactly mixing time, but it’s closely related. Going from an arbitrary initial state to equilibrium at a specified temperature is much like a transition from one temperature to another. What’s going on inside the model is similar. Thus if the acceptance function determines the mixing time, I would expect it also to be the major factor in adapting to a new temperature regime.

On the other hand, I can offer a plausible-sounding theory of why visitation order might matter. The deterministic model scans through all \(10{,}000\) lattice sites during every Monte Carlo macrostep; each such sweep is guaranteed to visit every site exactly once. The random order makes no such promise. In that algorithm, each microstep selects a site at random, whether or not it has been visited before. A macrostep concludes after \(10{,}000\) such random choices. Under this protocol some sites are passed over without being selected even once, while others are chosen two or more times. How many sites are likely to be missed? During each microstep, every site has the same probability of being chosen, namely \(1 / 10{,}000\). Thus the probability of *not* being selected on any given turn is \(9{,}999 / 10{,}000\). For a site to remain unvisited throughout an entire macrostep, it must be passed over \(10{,}000\) times in a row. The probability of that event is \((9{,}999 / 10{,}000)^{10{,}000}\), which works out to about \(0.368\).

Excluding more than a third of the sites on every pass through the lattice seems certain to have *some* effect on the outcome of an experiment. In the long run the random selection process is fair, in the sense that every spin is sampled at the same frequency. But the rate of convergence to the equilibrium state may well be lower.

There are also compelling arguments for the importance of the acceptance function. A key fact mentioned by several authors is that the *M* acceptance rule leads to more spin flips per Monte Carlo step. If the energy change of a proposed flip is favorable or neutral, the *M*-rule *always* approves the flip, whereas the *G*-rule rejects some proposed flips even when they lower the energy. Indeed, for all values of \(T\) and \(\Delta E\) the *M*-rule gives a higher probability of acceptance than the *G*-rule does. This liberal policy—if in doubt, flip—allows the *M*-rule to explore the space of all possible spin configurations more rapidly.

The discrete nature of the Ising model, with just five possible values of \(\Delta E\), introduces a further consideration. At \(\Delta E = \pm 4\) and at \(\Delta E = \pm 8\), the *M*-rule and the *G*-rule don’t actually differ very much when the temperature is below the critical point *(see Figure 7)*. The two curves diverge only at \(\Delta E = 0\): The *M*-rule invariably flips a spin in this circumstance, whereas the *G*-rule does so only half the time, assigning a probability of \(0.5\). This difference is important because the lattice sites where \(\Delta E = 0\) are the ones that account for almost all of the spin flips at low temperature. These are the neutral sites highlighted in the right panel ofFigure 11, the ones with two like and two unlike neighbors.

This line of thought leads to another hypothesis. Maybe the big difference between the Metropolis and the Glauber algorithms has to do with the handling of this single point on the acceptance curve. And there’s an obvious way to test the hypothesis: Simply change the *M*-rule at this one point, having it toss a coin whenever \(\Delta E = 0\). The definition becomes:

\[p = \left\{\begin{array}{cl}

1 & \text { if } \quad \Delta E \lt 0 \\

\frac{1}{2} & \text { if } \quad \Delta E = 0 \\

e^{-\Delta E/T} & \text { if } \quad \Delta E>0

\end{array}\right.\]

This modified acceptance function is the *M** rule offered as an option in Program 2. Watching it in action, I find that switching the Metropolis algorithm from *M* to *M** robs it of its most distinctive traits: At high temperature the fluttering birds are banished, and at low temperature the wiggling worms are immobilized. The effects on convergence rates are also intriguing. In the Metropolis algorithm, replacing *M* with *M** greatly diminishes convergence speed, from a standout level to just a little better than average. At the same time, in the Glauber algorithm replacing *G* with *M** brings a considerable performance improvement; when combined with random visitation order, *M** is superior not only to *G* but also to *M*.

I don’t know how to make sense of all these results except to suggest that both the visitation order and the acceptance function have important roles, and non-additive interactions between them may also be at work. Here’s one further puzzle. In all the experiments described above, the Glauber algorithm and its variations respond to a disturbance more slowly than Metropolis. But before dismissing Glauber as the perennial laggard, take a look at Figure 14.

Here we’re observing a transition from low to high temperature, the opposite of the situation discussed above. When going in this direction—from an orderly phase to a chaotic one, melting rather than freezing—both algorithms are quite zippy, but Glauber is a little faster than Metropolis. Randomness, it appears, is good for randomization. That sounds sensible enough, but I can’t explain in any detail how it comes about.

Up to this point, a deterministic visitation order has always meant the typewriter scan of the lattice—left to right and top to bottom. Of course this is not the only deterministic route through the grid. In Program 3 you can play with a few of the others.

Why should visitation order matter at all? As long as you touch every site exactly once, you might imagine that all sequences would produce the same result at the end of a macrostep. But it’s not so, and it’s not hard to see why. Whenever two sites are neighbors, the outcome of applying the Monte Carlo process can depend on which neighbor you visit first.

Consider the cruciform configuration at right. At first glance, you might assume that the dark central square will be unlikely to change its state. After all, the central square has four like-colored neighbors; if it were to flip, it would have four opposite-colored neighbors, and the energy associated with those spin-spin interactions would rise from \(-4\) to \(+4\). Any visitation sequence that went first to the central square would almost surely leave it unflipped. However, when the Metropolis algorithm comes tap-tap-tapping along in typewriter mode, the central cell does in fact change color, and so do all four of its neighbors. The entire structure is annihilated in a single sweep of the algorithm. (The erased pattern *does* leave behind a ghost—one of the diagonal neighbor sites flips from light to dark. But then that solitary witness disappears on the next sweep.)

To understand what’s going on here, just follow along as the algorithm marches from left to right and top to bottom through the lattice. When it reaches the central square of the cross, it has already visited (and flipped) the neighbors to the north and to the west. Hence the central square has two neighbors of each color, so that \(\Delta E = 0\). According to the *M*-rule, that square must be flipped from dark to light. The remaining two dark squares are now isolated, with only light neighbors, so they too flip when their time comes.

The underlying issue here is one of chronology—of past, present, and future. Each site has its moment in the present, when it surveys its surroundings and decides (based on the results of the survey) whether or not to change its state. But in that present moment, half of the site’s neighbors are living in the past—the typewriter algorithm has already visited them—and the other half are in the future, still waiting their turn.

A well-known alternative to the typewriter sequence might seem at first to avoid this temporal split decision. Superimposing a checkerboard pattern on the lattice creates two sublattices that do not communicate for purposes of the Ising model. Each black square has only white neighbors, and vice versa. Thus you can run through all the black sites (in any order; it really doesn’t matter), flipping spins as needed. Afterwards you turn to the white sites. These two half-scans make up one macrostep. Throughout the process, every site sees all of its neighbors in the same generation. And yet time has not been abolished. The black cells, in the first half of the sweep, see four neighboring sites that have not yet been visited. The white cells see neighbors that have already had their chance to flip. Again half the neighbors are in the past and half in the future, but they are distributed differently.

There are plenty of other deterministic sequences. You can trace successive diagonals; in Program 3 they run from southwest to northeast. There’s the ever-popular boustrophedonic order, following in the footsteps of the ox in the plowed field. More generally, if we number the sites consecutively from \(1\) to \(10{,}000\), any permutation of this sequence represents a valid visitation order, touching each site exactly once. There are \(10{,}000!\) such permutations, a number that dwarfs even the \(2^{10{,}000}\) configurations of the binary-valued lattice. The *permuted* choice in Program 3 selects one of those permutations at random; it is then used repeatedly for every macrostep until the program is reset. The *re-permuted* option is similar but selects a new permutation for each macrostep. The *random* selection is here for comparison with all the deterministic variations.

(There’s one final button labeled simultaneous, which I’ll explain below. If you just can’t wait, go ahead and press it, but I won’t be held responsible for what happens.)

The variations add some further novelties to the collection of curious visual effects seen in earlier simulations. The fluttering wings are back, in the diagonal as well as the typewriter sequences. Checkerboard has a different rhythm; I am reminded of a crowd of frantic commuters in the concourse of Grand Central Terminal. Boustrophedon is bidirectional: The millipede’s legs carry it both up and down or both left and right at the same time. Permuted is similar to checkerboard, but re-permuted is quite different.

The next question is whether these variant algorithms have any quantitative effect on the model’s dynamics. Figure 15 shows the response to a sudden freeze for seven visitation sequences. Five of them follow roughly the same arcing trajectory. Typewriter remains at the top of the heap, but checkerboard, diagonal, boustrophedon, and permuted are all close by, forming something like a comet tail. The random algorithm is much slower, which is to be expected given the results of earlier experiments.

The intriguing case is the re-permuted order, which seems to lie in the no man’s land between the random and the deterministic algorithms. Perhaps it belongs there. In earlier comparisons of the Metropolis and Glauber algorithms, I speculated that random visitation is slower to converge because many sites are passed over in each macrostep, while others are visited more than once. That’s not true of the re-permuted visitation sequence, which calls on every site exactly once, though in random order. The only difference between the permuted algorithm and the re-permuted one is that the former reuses the same permutation over and over, whereas re-permuted creates a new sequence for every macrostep. The faster convergence of the static permuted algorithm suggests there is some advantage to revisiting all the same sites in the same order, no matter what that order may be. Most likely this has something to do with sites that get switched back and forth repeatedly, on every sweep.

Now for the mysterious *simultaneous* visitation sequence. If you have not played with it yet in Program 3, I suggest running the following experiment. Select the typewriter sequence, press the *Run* button, reduce the temperature to 1.10 or 1.15, and wait until the lattice is all mauve or all indigo, with just a peppering of opposite-color dots. (If you get a persistent wide stripe instead of a clear field, raise the temperature and try again.) Now select the *simultaneous* visitation order.

This behavior is truly weird but not inexplicable. The algorithm behind it is one that I have always thought should be the best approach to a Monte Carlo Ising simulation. In fact it seems to be just about the worst.

All of the other visitation sequences are—as the term suggests they should be—*sequential*. They visit one site at a time, and differ only in how they decide where to go next. If you think about the Ising model as if it were a real physical process, this kind of serialization seems pretty implausible. I can’t bring myself to believe that atoms in a ferromagnet politely take turns in flipping their spins. And surely there’s no central planner of the sequence, no orchestra conductor on a podium, pointing a baton at each site when its turn comes.

Natural systems have an all-at-onceness to them. They are made up of many independent agents that are all carrying out the same kinds of activities at the same time. If we could somehow build an Ising model out of real atoms, then each cell or site would be watching the state of its four neighbors all the time, and also sensing thermal agitation in the lattice; it would decide to flip whenever circumstances favored that choice, although there might be some randomness to the timing. If we imagine a computer model of this process (yes, a model of a model), the most natural implementation would require a highly parallel machine with one processor per site.

Lacking such fancy hardware, I make due with fake parallelism. The *simultaneous* algorithm makes two passes through the lattice on every macrostep. On the first pass, it looks at the neighborhood of each site and decides whether or not to flip the spin, but it doesn’t actually make any changes to the lattice. Instead, it uses an auxiliary array to keep track of which spins are scheduled to flip. Then, after all sites have been surveyed in the first pass, the second pass goes through the lattice again, flipping all the spins that were designated in the first pass. The great advantage of this scheme is that it avoids the temporal oddities of working within a lattice where some spins have already been updated and others have not. In the *simultaneous* algorithm, all the spins make the transition from one generation to the next at the same instant.

When I first wrote a program to implement this scheme, almost 40 years ago, I didn’t really know what I was doing, and I was utterly baffled by the outcome. The information mechanics group at MIT (Ed Fredkin, Tommaso Toffoli, Norman Margolus, and Gérard Vichniac) soon came to my rescue and explained what was going on, but all these years later I still haven’t quite made my peace with it.

*does* flip, with the result that the new configuration is a mirror image of the previous one, with every up spin become a down and vice versa. When the next round begins, every site wants to flip again.

What continues to disturb me about this phenomenon is that I still think the simultaneous update rule is in some sense more natural or realistic than many of the alternatives. It is closer to how the world works—or how I imagine that it works—than any serial ordering of updates. Yet nature does not create magnets that continually swing between states that have the highest possible energy. (A 2002 paper by Gabriel Pérez, Francisco Sastre, and Rubén Medina attempts to rehabilitate the simultaneous-update scheme, but the blinking catastrophe remains pretty catastrophic.)

This is not the only bizarre behavior to be found in the dark corners of Monte Carlo Ising models. In the Metropolis algorithm,

I have not seen this high-temperature anomaly mentioned in published works on the Metropolis algorithm, although it must have been noticed many times over the years. Perhaps it’s not mentioned because this kind of failure will never be seen in physical systems. \(T = 1{,}000\) in the Ising model is \(370\) times the critical temperature; the corresponding temperature in iron would be over \(300{,}000\) kelvins. Iron boils at \(3{,}000\) kelvins.

The curves in Figure 15 and most of the other graphs above are averages taken over hundreds of repetitions of the Monte Carlo process. The averaging operation is meant to act like sandpaper, smoothing out noise in the curves, but it can also grind away interesting features, replacing a population of diverse individuals with a single homogenized exemplar. Figure 17 shows six examples of the lumpy and jumpy trajectories recorded during single runs of the program:

In these squiggles, magnetization does not grow smoothly or steadily with time. Instead we see spurts of growth followed by plateaus and even episodes of retreat. One of the Metropolis runs is slower than the three Glauber examples, and indeed makes no progress toward a magnetized state. Looking at these plots, it’s tempting to explain them away by saying that the magnetization measurements exhibit high variance. That’s certain true, but it’s not the whole story.

Figure 18 shows the distribution of times needed for a Metropolis Ising model to reach a magnetization of \(0.85\) in response to a sudden shift from \(T = 10\) to \(T= 2\). The histogram records data from \(10{,}000\) program runs, expressing convergence time in Monte Carlo macrosteps.

The median of this distribution is \(451\) macrosteps; in other words, half of the runs concluded in \(451\) steps or fewer. But the other half of the population is spread out over quite a wide range. Runs of \(10\) times the median length are not great rareties, and the blip at the far right end of the \(x\) axis represents the \(59\) runs that had still not reached the threshold after \(10{,}000\) macrosteps (where I stopped counting). This is a heavy-tailed distribution, which appears to be made up of two subpopulations. In one group, forming the sharp peak at the left, magnetization is quick and easy, but members of the other group are recalcitrant, holding out for thousands of steps. I have a hypothesis about what distinguishes those two sets. The short-lived ones are ponds; the stubborn ones that overstay their welcome are rivers.

When an Ising system cools and becomes fully magnetized, it goes from a salt-and-pepper array of tiny clusters to a monochromatic expanse of one color or the other. At some point during this process, there must be a moment when the lattice is divided into exactly two regions, one light and one dark.

I believe the correct answer has to do with the concepts of inside and outside, connected and disconnected, open sets and closed sets—but I can’t articulate these ideas in a way that would pass mathematical muster. I want to say that the indigo pond is a bounded region, entirely enclosed by the unbounded mauve continent. But the wraparound lattice make it hard to wrap your head around this notion. The two images in Figure 20 show exactly the same object as Figure 19, the only difference being that the origin of the coordinate system has moved, so that the center of the disk seems to lie on an edge or in a corner of the lattice. The indigo pond is still surrounded by the mauve continent, but it sure doesn’t look that way. In any case, why should boundedness determine which area survives the Monte Carlo process?

For me, the distinction between inside and outside began to make sense when I tried taking a more “local” view of the boundaries between regions, and the curvature of those boundaries. As noted in connection with Figure 11, boundaries are places where you can expect to find neutral lattice sites (i.e., \(\Delta E = 0\)), which are the only sites where a spin is likely to change orientation at low temperature.

I’ll spare you the trouble of counting the dots in Figure 21: There are 34 orange ones inside the pond but only 30 green ones outside. That margin could be significant. Because the dotted cells are likely to change state, the greater abundance of orange dots means there are more indigo cells ready to turn mauve than mauve cells that might become indigo. If the bias continues as the system evolves, the indigo region will steadily lose area and eventually be swallowed up.

But is there any reason to think the interior of the pond will *always* have a surplus of neutral sites susceptible to flipping?

What if the shape becomes a little more complicated? Perhaps the square pond grows a protuberance on one side, and an invagination on another, as in Figure 23. *Aha!* moment. There’s a conservation law, I thought. No matter how you alter the outline of the pond, the neutral sites inside will outnumber those outside by four.

This notion is not utterly crazy. If you walk clockwise around the boundary of the simple square pond in Figure 22, you will have made four right turns by the time you get back to your starting point. Each of those right turns creates a neutral cell in the interior of the pond—we’ll call them *innie turns*—where you can place an orange dot. A clockwise circuit of the modified pond in Figure 23, with its excrenscences and increscences, requires some left turns as well as right turns. *outie turn*—earning a green dot. But for every outie turn added to the perimeter, you’ll have to make an additional innie turn in order to get back to your starting point. Thus, except for the four original corners, innie and outie turns are like particles and antiparticles, always created and annihilated in pairs. A closed path, no matter how convoluted, always has exactly four more innies than outies. The four-turn differential is again on exhibit in the more elaborate example of Figure 24, where the orange dots prevail 17 to 13. Indeed, I assert that there are *always* four more innie turns than outie turns on the perimeter of any simple (*i.e.*, non-self-intersecting) closed curve on the square lattice. (I think this is one of those statements that is obviously true but not so simple to prove, like the claim that every simple closed curve on the plane has an inside and an outside.)

Unfortunately, even if the statement about counting right and left turns is true, the corresponding statement about orange and green dots is not. *not* come in matched innie/outie pairs. In this case there are more green dots than orange ones, which might be taken to suggest that the indigo area will grow rather than shrink.

In my effort to explain why ponds always evaporate, I seem to have reached a dead end. I should have known from the outset that the effort was doomed. I can’t prove that ponds *always* shrink because they don’t. The system is ergodic: Any state can be reached from any other state in a finite number of steps. In particular, a single indigo cell (a very small pond) can grow to cover the whole lattice. The sequence of steps needed to make that happen is improbable, but it certainly exists.

If proof is out of reach, maybe we can at least persuade ourselves that the pond shrinks with high probability. And we have a tool for doing just that: It’s called the Monte Carlo method. Figure 26 follows the fate of a \(25 \times 25\) square pond embedded in an otherwise blank lattice of \(100 \times 100\) sites, evolving under Glauber dynamics at a very low temperature \((T = 0.1)\). The uppermost curve, in indigo, shows the steady evaporation of the pond, dropping from the original \(625\) sites to \(0\) after about \(320\) macrosteps. The middle traces record the abundance of Swiss sites, orange for those inside the pond and green for those outside. Because of the low temperature, these are the only sites that have any appreciable likelihood of flipping. The black trace at the bottom is the difference between orange and green. For the most part it hovers at \(+4\), never exceeding that value but seldom falling much below it, and making only one brief foray into negative territory. Statistically speaking, the system appears to vindicate the innie/outie hypothesis. The pond shrinks because there are almost always more flippable spins inside than outside.

Figure 26 is based on a single run of the Monte Carlo algorithm. Figure 27 presents essentially the same data averaged over \(1{,}000\) Monte Carlo runs under the same conditions—starting with a \(25 \times 25\) square pond and applying Glauber dynamics at \(T = 0.1\).

The pond’s loss of area follows a remarkably linear path, with a steady rate very close to two lattice sites per Monte Carlo macrostep. And it’s clear that virtually all of these pondlike blocks of indigo cells disappear within a little more than \(300\) macrosteps, putting them in the tall peak at the short end of the lifetime distribution in Figure 18. None of them contribute to the long tail that extends out past \(10{,}000\) steps.

So much for the quick-drying ponds. What about the long-flowing rivers?

When the two sides join, everything changes. It’s not just a matter of adjusting the shape and size of the rectangle. There is no more rectangle! By definition, a rectangle has four sides and four right-angle corners. The object now occupying the lattice has only two sides and no corners. It may appear to have corners at the far left and right, but that’s an artifact of drawing the figure on a flat plane. It really lives on a torus, and the band of indigo cells is like a ring of chocolate icing that goes all the way around the doughnut. Or it’s a river—an endless river. You can walk along either bank as far as you wish, and you’ll never find a place to cross.

The topological difference between a pond and a river has dire consequences for Monte Carlo simulations of the Ising model. When the rectangle’s four corners disappeared, so did the four orange dots marking interior Swiss cells. Indeed, the river has no neutral cells at all, neither inside nor outside. At temperatures near zero, where neutral cells are the only ones that ever change state, the river becomes an all-but-eternal feature. The Monte Carlo process has no effect on it. The system is stuck in a metastable state, with no practical way to reach the lower-energy state of full magnetization.

When I first noticed how a river can block magnetization, I went looking to see what others might have written about the phenomonon. I found nothing. There was lots of talk about metastability in general, but none of the sources I consulted mentioned this particular topological impediment. I began to worry that I had made some blunder in programming or had misinterpreted what I was seeing. Finally I stumbled on a 2002 paper by Spirin, Krapivsky, and Redner that reports essentially the same observations and extends the discussion to three dimensions, where the problem is even worse.

A river with perfectly straight banks looks rather unnatural—more like a canal.

But that’s not what happens. The lower part of Figure 29 shows the same stretch of river after \(1{,}000\) Monte Carlo macrosteps at \(T = 0.1\). The algorithm has not amplified the sinuosity; on the contrary, it has shaved off the peaks and filled in the troughs, generally flattening the terrain. After \(5{,}000\) steps the river has returned to a perfectly straight course. No neutral cells remain, so no further change can be expected in any human time frame.

The presence or absence of four corners makes all the difference between ponds and rivers. Ponds shrink because the corners create a consistent bias: Sites subject to flipping are more numerous inside than outside, which means, over the long run, that the outside steadily encroaches on the inside’s territory. That bias does not exist for rivers, where the number of interior and exterior neutral sites is equal on average. Figure 30 records the inside-minus-outside difference for the first \(1{,}000\) steps of a simulation beginning with a sinusoidal river.

The difference hovers near zero, though with short-lived excursions both above and below; the mean value is \(+0.062\).

Even at somewhat higher temperatures, any pattern that crosses from one side of the grid to the other will stubbornly resist extinction. Figure 31 shows snapshots every \(1{,}000\) macrosteps in the evolution of a lattice at \(T = 1.0\), which is well below the critical temperature but high enough to allow a few energetically unfavorable spin flips. The starting configuration was a sinusoidal river, but by \(1{,}000\) steps it has already become a thick, lumpy band. In subsequent snapshots the ribbon grows thicker and thinner, migrates up and down—and then abruptly disappears, sometime between the \(8{,}000\)th and the \(9{,}000\)th macrostep.

Swiss cells, with equal numbers of friends and enemies among their neighbors, appear wherever a boundary line takes a turn. All the rest of the sites along a boundary—on both sides—have three friendly neighbors and one enemy neighbor. At a site of this kind, flipping a spin carries an energy penalty of \(\Delta E = +4\). At \(T = 1.0\) the probability of such an event is roughly \(1/50\). In a \(10{,}000\)-site lattice crossed by a river there can be as many as \(200\) of these three-against-one sites, so we can expect to see a few of them flip during every macrostep. Thus at \(T = 1.0\) the river is not a completely static formation, as it is at temperatures closer to zero. The channel can shift or twist, grow wider or narrower. But these motions are glacially slow, not only because they depend on somewhat rare events but also because the probabilities are unbiased. At every step the river is equally likely to grow wider or narrower; on average, it goes nowhere.

In one last gesture to support my claim that short-lived patterns are ponds and long-lived patterns are rivers I offer Figure 32:

A troubling question is whether these uncrossable rivers that block full magnetization in Ising models also exist in physical ferromagnets. It seems unlikely. The rivers I describe above owe their existence to the models’ wraparound boundary conditions. The crystal lattices of real magnetic materials do not share that topology. Thus it seems that metastability may be an artifact or a mere incidental feature of the model, not something present in nature.

Statistical mechanics is generally formulated in terms of systems without boundaries. You construct a theory of \(N\) particles, but it’s truly valid only in the “thermodynamic limit,” where \(N \to \infty\). Under this regime the two-dimensional Ising model would be studied on a lattice extending over an infinite plane. Computer models can’t do that, and so we wind up with tricks like wraparound boundary conditions, which can be considered a hack for faking infinity.

It’s a pretty good hack. As in an infinite lattice, every site has the same local environment, with exactly four neighbors, who also have four neighbors, and so on. There are no edges or corners that require special treatment. For these reasons wraparound or periodic boundary conditions have always been the most popular choice for computational models in the sciences, going back at least as far as 1953. Still, there are glitches. If you were standing on a wraparound lattice, you could walk due north forever, but you’d keep passing your starting point again and again. If you looked into the distance, you’d see the back of your own head. For the Ising model, perhaps the most salient fact is this: On a genuinely infinite plane, every simple, finite, closed curve is a pond; no finite structure can behave like a river, transecting the entire surface so that you can’t get around it. Thus the wraparound model differs from the infinite one in ways that may well alter important conclusions.

These defects are a little worrisome. On the other hand, physical ferromagnets are also mere finite approximations to the unbounded spaces of thermodynamic theory. A single magnetic domain might have \(10^{20}\) atoms, which is large compared with the \(10^4\) sites in the models presented here, but it’s a long ways short of infinity. The domains have boundaries, which can have a major influence on their properties. All in all, it seems like a good idea to explore the space of possibile boundary conditions, including some alternatives to the wraparound convention. Hence Program 4:

An extra row of cells around the perimeter of the lattice serves to make the boundary conditions visible in this simulation. The cells in this halo layer are not active participants in the Ising process; they serve as neighbors to the cells on the periphery of the lattice, but their own states are not updated by the Monte Carlo algorithm. To mark their special role, their up and down states are indicated by red and pink instead of indigo and mauve.

The behavior of wraparound boundaries is already familiar. If you examine the red/pink stripe along the right edge of the grid, you will see that it matches the leftmost indigo/mauve column. Similar relations determine the patterns along the other edges.

The two simplest alternatives to the wraparound scheme are static borders made up of cells that are always up or always down. You can probably guess how they will affect the outcome of the simulation. Try setting the temperature around 1.5 or 2.0, then click back and forth between *all up* and *all down* as the program runs. The border color quickly invades the interior space, encircling a pond of the opposite color and eventually squeezing it down to nothing. Switching to the opposite border color brings an immediate re-enactment of the same scene with all colors reversed. The biases are blatant.

Another idea is to assign the border cells random values, chosen independently and with equal probability. A new assignment is made after every macrostep. Randomness is akin to high temperature, so this choice of boundary condition amounts to an Ising lattice surrounded by a ring of fire. There is no bias in favor of up or down, but the stimulation from the sizzling periphery creates recurrent disturbances even at temperatures near zero, so the system never attains a stable state of full magnetization.

Before I launched this project, my leading candidate for a better boundary condition was a zero border. This choice is equivalent to an “open” or “free” boundary, or to no boundary at all—a universe that just ends in blankness. Implementing open boundaries is slightly irksome because cells on the verge of nothingness require special handling: Those along the edges have only three neighbors, and those in the corners only two. A zero boundary produces the same effect as a free boundary without altering the neighbor-counting rules. The cells of the outer ring all have a numerical value of \(0\), indicated by gray. For the interior cells with numerical values of \(+1\) and \(-1\), the zero cells act as neighbors without actually contributing to the \(\Delta E\) calculations that determine whether or not a spin flips.

The zero boundary introduces no bias favoring up or down, it doesn’t heat or cool the system, and it doesn’t tamper with the topology, which remains a simple square embedded in a flat plane. Sounds ideal, no? However, it turns out the zero boundary has a lot in common with wraparound borders. In particular, it allows persistent rivers to form—or maybe I should call them lakes. I didn’t see this coming before I tried the experiment, but it’s not hard to understand what’s happening. On the wraparound lattice, elongating a rectangle until two opposite edges meet eliminates the Swiss cells at the four corners. The same thing happens when a rectangle extends all the way across a lattice with zero borders. The corner cells, now up against the border, no longer have two friendly and two enemy neighbors; instead they have two friends, one enemy, and one cel of spin zero, for a net \(\Delta E\) of \(+1\).

A pleasant surprise of these experiments was the boundary type I have labeled *sampled*. The idea is to make the boundary match the statistics of the interior of the lattice, but without regard to the geometry of any patterns there. For each border cell \(b\) we select an interior cell \(s\) at random, and assign the color of \(s\) to \(b\). The procedure is repeated after each macrostep. The border therefore maintains the same up/down proportion as the interior lattice, and always favors the majority. If the spins are evenly split between mauve and indigo, the border region shows no bias; as soon as the balance begins to tip, however, the border shifts in the same direction, supporting and hastening the trend.

If you tend to root for the underdog, this rule is not for you—but we can turn it upside down, assigning a color opposite that of a randomly chosen interior cell. The result is interesting. Magnetization is held near \(0\), but at low temperature the local correlation coefficient approaches \(1\). The lattice devolves into two large blobs of no particular shape that circle the ring like wary wrestlers, then eventually reach a stable truce in which they split the territory either vertically or horizontally. This behavior has no obvious bearing on ferromagnetism, but maybe there’s an apt analogy somewhere in the social or political sciences.

The curves in Figure 33 record the response to a sudden temperature step in systems using each of six boundary conditions. The all-up and all-down boundaries converge the fastest—which is no surprise, since they put a thumb on the scale. The response of the sampled boundary is also quick, reflecting its weathervane policy of supporting the majority. The random and zero boundaries are the slowest; they follow identical trajectories, and I don’t know why. Wraparound is right in the middle of the pack. All of these results are for Glauber dynamics, but the curves for the Metropolis algorithm are very similar.

The menu in Program 4 has one more choice, labeled *twisted*. I wrote the code for this one in response to the question, “I wonder what would happen if…?” Twisted is the same as wraparound, except that one side is given a half-twist before it is mated with the opposite edge. Thus if you stand on the right edge near the top of the lattice and walk off to the right, you will re-enter on the left near the bottom. The object formed in this way is not a torus but a Klein bottle—a “nonorientable surface without boundary.” All I’m going to say about running an Ising model on this surface is that the results are not nearly as weird as I expected. See for yourself.

I have one more toy to present for your amusement: the MCMC microscope. It was the last program I wrote, but it should have been the first.

All of the programs above produce movies with one frame per macrostep. In that high-speed, high-altitude view it can be hard to see how individual lattice sites are treated by the algorithm. The MCMC microscope provides a slo-mo close-up, showing the evolution of a Monte Carlo Ising system one microstep at a time. The algorithm proceeds from site to site (in an order determined by the visitation sequence) and either flips the spin or not (according to the acceptance function).

As the algorithm proceeds, the site currently under examination is marked by a hot-pink outline. Sites that have yet to be visited are rendered in the usual indigo or mauve; those that have already had their turn are shown in shades of gray. The *Microstep* button advances the algorithm to the next site (determined by the visitation sequence) and either flips the spin or leaves it as-is (according to the acceptance function). The *Macrostep* button performs a full sweep of the lattice and then pauses; the *Run* button invokes a continuing series of microsteps at a somewhat faster pace. Some adjustments to this protocol are needed for the simultaneous update option. In this mode no spins are changed during the scan of the lattice, but those that *will* change are marked with a small square of constrasting gray. At the end of the macrostep, all the changes are made at once.

The *Dotted Swiss* checkbox paints orange and green dots on neutral cells (those with equal numbers of friendly and enemy neighbors). Doodle mode allows you to draw on the lattice via mouse clicks and thereby set up a specific initial pattern.

I’ve found it illuminating to draw simple geometric figures in doodle mode, then watch as they are transformed and ultimately destroyed by the various algorithms. These experiments are particularly interesting with the Metropolis algorithm at very low temperature. Under these conditions the Monte Carlo process—despite its roots in randomness—becomes very nearly deterministic. Cells with \(\Delta E \le 0\) always flip; other cells never do. (What, never? Well, hardly ever.) Thus we can speak of what happens when a program is run, rather than just describing the probability distribution of possible outcomes.

Here’s a recipe to try: Set the temperature to its lower limit, choose doodle mode, *down* initialization, typewriter visitation, and the *M*-rule acceptance function. Now draw some straight lines on the grid in four orientations: vertical, horizontal, and along both diagonals. Each line can be six or seven cells long, but don’t let them touch. Lines in three of the four orientations are immediately erased when the program runs; they disappear after the first macrostep. The one survivor is the diagonal line oriented from lower left to upper right, or southwest to northeast. With each macrostep the line migrates one cell to the left, and also loses one site at the bottom. This combination of changes gives the subjective impression that the pattern is moving not only left but also upward. I’m pretty sure that this phenomenon is responsible for the fluttering wings illusion seen at much higher temperatures (and higher animation speeds).

If you perform the same experiment with the diagonal visitation order, you’ll see exactly the same outcomes. A question I can’t answer is whether there is any pattern that serves to discriminate between the typewriter and diagonal orders. What I’m seeking is some arrangement of indigo cells on a mauve background that I could draw on the grid and then look away while you ran one algorithm or the other for some fixed number of macrosteps (which I get to specify). Afterwards, I win if I can tell which visitation sequence you chose.

The checkerboard algorithm is also worth trying with the four line orientations. The eventual outcome is the same, but the intermediate stages are quite different.

Finally I offer a few historical questions that seem hard to settle, and some philosophical musings on what it all means.

The name, of course, is an allusion to the famous casino, a prodigious producer and consumer of randomness. Nicholas Metropolis claimed credit for coming up with the term. In a 1987 retrospective he wrote:

It was at that time [spring of 1947] that I suggested an obvious name for the statistical method—a suggestion not unrelated to the fact that Stan [Ulam] had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo.”

An oddity of this story is that Metropolis was not at Los Alamos in 1947. He left after the war and didn’t return until 1948.

Ulam’s account of the matter does not contradict the Metropolis version, *does* contradict Metropolis, writing: “The basic idea, as well as the name was due to Stan Ulam originally.” But Rosenbluth wasn’t at Los Alamos in 1947 either.

It seems to me that the name Monte Carlo contributed very much to the popularization of this procedure. It was named Monte Carlo because of the element of chance, the production of random numbers with which to play the suitable games.

Note the anonymous passive voice: “It was named…,” with no hint of by whom. If Ulam was so carefully noncommittal, who am I to insist on a definite answer?

As far as I know, the phrase “Monte Carlo method” first appeared in public print in 1949, in an article co-authored by Metropolis and Ulam. Presumably the term was in use earlier among the denizens of the Los Alamos laboratory. Daniel McCracken, in a 1955 *Scientific American* article, said it was a code word invented for security reasons. This is not implausible. Code words were definitely a thing at Los Alamos (the place itself was designated “Project Y”), but I’ve never seen the code word status of “Monte Carlo” corroborated by anyone with first-hand knowledge.

To raise the question, of course, is to hint that it was not Metropolis.

The 1953 paper that introduced Markov chain Monte Carlo, “Equation of State Calculations by Fast Computing Machines,” had five authors, who were listed in alphabetical order: Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. The two Rosenbluths were wife and husband, as were the two Tellers. Who did what in this complicated collaboration? Apparently no one thought to ask that question until 2003, when J. E. Gubernatis of Los Alamos was planning a symposium to mark the 50th anniversary of MCMC. He got in touch with Marshall Rosenbluth, who was then in poor health. Nevertheless, Rosenbluth attended the gathering, gave a talk, and sat for an interview. (He died a few months later.)

According to Rosenbluth, the basic idea behind MCMC—sampling the states of a system according to their Boltzmann weight, while following a Markov chain from one state to the next—came from Edward Teller. Augusta Teller wrote a first draft of a computer program to implement the idea. Then the Rosenbluths took over. In particular, it was Arianna Rosenbluth who wrote the program that produced all the results reported in the 1953 paper. Gubernatis adds:

Marshall’s recounting of the development of the Metropolis algorithm first of all made it very clear that Metropolis played no role in its development other than providing computer time.

In his interview, Rosenbluth was even blunter: “Metropolis was boss of the computer laboratory. We never had a single scientific discussion with him.”

These comments paint a rather unsavory portrait of Metropolis as a credit mooch. I don’t know to what extent that harsh verdict might be justified. In his own writings, Metropolis makes no overt claims about his contributions to the work. On the other hand, he also makes no *dis*claimers; he never suggests that someone else’s name might be more appropriately attached to the algorithm.

An interesting further question is who actually wrote the 1953 paper—who put the words together on the page. Internal textual evidence suggests there were at least two writers. Halfway through the article there’s a sudden change of tone, from gentle exposition to merciless technicality.

In recent years the algorithm has acquired the hyphenated moniker Metropolis-Hastings, acknowledging the contributions of W. Keith Hastings, a Canadian mathematician and statistician. Hastings wrote a 1970 paper that generalized the method, showing it could be applied to a wider class of problems, with probability distributions other than Boltzmann’s. Hastings is also given credit for rescuing the technique from captivity among the physicists and bringing it home to statistics, although it was another 20 years before the statistics community took much notice.

I don’t know who started the movement to name the generalized algorithm “Metropolis-Hastings.” The hyphenated term was already fairly well established by 1995, when Siddhartha Chib and Edward Greenberg put it in the title of a review article.

In this case there is no doubt or controversy about authorship. Glauber wrote the 1963 paper, and he did the work reported in it. On the other hand, Glauber did *not* invent the Monte Carlo algorithm that now goes by the name “Glauber dynamics.” His aim in tackling the Ising model was to find exact, mathematical solutions, in the tradition of Ising and Onsager. (Those two authors are the only ones cited in Glauber’s paper.) He never mentions Monte Carlo methods or any other computational schemes.

So who *did* devise the algorithm? The two main ingredients—the *G*-rule and the random visitation sequence—were already on the table in the 1950s. A form of the *G*-rule acceptance function \(e^{-\Delta E/T} / (1 + e^{-\Delta E/T})\) was proposed in 1954 by John G. Kirkwood of Yale University, a major figure in statistical mechanics at midcentury. He suggested it to the Los Alamos group as an alternative to the *M*-rule. Although the suggestion was not taken, the group did acknowledge that it would produce valid simulations. The random visitation sequence was used in a followup study by the Los Alamos group in 1957. (By then the group was led by William W. Wood, who had been a student of Kirkwood.)

Those two ingredients first came together a few years later in work by P. A. Flinn and G. M. McManus, who were then at Westinghouse Research in Pittsburgh. Their 1961 paper describes a computer simulation of an Ising model with both random visitation order and the \(e^{-\Delta E/T} / (1 + e^{-\Delta E/T})\) acceptance function, two years before Glauber’s article appeared. On grounds of publication priority, shouldn’t the Monte Carlo variation be named for Flinn and McManus rather than Glauber?

For a while, it was. There were dozens of references to Flinn and McManus throughout the 1960s and 70s. For example, an article by G. W. Cunningham and P. H. E. Meijer compared and evaluated the two main MCMC methods, identifying them as algorithms introduced by “Metropolis *et al*.” and by “Flinn and McManus.” A year later another compare-and-contrast article by John P. Valleau and Stuart G. Whittington adopted the same terminology. Neither of these articles mentions Glauber.

According to Semantic Scholar, the phrase “Glauber dynamics” first appeared in the physics literature in 1977, in an article by Ph. A. Martin. But this paper is a theoretical work, with no computational component, along the same lines as Glauber’s own investigation. Among the Semantic Scholar listings, “Glauber dynamics” was first mentioned in the context of Monte Carlo studies by A. Sadiq and Kurt Binder, in 1984. After that, the balance shifted strongly toward Glauber.

In bringing up the disappearance of Flinn and McManus from the Ising and Monte Carlo literature, I don’t mean to suggest that Glauber doesn’t deserve his recognition. His main contribution to studies of the Ising model—showing that it could give useful results away from equilibrium—is of the first importance. On the other hand, attaching his name to a Monte Carlo algorithm is unhelpful. If you turn to his 1963 paper to learn about the origin of the algorithm, you’ll be disappointed.

One more oddity. I have been writing the *G*-rule as

\[\frac{e^{-\Delta E/T}}{1 + e^{-\Delta E/T}},\]

which is the way it appeared in Flinn and McManus, as well as in many recent accounts of the algorithm. However, nothing resembling this expression is to be found in Glauber’s paper. Instead he defined the rule in terms of the hyperbolic tangent. Reconstructing various bits of his mathematics in a form that could serve as a Monte Carlo acceptance function, I come up with:

\[\frac{1}{2}\left(1 -\tanh \frac{\Delta E}{2 T}\right).\]

The two expressions are mathematically synonymous, but the prevalence of the first form suggests that some authors who cite Glauber rather than Flinn and McManus are not getting their notation from the paper they cite.

When I first heard of the Ising model, sometime in the 1970s, I would read statements along the lines of “as the system cools to the critical temperature, fluctuations grow in scale until they span the entire lattice.” I wanted to *see* what that looked like. What kinds of patterns or textures would appear, and how they would evolve over time? In those days, live motion graphics were too much to ask for, but it seemed reasonable to expect at least a still image, or perhaps a series of them covering a range of temperatures.

In my reading, however, I found no pictures. Part of the reason was surely technological. Turning computations into graphics wasn’t so easy in those days. But I suspect another motive as well. A computational scientist who wanted to be taken seriously was well advised to focus on quantitative results. A graph of magnetization as a function of temperature was worth publishing; a snapshot of a single lattice configuration might seem frivolous—not real physics but a plaything like the Game of Life. Nevertheless, I still yearned to see what it would look like.

In 1979 I had an opportunity to force the issue. I was working with Kenneth G. Wilson, a physicist then at Cornell University, on a *Scientific American* article about “Problems in Physics with Many Scales of Length.” The problems in question included the Ising model, and I asked Wilson if he could produce pictures showing spin configurations at various temperatures. He resisted; I persisted; a few weeks later I received a fat envelope of fanfold paper, covered in arrays of \(1\)s and \(0\)s. With help from the *Scientific American* art department the numbers were transformed into black and white squares:

This particular image, one of three we published, is at the critical temperature. Wilson credited his colleagues Stephen Shenker and Jan Tobochnik for writing the program that produced it.

The lattice pictures made by Wilson, Shenker, and Tobochnik were the first I had ever seen of an Ising model at work, but they were not the first to be published. In recent weeks I’ve discovered a 1974 paper by P. A. Flinn in which black-and-white spin tableaux form the very centerpiece of the presentation. Flinn discusses aspects of the appearance of these grids that would be very hard to reduce to simple numerical facts:

Phase separation may be seen to occur by the formation and growth of clusters, but they look rather more like “seaweed” than like the roughly round clusters of traditional theory. The structures look somewhat like those observed in phase-separated glass.

I also found one even earlier instance of lattice diagrams, in a 1963 paper by J. R. Beeler, Jr., and J. A. Delaney. Are they the first?

Modeling calls for a curious mix of verisimilitude and fakery. A miniature steam locomotive chugging along the tracks of a model railroad reproduces in meticulous detail the pistons and linkage rods that drive the wheels of the real locomotive. But in the model it’s the wheels that impart motion to the links and pistons, not the other way around. The model’s true power source is hidden—an electric motor tucked away inside, where the boiler ought to be.

Scientific models also rely on shortcuts and simplifications. In a physics textbook you will meet the ideal gas, the frictionless pendulum, the perfectly elastic spring, the falling body that encounters no air resistance, the planet whose entire mass is concentrated at a dimensionless point. Such idealizations are not necessarily defects. By brushing aside irrelevant details, a good model allows a deeper truth to shine through. The problem, of course, is that some details are not irrelevant.

The Ising model is a fascinating case study in this process. Lenz and Ising set out to explain ferromagnetism, and almost all later discussions of the model (including the one you are reading right now) put some emphasis on that connection. The original aim was to find the simplest framework that would exhibit important properties of real ferromagnets, most notably the sudden onset of magnetization at the Curie temperature. As far as I can tell, the Ising model has failed in this respect. Some of the omitted details were of the essence; quantum mechanics just won’t go away, no matter how much we might like it to. These days, serious students of magnetism seem to have little interest in simple grids of flipping spins. A 2006 review of “modeling, analysis, and numerics of ferromagnetism,” by Martin Kružík and Andreas Prohl, doesn’t even mention the Ising model.

Yet the model remains wildly popular, the subject of hundreds of papers every year. Way back in 1967, Stephen G. Brush wrote that the Ising model had become “the preferred basic theory of all cooperative phenomena.” I’d go even further. I think it’s fair to say the Ising model has become an object of study for its own sake. The quest is to understand the phase diagram of the Ising system itself, whether or not it tells us anything about magnets or other physical phenomena.

Uprooting the Ising system from its ancestral home in physics leaves us with a model that is not a model *of* anything. It’s like a map of an imaginary territory; there is no ground truth. You can’t check the model’s accuracy by comparing its predictions with the results of experiments.

Seeing the Ising model as a free-floating abstraction, untethered from the material world, is a prospect I find exhilarating. We get to make our own universe—and we’ll do it right this time, won’t we! However, losing touch with physics is also unsettling. On what basis are we to choose between versions of the model, if not through fidelity to nature? Are we to be guided only by taste or convenience? A frequent argument in support of Glauber dynamics is that it seems more “natural” than the Metropolis algorithm. I would go along with that judgment: The random visitation sequence and the smooth, symmetrical curve of the *G*-rule both seem more like something found in nature than the corresponding Metropolis apparatus. But does naturalness matter if the model is solely a product of human imagination?

Bauer, W. F. 1958. The Monte Carlo method. *Journal of the Society for Industrial and Applied Mathematics* 6(4):438–451.

http://www.cs.fsu.edu/~mascagni/Bauer_1959_Journal_SIAM.pdf

Beeler, J. R. Jr., and J. A. Delaney. 1963. Order-disorder events produced by single vacancy migration. *Physical Review* 130(3):962–971.

Binder, Kurt. 1985. The Monte Carlo method for the study of phase transitions: a review of some recent progress. *Journal of Computational Physics* 59:1–55.

Binder, Kurt, and Dieter W. Heermann. 2002. *Monte Carlo Simulation in Statistical Physics: An Introduction*. Fourth edition. Berlin: Springer-Verlag.

Brush, Stephen G. 1967. History of the Lenz-Ising model. *Reviews of Modern Physics* 39:883-893.

Chib, Siddhartha, and Edward Greenberg. 1995. Understanding the Metropolis-Hastings Algorithm. *The American Statistician* 49(4): 327–335.

Cipra, Barry A. 1987. An introduction to the Ising model. *American Mathematical Monthly* 94:937-959.

Cunningham, G. W., and P. H. E. Meijer. 1976. A comparison of two Monte Carlo methods for computations in statistical mechanics. *Journal Of Computational Physics* 20:50-63.

Davies, E. B. 1982. Metastability and the Ising model. *Journal of Statistical Physics* 27(4):657–675.

Diaconis, Persi. 2009. The Markov chain Monte Carlo revolution. *Bulletin of the American Mathematical Society* 46(2):179–205.

Eckhardt, R. 1987. Stan Ulam, John von Neumann, and the Monte Carlo method. *Los Alamos Science* 15:131–137.

Flinn, P. A., and G. M. McManus. 1961. Monte Carlo calculation of the order-disorder transformation in the body-centered cubic lattice. *Physical Review* 124(1):54–59.

Flinn, P. A. 1974. Monte Carlo calculation of phase separation in a two-dimensional Ising system. *Journal of Statistical Physics* 10(1):89–97.

Fosdick, L. D. 1963. Monte Carlo computations on the Ising lattice. *Methods in Computational Physics* 1:245–280.

Geyer, Charles J. 2011. Intorduction to Markov chain Monte Carlo. In *Handbook of Markov Chain Monte Carlo*, edited by Steve Brooks, Andrew Gelman, Galin Jones and Xiao-Li Meng, pp. 3–48. Boca Raton: Taylor & Francis.

Glauber, R. J., 1963. Time-dependent statistics of the Ising model. *Journal of Mathematical Physics* 4:294–307.

Gubernatis, James E. (editor). 2003. *The Monte Carlo Method in the Physical Sciences: Celebrating the 50th Anniversary of the Metropolis Algorithm*. Melville, N.Y.: American Institute of Physics.

Halton, J. H. 1970. A retrospective and prospective survey of the Monte Carlo method. *SIAM Review* 12(1):1–63.

Hammersley, J. M., and D. C. Handscomb. 1964. *Monte Carlo Methods*. London: Chapman and Hall.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. *Biometrika* 57(1):97–109.

Hayes, Brian. 2000. Computing science: The world in a spin. *American Scientist*, Vol. 88, No. 5, September-October 2000, pages 384-388. http://bit-player.org/bph-publications/AmSci-2000-09-Hayes-Ising.pdf

Hitchcock, David B. 2003. A history of the Metropolis–Hastings algorithm. *The American Statistician* 57(4):254–257. https://doi.org/10.1198/0003130032413

Hurd, Cuthbert C. 1985. A note on early Monte Carlo computations and scientific meetings. *Annals of the History of Computing* 7(2):141–155.

Ising, Ernst. 1925. Beitrag zur Theorie des Ferromagnetismus. *Zeitschrift für Physik* 31:253–258.

Janke, Wolfhard, Henrik Christiansen, and Suman Majumder. 2019. Coarsening in the long-range Ising model: Metropolis versus Glauber criterion. *Journal of Physics: Conference Series*, Volume 1163, International Conference on Computer Simulation in Physics and Beyond 24–27 September 2018, Moscow, Russian Federation. https://iopscience.iop.org/article/10.1088/1742-6596/1163/1/012002

Kobe, S. 1995. Ernst Ising—physicist and teacher. Actas: Noveno Taller Sur de Fisica del Solido, Misión Boroa, Chile, 26–29 April 1995. http://arXiv.org/cond-mat/9605174

Kružík, Martin, and Andreas Prohl. 2006. Recent developments in the modeling, analysis, and numerics of ferromagnetism. *SIAM Review* 48(3):439–483.

Liu, Jun S. 2004. *Monte Carlo Strategies in Scientific Computing*. New York: Springer Verlag.

Lu, Wentao T., and F. Y. Fu. 2000. Ising model on nonorientable surfaces: Exact solution for the Möbius strip and the Klein bottle. arXiv: cond-mat/0007325

Martin, Ph. A. 1977. On the stochastic dynamics of Ising models. *Journal of Statistical Physics* 16(2):149–168.

McCoy, Barry M., and Jean-Marie Maillard. 2012. The importance of the Ising model. *Progress in Theoretical Physics* 127:791-817. https://arxiv.org/abs/1203.1456v1

McCracken, Daniel D. 1955. The Monte Carlo method. *Scientific American* 192(5):90–96.

Metropolis, Nicholas, and S. Ulam. 1949. The Monte Carlo method. *Journal of the American Statistical Association* 247:335–341.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. *The Journal of Chemical Physics* 21(6):1087–1092.

Metropolis, N. 1987. The beginning of the Monte Carlo method. *Los Alamos Science* 15:125–130.

Moore, Cristopher, and Stephan Mertens. 2011. *The Nature of Computation*. Oxford: Oxford University Press.

Onsager, Lars. 1944. Crystal Statistics. I. A two-dimensional model with an order-disorder transition. *Physical Review* 65:117–149.

Pérez, Gabriel, Francisco Sastre, and Rubén Medina. 2002. Critical exponents for extended dynamical systems with simultaneous updating: the case of the Ising model. *Physica D* 168–169:318–324.

Peierls, R. 1936. On Ising’s model of ferromagnetism. *Proceedings of the Cambridge Philosophical Society, Mathematical and Physical Sciences* 32:477–481.

Peskun, P. H. 1973. Optimum Monte-Carlo sampling using Markov chains. *Biometrika* 60(3):607–612.

Richey, Matthew. 2010. The evolution of Markov chain Monte Carlo methods. *American Mathematical Monthly* 117:383–413.

Robert, Christian, and George Casella. 2011. A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. *Statistical Science* 26(1):102–115. Also appears as a chapter in *Handbook of Markov Chain Monte Carlo*. Also available as arXiv preprint 0808.2902v7.

Rosenbluth, Marshall N. 2003a. Genesis of the Monte Carlo algorithm for statistical mechanics. AIP Conference Proceedings 690, 22. https://doi.org/10.1063/1.1632112.

Rosenbluth, Marshall N. 2003b. Interviewe with Kai-Henrik Barth, La Jolla, California, August 11, 2003. Niels Bohr Library & Archives, American Institute of Physics, College Park, MD. https://www.aip.org/history-programs/niels-bohr-library/oral-histories/28636-1

Sadiq, A., and K. Binder. 1984. Dynamics of the formation of two-dimensional ordered structures. *Journal of Statistical Physics* 35(5/6):517–585.

Spirin, V., P. L. Krapivsky, and S. Redner. 2002. Freezing in Ising ferromagnets. *Physical Review E* 65(1):016119. https://arxiv.org/abs/cond-mat/0105037.

Stigler, Stephen M. 1991. Stochastic simulation in the nineteenth century. Statistical Science 6:89-97.

Stoll, E., K. Binder, and T. Schneider. 1973. Monte Carlo investigation of dynamic critical phenomena in the two-dimensional kinetic Ising model. *Physical Review B* 8(7):3266–3289.

Tierney, Luke. 1994. Markov chains for exploring posterior distributions. *The Annals of Statistics* 22(4):1701–1762.

Ulam, Stanislaw M. 1976, 1991. *Adventures of a Mathematician*. Berkeley: University of California Press.

Valleau, John P., and Stuart G. Whittington. 1977. Monte Carlo in statistical mechanics: choosing between alternative transition matrices. *Journal of Computational Physics *24:150–157.

Wilson, Kenneth G. 1979. Problems in physics with many scales of length. *Scientific American* 241(2):158–179.

Wood, W. W. 1986. Early history of computer simulations in statistical mechanics. In *Molecular-Dynamics Simulation of Statistical-Mechanics Systems*, edited by G. Ciccotti and W. G. Hoover (North-Holland, New York): pp. 3–14. https://digital.library.unt.edu/ark:/67531/metadc1063911/m1/1/

Wood, William W. 2003. A brief history of the use of the Metropolis method at LANL in the 1950s. AIP Conference Proceedings 690, 39. https://doi.org/10.1063/1.1632115

]]>This flimsy slip of paper seems like an odd scrap to preserve for the ages, but when I pulled it out of the envelope, I knew instantly where it came from and why I had saved it.

The year was 1967. I was 17 then; I’m 71 now. Transposing those two digits takes just a flick of the fingertips. I can blithely skip back and forth from one prime number to the other. But the span of lived time between 1967 and 2021 is a chasm I cannot so easily leap across. At 17 I was in a great hurry to grow up, but I couldn’t see as far as 71; I didn’t even try. Going the other way—revisiting the mental and emotional life of an adolescent boy—is also a journey deep into alien territory. But the straw wrapper helps—it’s a Proustian *aide memoire*.

In the spring of 1967 I had a girlfriend, Lynn. After school we would meet at the Maple Diner, where the booths had red leatherette upholstery and formica tabletops with a boomerang motif. We’d order two Cokes and a plate of french fries to share. The waitress liked us; she’d make sure we had a full bottle of ketchup. I mention the ketchup because it was a token of our progress toward intimacy. On our first dates Lynn had put only a dainty dab on her fries, but by April we were comfortable enough to reveal our true appetites.

One afternoon I noticed she was fiddling intently with the wrapper from her straw, folding and refolding. I had no idea what she was up to. A teeny paper airplane she would sail over my head? When she finished, she pushed her creation across the table:

What a wallop there was in that little wad of paper. At that point in our romance, the words had not yet been spoken aloud.

How did I respond to Lynn’s folded declaration? I can’t remember; the words are lost. But evidently I got through that awkward moment without doing any permanent damage. A year later Lynn and I were married.

Today, at 71, with the preserved artifact in front of me, my chief regret is that I failed to take up the challenge implicit in the word game Lynn had invented. Why didn’t I craft a reply by folding my own straw wrapper? There are quite a few messages I could have extracted by strategic deletions from “It’s a pleasure to serve you.”

itsapleasuretoserveyou==> I love you.

itsapleasuretoserveyou==> I please you.

itsapleasuretoserveyou==> I tease you.

itsapleasuretoserveyou==> I pleasure you.

itsapleasuretoserveyou==> I pester you.

itsapleasuretoserveyou==> I peeve you.

itsapleasuretoserveyou==> I salute you.

itsapleasuretoserveyou==> I leave you.

Not all of those statements would have been suited to the occasion of our rendezvous at the Maple Diner, but over the course of our years together—17 years, as it turned out—there came a moment for each of them.

How many words can we form by making folds in the straw-paper slogan? I could not have answered that question in 1967. I couldn’t have even asked it. But times change. Enumerating all the foldable messages now strikes me as an obvious thing to do when presented with the straw wrapper. Furthermore, I have the computational means to do it—although the project was not quite as easy as I expected.

A first step is to be explicit about the rules of the game. We are given a source text, in this case “It’s a pleasure to serve you.” Let us ignore the spaces between words as well as all punctuation and capitalization; in this way we arrive at the normalized text “itsapleasuretoserveyou”. A word is *foldable* if all of its letters appear in the normalized text in the correct order (though not necessarily consecutively). The folding operation amounts to an editing process in which our only permitted act is deletion of letters; we are not allowed to insert, substitute, or permute. If two or more foldable words are to be combined to make a phrase or sentence, they must follow one another in the correct order without overlaps.

So much for foldability. Next comes the fraught question: What is a word? Linguists and lexicographers offer many subtly divergent opinions on this point, but for present purposes a very simple definition will suffice: A finite sequence of characters drawn from the 26-letter English alphabet is a word if it can legally be played in a game of Scrabble. I have been working with a word list from the 2015 edition of Collins Scrabble Words, which has about 270,000 entries. (There are a number of alternative lists, which I discuss in an appendix at the end of this article.)

Scrabble words range in length from 2 to 15 letters. The upper limit—determined by the size of the game board—is not much of a concern. You’re unlikely to meet a straw-paper text that folds to yield words longer than *sesquipedalian*. The absence of 1-letter words is more troubling, but the remedy is easy: I simply added the words *a*, *I*, and *O* to my copy of the Scrabble list.

My first computational experiments with foldable words searched for examples at random. Writing a program for random sampling is often easier than taking an exact census of a population, and the sample offers a quick glimpse of typical results. The following Python procedure generates random foldable sequences of letters drawn from a given source text, then returns those sequences that are found in the Scrabble word list. (The parameter *k* is the length of the words to be generated, and *reps* specifies the number of random trials.)

```
def randomFoldableWords(text, lexicon, k, reps):
normtext = normalize(text)
n = len(normtext)
findings = []
for i in range(reps):
indices = random.sample(range(n), k)
indices.sort()
letters = ""
for idx in indices:
letters += normtext[idx]
if letters in lexicon:
findings.append(letters)
return findings
```

Here are the six-letter foldable words found by invoking the program as follows: `randomFoldableWords(scrabblewords, 6, 10000)`

.

please, plater, searer, saeter, parter, sleety, sleeve, parser, purvey, laster, islets, taster, tester, slarts, paseos, tapers, saeter, eatery, salute, tsetse, setose, salues, sparer

Note that the word saeter (you could look it up—I had to) appears twice in this list. The frequency of such repetitions can yield an estimate of the total population size. A variant of the mark-and-recapture method, well-known in wildlife ecology, led me to an estimate of 92 six-letter foldable Scrabble words in the straw-wrapper slogan. The actual number turns out to be 106.

Samples and estimates are helpful, but they leave me wondering, What am I missing? What strange and beautiful word has failed to turn up in any of the samples, like the big fish that never takes the bait? I had to have an exhaustive list.

In many word games, the tool of choice for computer-aided playing (or cheating) is the regular expression, or regex. A regex is a pattern defining a set of strings, or character sequences; from a collection of strings, a regex search will pick out those that match the pattern. For example, the regular expression `^.*love.*$`

selects from the Scrabble word list all words that have the letter sequence *love* somewhere within them. There are 137 such words, including some that I would not have thought of, such as *rollover* and *slovenly*. The regex `^.*l.*o.*v.*e.*$`

finds all words in which *l, o, v,* and *e* appear in sequence, whether of not they are adjacent. The set has 267 members, including such secret-lover gems as *bloviate*, *electropositive*, and *leftovers*.

A solution to the foldable words problem could surely be crafted with regular expressions, but I am not a regex wizard. In search of a more muggles-friendly strategy, my first thought was to extend the idea behind the random-sampling procedure. Instead of selecting foldable sequences at random, I’d generate all of them, and check each one against the word list.

The procedure below generates all three-letter strings that can be folded from the given text, and returns the subset of those strings that appear in the Scrabble word list:

```
def foldableStrings3(lexicon, text):
normtext = normalize(text)
n = len(normtext)
words = []
for i in range(0, n-2):
for j in range(i+1, n-1):
for k in range(j+1, n):
s = normtext[i] + normtext[j] + normtext[k]
if s in lexicon:
words.append(s)
return(words)
```

At the heart of the procedure are three nested loops that methodically step through all the foldable combinations: For any initial letter `text[i]`

we can choose any following letter `text[j]`

with` j > i`

; likewise `text[j]`

can be followed by any `text[k]`

with `k > j`

. This scheme works perfectly well, finding 348 instances of three-letter words. I speak of “instances” because some words appear in the list more than once; for example, *pee* can be formed in three ways. If we count only unique words, there are 137.

Following this model, we could write a separate routine for each word length from 1 to 15 letters, but that looks like a dreary and repetitious task. Nobody wants to write a procedure with loops nested 15 deep. An alternative is to write a meta-procedure, which would generate the appropriate procedure for each word length. I made a start on that exercise in advanced loopology, but before I got very far I realized there’s an easier way. I was wondering: In a text of *n* letters, how many foldable substrings exist—whether or not they are recognizable words? There are several ways of answering this question, but to me the most illuminating argument comes from an inclusion/exclusion principle. Consider the first letter of the text, which in our case is the letter *I*. In the set of all foldable strings, half include this letter and half exclude it. The same is true of the second letter, and the third, and so on. Thus each letter added to the text doubles the number of foldable strings, which means the total number of strings is simply \(2^n\). (Included in this count is the empty string, made up of no letters.)

This observation suggests a simple algorithm for generating all the foldable strings in any *n*-letter text. Just count from \(0\) to \(2^{n} - 1\), and for each value along the way line up the binary representation of the number with the letters of the text. Then select those letters that correspond to a `1`

bit, like so:

itsapleasuretoserveyou 0000100000110011111000

And so we see that the word `preserve`

corresponds to the binary representation of the number `134392`

.

Counting is something that computers are good at, so a word-search procedure based on this principle is straightforward:

```
def foldablesByCounting(lexicon, text):
normtext = normalize(text)
n = len(normtext)
words = []
for i in range(2**n - 1):
charSeq = ''
positions = positionsOf1Bits(i, n)
for p in positions:
charSeq += normtext[p]
if charSeq in lexicon:
words.append(charSeq)
return(words)
```

The outer loop (variable `i`

) counts from \(0\) to \(2^{n} - 1\); for each of these numbers the inner loop (variable `p`

) picks out the letters corresponding to 1 bits. The program produces the output expected. Unfortunately, it does so very slowly. For every character added to the text, running time roughly doubles. I haven’t the patience to plod through the \(2^{22}\) patterns in “itsapleasuretoserveyou”; estimates based on shorter phrases suggest the running time would be more than three hours.

In the middle of the night I realized my approach to this problem was totally backwards. Instead of blindly generating all possible character strings and filtering out the few genuine words, I could march through the list of Scrabble words and test each of them to see if it’s foldable. At worst I would have to try some 270,000 words. I could speed things up even more by making a preliminary pass through the Scrabble list, discarding all words that include characters not present in the normalized text. For the text “It’s a pleasure to serve you,” the character set has just 12 members: `aeiloprstuvy`

. Allowing only words formed from these letters slashes the Scrabble list down to a length of 12,816.

To make this algorithm work, we need a procedure to report whether or not a word can be formed by folding the given text. The simplest approach is to slide the candidate word along the text, looking for a match for each character in turn:

taste itsapleasuretoserveyoutaste itsapleasuretoserveyout aste itsapleasuretoserveyout a ste itsapleasuretoserveyout a s te itsapleasuretoserveyout a s t eitsapleasuretoserveyou

If every letter of the word finds a mate in the text, the word is foldable, as in the case of `taste`

, shown above. But an attempt to match `tastes`

would fall off the end of the text looking for a second `s`

, which does not exist.

The following code implements this idea:

```
def wordIsFoldable(word, text):
normtext = normalize(text)
t = 0 # pointer to positions in normtext
w = 0 # pointer to positions in word
while t < len(normtext):
if word[w] == normtext[t]: # matching chars in word and text
w += 1 # move to next char in word
if w == len(word): # matched all chars in word
return(True) # so: thumbs up
t += 1 # move to next char in text
return(False) # fell off the end: thumbs down
```

All we need to do now is embed this procedure in a loop that steps through all the candidate Scrabble words, collecting those for which `wordIsFoldable`

returns `True`

.

There’s still some waste motion here, since we are searching letter-by-letter through the same text, and repeating the same searches thousands of times. The source code (available on GitHub as a Jupyter notebook) explains some further speedups. But even the simple version shown here runs in less than two tenths of a second, so there’s not much point in optimizing.

I can now report that there are 778 unique foldable Scrabble words in “It’s a pleasure to serve you” (including the three one-letter words I added to the list). Words that can be formed in multiple ways bring the total count to 899.

And so we come to the tah-dah! moment—the unveiling of the complete list. I have organized the words into groups based on each word’s starting position within the text. (By Python convention, the positions are numbered from 0 through \(n-1\).) Within each group, the words are sorted according to the position of their last character; that position is given in the subscript following the word. For example, *tapestry* is in Group 1 because it begins at position 1 in the text (the *t* in *It’s*), and it carries the subscript 19 because it ends at position 19 (the *y* in *you*).

This arrangement of the words is meant to aid in contructing multiword phrases. If a word ends at position \(m\), the next word in the phrase must come from a group numbered \(m+1\) or greater.

**Group 0:** i_{0} it_{1} is_{2} its_{2} ita_{3} isle_{6} ilea_{7} isles_{8} itas_{8} ire_{11} issue_{11} iure_{11} islet_{12} io_{13} iso_{13} ileus_{14} ios_{14} ires_{14} islets_{14} isos_{14} issues_{14} issuer_{16} ivy_{19}

**Group 1:** ta_{3} tap_{4} tae_{6} tale_{6} tape_{6} te_{6} tala_{7} talea_{7} tapa_{7} tea_{7} taes_{8} talas_{8} tales_{8} tapas_{8} tapes_{8} taps_{8} tas_{8} teas_{8} tes_{8} tapu_{9} tau_{9} talar_{10} taler_{10} taper_{10} tar_{10} tear_{10} tsar_{10} taleae_{11} tare_{11} tease_{11} tee_{11} tapet_{12} tart_{12} tat_{12} taut_{12} teat_{12} test_{12} tet_{12} tret_{12} tut_{12} tao_{13} taro_{13} to_{13} talars_{14} talers_{14} talus_{14} taos_{14} tapers_{14} tapets_{14} tapus_{14} tares_{14} taros_{14} tars_{14} tarts_{14} tass_{14} tats_{14} taus_{14} tauts_{14} tears_{14} teases_{14} teats_{14} tees_{14} teres_{14} terts_{14} tests_{14} tets_{14} tres_{14} trets_{14} tsars_{14} tuts_{14} tasse_{15} taste_{15} tate_{15} terete_{15} terse_{15} teste_{15} tete_{15} toe_{15} tose_{15} tree_{15} tsetse_{15} taperer_{16} tapster_{16} tarter_{16} taser_{16} taster_{16} tater_{16} tauter_{16} tearer_{16} teaser_{16} teer_{16} teeter_{16} terser_{16} tester_{16} tor_{16} tutor_{16} tav_{17} tarre_{18} testee_{18} tore_{18} trove_{18} tutee_{18} tapestry_{19} tapstry_{19} tarry_{19} tarty_{19} tasty_{19} tay_{19} teary_{19} terry_{19} testy_{19} toey_{19} tory_{19} toy_{19} trey_{19} troy_{19} try_{19} too_{20} toro_{20} toyo_{20} tatou_{21} tatu_{21} tutu_{21}

**Group 2:** sap_{4} sal_{5} sae_{6} sale_{6} sea_{7} spa_{7} sales_{8} sals_{8} saps_{8} seas_{8} spas_{8} sau_{9} sar_{10} sear_{10} ser_{10} slur_{10} spar_{10} spear_{10} spur_{10} sur_{10} salse_{11} salue_{11} seare_{11} sease_{11} seasure_{11} see_{11} sere_{11} sese_{11} slae_{11} slee_{11} slue_{11} spae_{11} spare_{11} spue_{11} sue_{11} sure_{11} salet_{12} salt_{12} sat_{12} saut_{12} seat_{12} set_{12} slart_{12} slat_{12} sleet_{12} slut_{12} spart_{12} spat_{12} speat_{12} spet_{12} splat_{12} spurt_{12} st_{12} suet_{12} salto_{13} so_{13} salets_{14} salses_{14} saltos_{14} salts_{14} salues_{14} sapless_{14} saros_{14} sars_{14} sass_{14} sauts_{14} sears_{14} seases_{14} seasures_{14} seats_{14} sees_{14} seres_{14} sers_{14} sess_{14} sets_{14} slaes_{14} slarts_{14} slats_{14} sleets_{14} slues_{14} slurs_{14} sluts_{14} sos_{14} spaes_{14} spares_{14} spars_{14} sparts_{14} spats_{14} spears_{14} speats_{14} speos_{14} spets_{14} splats_{14} spues_{14} spurs_{14} spurts_{14} sues_{14} suets_{14} sures_{14} sus_{14} salute_{15} saree_{15} sasse_{15} sate_{15} saute_{15} setose_{15} slate_{15} sloe_{15} sluse_{15} sparse_{15} spate_{15} sperse_{15} spree_{15} saeter_{16} salter_{16} saluter_{16} sapor_{16} sartor_{16} saser_{16} searer_{16} seater_{16} seer_{16} serer_{16} serr_{16} slater_{16} sleer_{16} spaer_{16} sparer_{16} sparser_{16} spearer_{16} speer_{16} spuer_{16} spurter_{16} suer_{16} surer_{16} sutor_{16} sav_{17} sov_{17} salve_{18} save_{18} serre_{18} serve_{18} slave_{18} sleave_{18} sleeve_{18} slove_{18} sore_{18} sparre_{18} sperre_{18} splore_{18} spore_{18} stere_{18} sterve_{18} store_{18} stove_{18} salary_{19} salty_{19} sassy_{19} saury_{19} savey_{19} say_{19} serry_{19} sesey_{19} sey_{19} slatey_{19} slaty_{19} slavey_{19} slay_{19} sleety_{19} sley_{19} slurry_{19} sly_{19} soy_{19} sparry_{19} spay_{19} speary_{19} splay_{19} spry_{19} spurrey_{19} spurry_{19} spy_{19} stey_{19} storey_{19} story_{19} sty_{19} suety_{19} surety_{19} surrey_{19} survey_{19} salvo_{20} servo_{20} stereo_{20} sou_{21} susu_{21}

**Group 3:** a_{3} al_{5} ae_{6} ale_{6} ape_{6} aa_{7} ala_{7} aas_{8} alas_{8} ales_{8} als_{8} apes_{8} as_{8} alu_{9} alar_{10} aper_{10} ar_{10} alae_{11} alee_{11} alure_{11} apse_{11} are_{11} aue_{11} alert_{12} alt_{12} apart_{12} apert_{12} apt_{12} aret_{12} art_{12} at_{12} aero_{13} also_{13} alto_{13} apo_{13} apso_{13} auto_{13} aeros_{14} alerts_{14} altos_{14} alts_{14} alures_{14} alus_{14} apers_{14} apos_{14} apres_{14} apses_{14} apsos_{14} apts_{14} ares_{14} arets_{14} ars_{14} arts_{14} ass_{14} ats_{14} aures_{14} autos_{14} alate_{15} aloe_{15} arete_{15} arose_{15} arse_{15} ate_{15} alastor_{16} alerter_{16} alter_{16} apter_{16} aster_{16} arere_{18} ave_{18} aery_{19} alary_{19} alay_{19} aleatory_{19} apay_{19} apery_{19} arsey_{19} arsy_{19} artery_{19} artsy_{19} arty_{19} ary_{19} ay_{19} aloo_{20} arvo_{20} avo_{20} ayu_{21}

**Group 4:** pe_{6} pa_{7} pea_{7} plea_{7} pas_{8} peas_{8} pes_{8} pleas_{8} plu_{9} par_{10} pear_{10} per_{10} pur_{10} pare_{11} pase_{11} peare_{11} pease_{11} pee_{11} pere_{11} please_{11} pleasure_{11} plue_{11} pre_{11} pure_{11} part_{12} past_{12} pat_{12} peart_{12} peat_{12} pert_{12} pest_{12} pet_{12} plast_{12} plat_{12} pleat_{12} pst_{12} put_{12} pareo_{13} paseo_{13} peso_{13} pesto_{13} po_{13} pro_{13} pareos_{14} pares_{14} pars_{14} parts_{14} paseos_{14} pases_{14} pass_{14} pasts_{14} pats_{14} peares_{14} pears_{14} peases_{14} peats_{14} pees_{14} peres_{14} perts_{14} pesos_{14} pestos_{14} pests_{14} pets_{14} plats_{14} pleases_{14} pleasures_{14} pleats_{14} plues_{14} plus_{14} pos_{14} pros_{14} pures_{14} purs_{14} pus_{14} puts_{14} parse_{15} passe_{15} paste_{15} pate_{15} pause_{15} perse_{15} plaste_{15} plate_{15} pose_{15} pree_{15} prese_{15} prose_{15} puree_{15} purse_{15} parer_{16} parr_{16} parser_{16} parter_{16} passer_{16} paster_{16} pastor_{16} pater_{16} pauser_{16} pearter_{16} peer_{16} perter_{16} pester_{16} peter_{16} plaster_{16} plater_{16} pleaser_{16} pleasurer_{16} pleater_{16} poser_{16} pretor_{16} proser_{16} puer_{16} purer_{16} purr_{16} purser_{16} parev_{17} pav_{17} perv_{17} pareve_{18} parore_{18} parve_{18} passee_{18} pave_{18} peeve_{18} perve_{18} petre_{18} pore_{18} preeve_{18} preserve_{18} preve_{18} prore_{18} prove_{18} parry_{19} party_{19} pastry_{19} pasty_{19} patsy_{19} paty_{19} pay_{19} peatery_{19} peaty_{19} peavey_{19} peavy_{19} peeoy_{19} peery_{19} perry_{19} pervy_{19} pesty_{19} plastery_{19} platy_{19} play_{19} ploy_{19} plurry_{19} ply_{19} pory_{19} posey_{19} posy_{19} prey_{19} prosy_{19} pry_{19} pursy_{19} purty_{19} purvey_{19} puy_{19} parvo_{20} poo_{20} proo_{20} proso_{20} pareu_{21} patu_{21} poyou_{21}

**Group 5:** la_{7} lea_{7} las_{8} leas_{8} les_{8} leu_{9} lar_{10} lear_{10} lur_{10} lare_{11} lase_{11} leare_{11} lease_{11} leasure_{11} lee_{11} lere_{11} lure_{11} last_{12} lat_{12} least_{12} leat_{12} leet_{12} lest_{12} let_{12} lo_{13} lares_{14} lars_{14} lases_{14} lass_{14} lasts_{14} lats_{14} leares_{14} lears_{14} leases_{14} leasts_{14} leasures_{14} leats_{14} lees_{14} leets_{14} leres_{14} leses_{14} less_{14} lests_{14} lets_{14} los_{14} lues_{14} lures_{14} lurs_{14} laree_{15} late_{15} leese_{15} lose_{15} lute_{15} laer_{16} laser_{16} laster_{16} later_{16} leaser_{16} leer_{16} lesser_{16} lor_{16} loser_{16} lurer_{16} luser_{16} luter_{16} lav_{17} lev_{17} luv_{17} lave_{18} leave_{18} lessee_{18} leve_{18} lore_{18} love_{18} lurve_{18} lay_{19} leary_{19} leavy_{19} leery_{19} levy_{19} ley_{19} lory_{19} lovey_{19} loy_{19} lurry_{19} laevo_{20} lasso_{20} levo_{20} loo_{20} lassu_{21} latu_{21} lou_{21}

**Group 6:** ea_{7} eas_{8} es_{8} eau_{9} ear_{10} er_{10} ease_{11} ee_{11} ere_{11} east_{12} eat_{12} est_{12} et_{12} euro_{13} ears_{14} eases_{14} easts_{14} eats_{14} eaus_{14} eres_{14} eros_{14} ers_{14} eses_{14} ess_{14} ests_{14} euros_{14} erose_{15} esse_{15} easer_{16} easter_{16} eater_{16} err_{16} ester_{16} erev_{17} eave_{18} eve_{18} easy_{19} eatery_{19} eery_{19} estro_{20} evo_{20}

**Group 7:** a_{7} as_{8} ar_{10} ae_{11} are_{11} aue_{11} aret_{12} art_{12} at_{12} auto_{13} ares_{14} arets_{14} ars_{14} arts_{14} ass_{14} ats_{14} aures_{14} autos_{14} arete_{15} arose_{15} arse_{15} ate_{15} aster_{16} arere_{18} ave_{18} aery_{19} arsey_{19} arsy_{19} artery_{19} artsy_{19} arty_{19} ary_{19} ay_{19} aero_{20} arvo_{20} avo_{20} ayu_{21}

**Group 8:** sur_{10} sue_{11} sure_{11} set_{12} st_{12} suet_{12} so_{13} sets_{14} sos_{14} sues_{14} suets_{14} sures_{14} sus_{14} see_{15} sese_{15} setose_{15} seer_{16} ser_{16} suer_{16} surer_{16} sutor_{16} sov_{17} sere_{18} serve_{18} sore_{18} stere_{18} sterve_{18} store_{18} stove_{18} sesey_{19} sey_{19} soy_{19} stey_{19} storey_{19} story_{19} sty_{19} suety_{19} surety_{19} surrey_{19} survey_{19} servo_{20} stereo_{20} sou_{21} susu_{21}

**Group 9:** ur_{10} ure_{11} ut_{12} ures_{14} us_{14} uts_{14} use_{15} ute_{15} ureter_{16} user_{16} uey_{19} utu_{21}

**Group 10:** re_{11} ret_{12} reo_{13} reos_{14} res_{14} rets_{14} ree_{15} rete_{15} roe_{15} rose_{15} rev_{17} reeve_{18} resee_{18} reserve_{18} retore_{18} rore_{18} rove_{18} retry_{19} rory_{19} rosery_{19} rosy_{19} retro_{20} roo_{20}

**Group 11:** et_{12} es_{14} ee_{15} er_{16} ere_{18} eve_{18} eery_{19} evo_{20}

**Group 12:** to_{13} te_{15} toe_{15} tose_{15} tor_{16} tee_{18} tore_{18} toey_{19} tory_{19} toy_{19} trey_{19} try_{19} too_{20} toro_{20} toyo_{20}

**Group 13:** o_{13} os_{14} oe_{15} ose_{15} or_{16} ore_{18} oy_{19} oo_{20} ou_{21}

**Group 14:** ser_{16} see_{18} sere_{18} serve_{18} sey_{19} servo_{20} so_{20} sou_{21}

**Group 15:** er_{16} ee_{18} ere_{18} eve_{18} evo_{20}

**Group 16:** re_{18} reo_{20}

**Group 17:**

**Group 18:**

**Group 19:** yo_{20} you_{21} yu_{21}

**Group 20:** o_{20} ou_{21}

**Group 21:**

Naturally, I’ve tried out the code on a few other well-known phrases.

If Lynn and I had met at a different dining establishment, she might have found a straw with the statement, “It takes two hands to handle a Whopper.” There’s quite a diverse assortment of possible messages lurking in this text, with 1,154 unique foldable words and almost 2,000 word instances. Perhaps she would have chosen the upbeat “Inhale hope.” Or, in a darker mood, “I taste woe.”

If we had been folding dollar bills instead of straw wrappers, “In God We Trust” might have become the forward-looking proclamation, “I go west!” Horace Greeley’s marching order on the same theme, “Go west, young man,” gives us the enigmatic “O, wet yoga!” or, perhaps more aptly, “Gunman.”

Jumping forward from 1967 to 2021—from the Summer of Love to the Winter of COVID—I can turn “Wear a mask. Wash your hands.” into the plaintive, “We ask: Why us?” With “Maintain social distance,” the best I can do is “A nasal dance” or “A sad stance.”

And then there’s “Make America Great Again.” It yields “Meme rage.” Also “Make me ragtag.”

In a project like this one, you might think that getting a suitable list of English words would be the easy part. In fact it seems to be the main trouble spot.

The Scrabble lexicon I’ve been relying on derives from a word list known as SOWPODS, compiled by two associations of Scrabble players starting in the 1980s. Current editions of the list are distributed by a commercial publisher, Collins Dictionaries. If I understand correctly, all versions of the list are subject to copyright (see discussion on Stack Exchange) and cannot legally be distributed without permission. But no one seems to be much bothered by that fact. Copies of the lists in plain-text format, with one word per line, are easy to find on the internet—and not just on dodgy sites that specialize in pirated material.

There are alternative lists without legal encumbrances. Indeed, there’s a good chance you already have one such list pre-installed on your computer. A file called `words`

is included in most distributions of the Unix operating system, including MacOS; my copy of the file lives in `usr/share/dict/words`

. If you don’t have or can’t find the Unix `words`

file, I suggest downloading the Natural Language Toolkit, a suite of data files and Python programs that includes a lexicon almost identical to Unix words, as well as many other linguistic resources.

The Scrabble list has one big advantage over `words`

: It includes plurals and inflected forms of verbs—not just *test* but also *tests*, *tested*, and *testing*. [Bad example; see comments below.] The `words`

file is more like a list of dictionary head words, with only the stem form explicitly included. On the other hand, `words`

has an abundance of names and other proper nouns, as well as abbreviations, which are excluded from the Scrabble list since they are not legal plays in the board game.

How about combining the two word lists? Their union has just under 400,000 entries—quite a large lexicon. Using this augmented list for the analysis of “It’s a pleasure to serve you,” my program finds an additional 219 foldable words, beyond the 778 found with the Scrabble list alone. Here they are:

aaru aer aerose aes alares alaster alea alerse aleut alo alose alur aly ao apa apar aperu apus aro arry aru ase asor asse ast astor atry aueto aurore aus ausu aute e eastre eer erse esere estre eu ey iao ie ila islay ist isuret itala itea iter ito iyo l laet lao larry larve lastre lasty latro laur leo ler lester lete leto loro lu lue luo lut luteo lutose ly oer ory ovey p parsee parto passo pastose pato pau paut pavo pavy peasy perty peru pess peste pete peto petr plass platery pluto poe poy presee pretry pu purre purry puru r reve ro roer roey roy s sa saa salar salat salay saltee saltery salvy sao sapa saple sapo sare sart saur sauty sauve se seary seave seavy seesee sero sert sesuto sla slare slav slete sloo sluer soe sory soso spary spass spave spleet splet splurt spor spret sprose sput ssu stero steve stre strey stu sueve suto sutu suu t taa taar tal talao talose taluto tapeats tapete taplet tapuyo tarr tarse tartro tarve tasser tasu taur tave tavy teaer teaey teart teasy teaty teave teet teety tereu tess testor toru torve tosy tou treey tsere tst tu tue tur turr turse tute tutory u uro urs uru usee v vu y

Many of the proper nouns in this list are present in the vocabulary of most English speakers: *Aleut, Peru, Pluto, Slav*; the same is true of personal names such as *Larry, Leo, Stu, Tess*. But the rest of the words are very unlikely to turn up in the smalltalk of teenage sweethearts. Indeed, the list is full of letter sequences I simply don’t recognize as English words. Please define *isuret, ovey, spleet,* or *sput*.

There are even bigger word lists out there. In 2006 Google extracted 13.5 million unique English words from public web pages. (The sheer number implies a very liberal definition of *English* and *word*.) A good place to start exploring this archive is Peter Norvig’s website, which offers a file with the 333,333 most frequent words from the corpus. The list begins as you might expect: *the, of, and, to, a, in, for*…; but the weirdness creeps in early. The single letters *c, e, s,* and *x* are all listed among the 100 most common “words,” and the rest of the alphabet turns up soon after. By the time we get to the end of the file, it’s mostly typos *(mepquest, halloweeb, scholarhips)*, run-together words *(dietsdontwork, weightlossdrugs)*, and hundreds of letter strings that have some phonetic or orthographic resemblance to *Google* or *Yahoo!* or both *(hoogol, googgl, yahhol, gofool, yogol)*. (I suspect that much of this rubbish was scraped not from the visible text of web pages but from metadata stuffed into headers for purposes of search-engine optimization.)

Applying the Google list to the search for foldable words more than doubles the volume of results, but it contributes almost nothing to the stock of words that might form interesting messages. I found 1,543 new words, beyond those that are also present in the union of the Scrabble and Unix lists. In alphabetical order, the additions begin: *aae, aao, aaos, aar, aare, aaro, aars, aart, aarts, aase, aass, aast, aasu, aat, aats, aatsr, aau, aaus, aav, aave, aay, aea, aeae….* I’m not going to be folding up any straw wrappers with those words for my sweetheart.

What we really need, I begin to think, is not a longer word list but a shorter and more discriminating one.

]]>The tableau presented below is a product of my amateur efforts to address these questions. It’s a simple exercise in the mechanics of probability. I take a sample of the U.S. population, roughly 10,000 people, and randomly assign them to clusters of size \(n\), where \(n\) can range from 1 to 32. (In any single run of the model, \(n\) is fixed; all the groups are the same size.) Each cluster represents a Thanksgiving gathering. If a cluster includes someone infected with SARS-CoV-2, the disease may spread to the uninfected and susceptible members of the same group.

With the model’s default settings, \(n = 12\). The population sample consists of 9,900 people, represented as tiny colored dots arranged in 825 clusters of 12 dots each. Most of the dots are green, indicating susceptible individuals. Red dots are the infectious spreaders. Purple dots represent the unfortunates who are newly infected as a result of mingling with spreaders in these holiday get-togethers. I count the purple dots and estimate the rate of new infections per 100,000 population.

You can explore the model on your own. Twiddle with the sliders in the control panel, then press the “Go” button to generate a new sample population and a new cycle of infections. For example, by moving the group-size slider you can get a thousand clusters of 10 persons each, or 400 clusters of 25 each.

Before going any further with this discussion, I should make clear that the simulation is *not* offered as a prediction of how Covid-19 will spread during tomorrow’s Thanksgiving festivities. This is not a guide to personal risk assessment. If you play around with the controls, you’ll soon discover you can make the model say anything you wish. Depending on the settings you choose, the result can lie anywhere along the entire spectrum of possible outcomes, from nobody-gets-sick to everybody’s-got-it. There are settings that lead to impossible states, such as infection rates beyond 100 percent. Even so, I’m not totally convinced that the model is useless. It might point to combinations of parameters that would limit the damage.

The crucial input that drives the model is the daily tally of Covid cases for the entire country, expressed as a rate of new infections per 100,000 population. The official version of this statistic is published by the CDC; a few other organizations, including Johns Hopkins and the New York Times, maintain their own daily counts. The CDC report for November 24 cites a seven-day rolling average of 52.3 new cases per 100,000 people. For the model I set the default rate at 50, but the slider marked “daily new cases per 100,000 population” will accommodate any value between 0 and 500.

From the daily case rate we can estimate the prevalence of the disease: the total number of active cases at a given moment. In the model, the prevalence is simply 14 times the daily case rate. In effect, I am assuming (or pretending) that the daily rate is unchanging and that everyone’s illness lasts 14 days from the moment of infection to full recovery. Neither of these assumptions is true. In a model of ongoing disease propagation, where today’s events determine what happens next week, the steady-state approximation would be unacceptable. But this model produces only a snapshot on one particular day of the year, and so dynamics are not very important.

What we *do* need to consider in more detail is the sequence of stages in a case of Covid-19. The archetypal model in epidemiology has three stages: susceptible *(S)*, infected *(I)*, and recovered *(R)*; *R* stands for “removed,” acknowledging that recovery isn’t the only possible end of an illness. But I am going to look away from the grimmer aspects of this story.*(U)*, infectious *(I)*, and symptomatic *(Q)*, which gives us a SUIQR model. An incubating patient has been infected but is not yet producing enough virus particles to infect others. The infectious stage is the most dangerous period: Patients have no conspicuous symptoms and are still unaware of their own infection, but nonetheless they are spewing virus particles with every breath.

During the symptomatic phase, patients know they are sick and should be in quarantine; hence the letter *Q*. For the purposes of the model I assume that everyone in category *Q* will decline the invitation to Thanksgiving dinner. *x*, which I think of as an empty chair at the dinner table. The purple dots for newly acquired infections add a sixth category to the model, although they really belong to the incubating *U* class.

A parameter of some importance is the duration of the presymptomatic infectious stage, since the red-dot people in that category are the only ones actually spreading the disease in my model of Thanksgiving gatherings. I made a foray into the medical literature to pin down this number, but what I learned is that after a year of intense scrutiny there’s still a lot we don’t know about Covid-19. The typical period from infection to the onset of symptoms (encompassing both the *U* and *I* stages of my model) is four or five days, but apparently it can range from less than two days to three weeks. The graph below is based on a paper by Conor McAloon and colleagues that aggregates results of eight studies carried out early in the pandemic (when it’s easier to determine the date of infection, since cases are rare and geographically isolated).

Ultimately I decided, for the sake of simplicity (or lazy convenience) to collapse this distribution to its median, which is about five days. Then there’s the question of when within this period an infected person becomes dangerous to those nearby. Various sources [Harvard, MIT, Fox News] suggest that infected individuals begin spreading the virus two or three days before they show symptoms, and that the moment of maximum infectiousness comes shortly before symptom onset. I chose to interpret “two or three days” as 2.5 days.

What all this boils down to is the following relation: If the national new-case rate is 50 per 100,000, then among Thanksgiving celebrants in the model, 125 per 100,000 are Covid spreaders. That’s 0.125 percent. Turn to the person on your left. Turn to your right. Are you feeling lucky?

The model’s default settings assume a new-case rate of \(50\) per \(100,000\), a Thanksgiving group size of \(12\), and a \(0.25\) probability of transmitting the virus from an infectious person to a susceptible person. Let’s do some back-of-the-envelope calculating. As noted above, the \(50/100{,}000\) new case rate translates into \(125/100{,}000\) infectious individuals. Among the \(\approx 10,000\) members of the model population, we shoud expect to see \(12\) or 1\(\)3 red-dot *I*s. Because the number of *I*s is much smaller than the number of groups \((825)\), it’s unlikely that more than one red dot will turn up in any single group of \(12\). In each group with a single spreader, we can expect the virus to jump to \(0.25 \times 11 = 2.75\) of the spreader’s companions. This assumes that all the companions are green-dot susceptibles, which isn’t quite true. There are also yellow-dot incubating and blue-dot recovered people, as well as the red-*x* empty chairs of those in quarantine. But these are small corrections. The envelope estimate gives \(344/100,000\) new infections on Thanksgiving day; the computer model yields 325 per 100,000, when averaged over many runs.

But the average doesn’t tell the whole story. The variance of these outcomes is quite high, as you’ll see if you press the “Go” button repeatedly. Counting the number of new infections in each of a million runs of the model, the distribution looks like this:

The peak of the curve is at 30 new infections per model run, which corresponds to about 300 cases for 100,000 population, but you shouldn’t be surprised to see a result of 150 or 500.

If the effect of Thanksgiving gatherings in the real world matches the results of this model, we’re in serious trouble. A rate of 300 cases per 100,000 people corresponds to just under a million new cases in the U.S. population. All of those infections would arise on a single day (although few of them would be detected until about a week later). That’s an outburst of contagion more than five times bigger than the worst daily toll recorded so far.

But there are plenty of reasons to be skeptical of this result.

Even in a “normal” year, not everyone in America sits down at a table for 12 to exchange gossip and germs, and surely many more will be sitting out this year’s events. According to a survey attributed

Another potential mitigating factor is that people invited to your holiday celebration are probably not selected at random from the whole population, as they are in the model. Guests tend to come in groups, often family units. If your aunt and uncle and their three kids all live together, they probably get sick together, too. Thus a gathering of 12 individuals might better be treated as an assembly of three or four “pods.” One way to introduce this idea into the computer model is to enforce nonzero correlations between the people selected for each group. If one attendee is infectious, that raises the probability that others will also be infectious, and vice versa. As the correlation coefficient increases, groups are increasingly homogeneous. If lots of spreaders are crowded in one group, they can’t infect the vulnerable people in other groups. In the model, a correlation coefficient of 0.5 reduces the average number of new cases from 32.5 to 23.5. (Complete or perfect correlation eliminates contagion altogether, but this is highly unrealistic.)

Geography should also be considered. The national average case rate of 50 per 100,000 conceals huge local and regional variations. In Hawaii the rate is about 5 cases, so if you and all your guests are Hawaiians, you’ll have to be quite unlucky to pick up a Covid case at the Thanksgiving luau. At the other end of the scale, there are counties in the Great Plains states that have approached 500 cases per 100,000 in recent weeks. A meal with a dozen attendees in one of those hotspots looks fairly calamitous: The model shows 3,000 new cases for 100,000, or 3 percent of the population.

If you are determined to have a big family meal tomorrow and you want to minimize the risks, there are two obvious strategies. You can reduce the chance that your gathering includes someone infectious, or you can reduce the likelihood that any infectious person who happens to be present will transmit the virus to others. Most of the recommendations I’ve read in the newspaper and on health-care websites focus on the latter approach. They urge us to wear masks, the keep everyone at arms’ length, to wash our hands, to open all the windows (or better yet to hold the whole affair outdoors). Making it a briefer event should also help.

In the model, any such measures are implemented by nudging the slider for transmission probability toward smaller values. The effect is essentially linear over a broad range of group sizes. Reducing the transmission probability by half reduces the number of new infections proportionally.

The trouble is, I have no firm idea of what the actual transmission probability might be, or how effective those practices would be in reducing it. A recent study by a group at Vanderbilt University found a transmission rate within households of greater than 50 percent. I chose 25 percent as the default value in the model on the grounds that spending a single day together should be less risky than living permanently under the same roof. But the range of plausible values remains quite wide. Perhaps studies done in the aftermath of this Thanksgiving will yield better data.

As for reducing the chance of having an infectious guest, one approach is simply to reduce the size of the group. In this case the effect is better than linear, but only slightly so. Splitting that 12-person meal into two separate 6-seat gatherings cuts the infection rate by a little more than half, from 32.5 to 15.2. And, predictably, larger groups have worse outcomes. Pack in 24 people per group and you can expect 70 infections. Neither of these strategies seems likely to cut the infection rate by a factor of 10 or more. Unless, of course, everyone eats alone. Set the group-size slider to 1, and no one gets sick.

Another factor to keep in mind is that this model counts only infections passed from person to person during a holiday get-together. Leaving all those cases aside, the country has quite a fierce rate of “background” transmission happening on days with no special events. If the Thanksgiving cases are to be added to the background cases, we’re even worse off than the model would suggest. But the effect could be just the opposite. A family holiday is an occasion when most people skip some ordinary activities that can also be risky. Most of us have the day off from work. We are less likely to go out to a bar or a restaurant. It’s even possible that the holiday will actually suppress the overall case rate. But don’t bet your life on it.

There’s one more wild card to be taken into account. A tacit assumption in the structure of the model is that the reported Covid case count accurately reflects the prevalence of the disease in the population. This is surely not quite true. There are persistent reports of asymptomatic cases—people who are infected and infectious, but who never feel unwell. Those cases are unlikely to be recorded. Others may be ill and suspect the cause is Covid but avoid getting medical care for one reason or another. (For example, they may fear losing their job.) All in all, it seems likely the CDC is under-reporting the number of infections.

Early in the course of the epidemic, a group at Georgia Tech led by Aroon Chande built a risk-estimating web tool based on case rates for individual U.S. counties. They included an adjustment for “ascertainment bias” to compensate for cases omitted from official public health estimates. Their model multiplies the reported case counts by a factor of either 5 or 10. This adjustment may well have been appropriate last spring, when Covid testing was hard to come by even for those with reasonable access to medical services. It seems harder to justify such a large multiplier now, but the model, which is still being maintained, continues to insert a fivefold or tenfold adjustment. Out of curiosity, I have included a slider that can be set to make a similar adjustment.

Is it possible that we are still counting only a tenth of all the cases? If so, the cumulative total of infections since the virus first came ashore in the U.S. is 10 times higher than official estimates. Instead of 12.5 million total cases, we’ve experienced 125 million; more than a third of the population has already been through the ordeal and (mostly) come out the other side. We’ll know the answer soon. At the present infection rate (multiplied by 10), we will have burned through another third of the population in just a few weeks, and infection rates should fall dramatically through herd immunity. (I’m not betting my life on this one either.)

One other element of the Covid story that ought to be in the model is testing, which provides another tool for improving the chances that we socialize only with safe companions. If tests were completely reliable, their effect would merely be to move some fraction of the dangerous red-dot category into the less-dangerous red-*x* quarantined camp. But false-positive and false-negative testing results complicate the situation. (If the actual infection rate is low, false positives may outnumber true positives.)

I offer no conclusions or advice as a result of my little adventure in computational epidemiology. You should not make life-or-death decisions based on the writings of some doofus at a website called bit-player. (Nor based on a tweet from @realDonaldTrump.)

I *do* have some stray thoughts about the nature of holidays in Covid times. In the U.S. most of our holidays, both religious and secular, are intensely social, convivial occasions. Thanksgiving is a feast, New Year’s Eve is a party, Mardi Gras is a parade, St. Patrick’s Day is a pub crawl, July Fourth is a picnic. I’m not asking to abolish these traditions, some of which I enjoy myself. But they are not helping matters in the midst of a raging epidemic. Every one of these occasions can be expected to produce a spike in that curve we’re supposed to be flattening.

I wish we could find a spot on the calendar for a new kind of holiday—a day or a weekend for silent and solitary contemplative respite. Close the door, or go off by yourself. Put a dent in the curve.

]]>In that essay I also mentioned three other questions about trees that have long been bothering me. In this sequel I want to poke at those other questions a bit more deeply.

Botanists have an elaborate vocabulary for describing leaf shapes: *cordate* (like a Valentine heart), *cuneate* (wedgelike), *ensiform* (sword shaped), *hastate* (like an arrowhead, with barbs), *lanceolate* (like a spearhead), *oblanceolate* (a backwards spearhead), *palmate* (leaflets radiating like fingers), *pandurate* (violin shaped), *reniform* (kidney shaped), *runcinate* (saw-toothed), *spatulate* (spoonlike). That’s not, by any means, a complete list.

Steven Vogel, in his 2012 book *The Life of a Leaf*, enumerates many factors and forces that might have an influence on leaf shape. For example, leaves can’t be too heavy, or they would break the limbs that hold them aloft. On the other hand, they can’t be too delicate and wispy, or they’ll be torn to shreds by the wind. Leaves also must not generate too much aerodynamic drag, or the whole tree might topple in a storm.

Job One for a leaf is photosynthesis: gathering sunlight, bringing together molecules of carbon dioxide and water, synthesizing carbohydrates. Doing that efficiently puts further constraints on the design. As much as possible, the leaf should turn its face to the sun, maximizing the flux of photons absorbed. But temperature control is also important; the biosynthetic apparatus shuts down if the leaf is too hot or too cold.

Vogel points out that subtle features of leaf shape can have a measurable impact on thermal and aerodynamic performance. For example, convective cooling is most effective near the margins of a leaf; temperature rises with distance from the nearest edge. In environments where overheating is a risk, shapes that minimize this distance—such as the *lobate* forms of oak leaves—would seem to have an advantage over simpler, disklike shapes. But the choice between frilly and compact forms depends on other factors as well. Broad leaves with convex shapes intercept the most sunlight, but that may not always be a good thing. Leaves with a lacy design let dappled sunlight pass through, allowing multiple layers of leaves to share the work of photosynthesis.

Natural selection is a superb tool for negotiating a compromise among such interacting criteria. If there is some single combination of traits that works best for leaves growing in a particular habitat, I would expect evolution to find it. But I see no evidence of convergence on an optimal solution. On the contrary, even closely related species put out quite distinctive leaves.

Take a look at the three oak leaves in the upper-left quadrant of the image above. They are clearly variations on a theme. What the leaves have in common is a sequence of peninsular protrusions springing alternately to the left and the right of the center line. The variations on the theme have to do with the number of peninsulas (three to five per side in these specimens), their shape (rounded or pointy), and the depth of the coves between peninsulas. Those variations could be attributed to genetic differences at just a few loci. But *why* have the leaves acquired these different characteristics? What evolutionary force makes rounded lobes better for white oak trees and pointy ones better for red oak and pin oak?

Much has been learned about the developmental mechanisms that generate leaf shapes. Biochemically, the main actors are the plant hormones known as auxins; their spatial distribution and their transport through plant tissues regulate local growth rates and hence the pattern of development. (A 2014 review article by Jeremy Dkhar and Ashwani Pareek covers these aspects of leaf form in great detail.) On the mathematical and theoretical side, Adam Runions, Miltos Tsiantis, and Przemyslaw Prusinkiewicz have devised an algorithm that can generate a wide spectrum of leaf shapes with impressive verisimilitude. (Their 2017 paper, along with source code and videos, is at algorithmicbotany.org/papers/leaves2017.html.) With different parameter values the same program yields shapes that are recognizable as oaks, maples, sycamores, and so on. Again, however, all this work addresses questions of *how*, not *why*.

Another property of tree leaves—their size—*does* seem to respond in a simple way to evolutionary pressures. Across all land plants (not just trees), leaf area varies by a factor of a million—from about 1 square millimeter per leaf to 1 square meter. A 2017 paper by Ian J. Wright and colleagues reports that this variation is strongly correlated with climate. Warm, moist regions favor large leaves; think of the banana. Cold, dry environments, such as alpine ridges, host mainly tiny plants with even tinier leaves. So natural selection is alive and well in the realm of tree leaves; it just appears to have no clear preferences when it comes to shape.

Or am I missing something important? Elsewhere in nature we find flamboyant variations that seem gratuitous if you view them strictly in the grim context of survival-of-the-fittest. I’m thinking of the fancy-dress feathers of birds, for example. Cardinals and bluejays both frequent my back yard, but I don’t spend much time wondering whether red or blue is the optimal color for survival in that habitat. Nor do I expect the two species to converge on some shade of purple. Their gaudy plumes are not adaptations to the physical environment but elements of a communication system; they send signals to rivals or potential mates. Could something similar be going on with leaf shape? Do the various oak species maintain distinctive leaves to identity themselves to animals that help with pollination or seed dispersal? I rate this idea unlikely, but I don’t have a better one.

Surely this question is too easy! We know why trees grow tall. They reach for the sky. It’s their only hope of escaping the gloomy depths of the forest’s lower stories and getting a share of the sunshine. In other words, if you are a forest tree, you need to grow tall because your neighbors are tall; they overshadow you. And the neighbors grow tall because you’re tall. It’s is a classic arms race. Vogel has an acute commentary on this point:

In every lineage that has independently come up with treelike plants, a variety of species achieve great height. That appears to me to be the height of stupidity…. We’re looking at, almost surely, an object lesson in the limitations of evolutionary design….

A trunk limitation treaty would permit all individuals to produce more seeds and to start producing seeds at earlier ages. But evolution, stupid process that it is, hasn’t figured that out—foresight isn’t exactly its strong suit.

Vogel’s trash-talking of Darwinian evolution is meant playfully, of course. But I think the question of height-limitation treaties*tree*ties?) deserves more serious attention.

Forest trees in the eastern U.S. often grow to a height of 25 or 30 meters, approaching 100 feet. It takes a huge investment of material and energy to erect a structure that tall. To ensure sufficient strength and stiffness, the girth of the trunk must increase as the \(\frac{3}{2}\) power of the height, and so the cross-sectional area \((\pi r^2)\) grows as the cube of the height. It follows that doubling the height of a tree trunk multiplies its mass by a factor of 16.

Great height imposes another, ongoing, metabolic cost. Every day, a living tree must lift 500 liters of water—weighing 500 kilograms—from the root zone at ground level up to the leaves in the crown. It’s like carrying enough water to fill four or five bathtubs from the basement of a building to the 10th floor.

Height also exacerbates certain hazards to the life and health of the tree. A taller trunk forms a longer lever arm for any force that might tend to overturn the tree. Compounding the risk, average wind speed increases with distance above the ground.

Standing on the forest floor, I tilt my head back and stare dizzily upward toward the leafy crowns, perched atop great pillars of wood. I can’t help seeing these plants on stilts as a colossal waste of resources. It’s even sillier than the needlelike towers of apartments for billionaires that now punctuate the Manhattan skyline. In those buildings, all the floors are put to *some* use. In the forest, the tree trunks are denuded of leaves and sometimes of branches over 90 percent of their length; only the penthouses are occupied.

If the trees could somehow get together and negotiate a deal—a zoning ordinance or a building code—they would *all* benefit. Perhaps they could decree a maximum height of 10 meters. Nothing would change about the crowns of the trees; the rule would simply chop off the bottom 20 meters of the trunk.

If every tree would gain from the accord, why don’t we see such amputated forests evolving in nature? The usual response to this why-can’t-everybody-get-along question is that evolution just doesn’t work that way. Natural selection is commonly taken to be utterly selfish and individualist, even when it hurts. A tree reaching the 10-meter limit would say to itself: “Yes, this is good; I’m getting plenty of light without having to stand on tiptoe. But it could be even better. If I stretched my trunk another meter or two, I’d collect an even bigger share of solar energy.” Of course the other trees reason with themselves in exactly the same way, and so the futile arms race resumes. As Vogel said, foresight is not evolution’s strong suit.

I am willing to accept this dour view of evolution, but I am not at all sure it actually explains what we see in the forest. If evolution has no place for cooperative action in a situation like this one, how does it happen that all the trees do in fact stop growing at about the same height? Specifically, if an agreement to limit height to 10 meters would be spoiled by rampant cheating, why doesn’t the same thing happen at 30 meters?

One might conjecture that 30 meters is a physiological limit, that the trees would grow taller if they could, but some physical constraint prevents it. Perhaps they just can’t lift the water any higher. I would consider this a very promising hypothesis if it weren’t for the sequoias and the coast redwoods in the western U.S. Those trees have not heard about any such physical barriers. They routinely grow to 70 or 80 meters, and a few specimens have exceeded 100 meters. Thus the question for the East Coast trees is not just “Why are you so tall?” but also “Why aren’t you taller?”

I can think of at least one good reason for forest trees to grow to a uniform height. If a tree is shorter than average, it will suffer for being left in the shade. But standing head and shoulders above the crowd also has disadvantages: Such a standout tree is exposed to stronger winds, a heavier load of ice and snow, and perhaps higher odds of lightning strikes. Thus straying too far either below or above the mean height may be punished by lower reproductive success. But the big question remains: How do all the trees reach consensus on what height is best?

Another possibility: Perhaps the height of forest trees is not a result of an arms-race after all but instead is a response to predation. The trees are holding their leaves on high to keep them away from herbivores. I can’t say this is wrong, but it strikes me as unlikely. No giraffes roam the woods of North America (and if they did, 10 meters would be more than enough to put the leaves out of reach). Most of the animals that nibble on tree leaves are arthropods, which can either fly (adult insects) or crawl up the trunk (caterpillars and other larvae). Thus height cannot fully protect the leaves; at best it might provide a deterrent. Tree leaves are not a nutritious diet; perhaps some small herbivores consider them worth a climb of 10 meters, but not 30.

To a biologist, a tree is a woody plant of substantial height. To a mathematician, a tree is a graph without loops. It turns out that math-trees and bio-trees have some important properties in common.

The diagram below shows two mathematical graphs. They are collections of dots (known more formally as vertices), linked by line segments (called edges). A graph is said to be *connected* if you can travel from any vertex to any other vertex by following some sequence of edges. Both of the graphs shown here are connected. Trees form a subspecies of connected graphs. They are *minimally* connected: Between any two vertices there is exactly one *x, y, x, z* is not a path.*a* to *b*. The graph at right is not a tree. There are two routes from *a* to *b* (red and yellow lines).

Here’s another way to describe a math-tree. It’s a graph that obeys the antimatrimonial rule: What branching puts asunder, let no one join together again. Bio-trees generally work the same way: Two limbs that branch away from the trunk will not later return to the trunk or fuse with each other. In other words, there are no cycles, or closed loops. The pattern of radiating branches that never reconverge is evident in the highly regular structure of the bio-tree pictured below. (The tree is a Norfolk Island pine, native to the South Pacific, but this specimen was photographed on Sardinia.)

Trees have achieved great success without loops in their branches. Why would a plant ever want to have its structural elements growing in circles?

I can think of two reasons. The first is mechanical strength and stability. Engineers know the value of triangles (the smallest closed loops) in building rigid structures. Also arches, where two vertical elements that could not stand alone lean on each other. Trees can’t take advantage of these tricks; their limbs are cantilevers, supported only at the point of juncture with the trunk or the parent limb. Loopy structures would allow for various kinds of bracing and buttressing.

The second reason is reliability. Providing multiple channels from the roots to the leaves would improve the robustness of the tree’s circulatory system. An injury near the base of a limb would no longer doom all the structures beyond the point of damage.

Networks with multiple paths between nodes are exploited elsewhere in nature, and even in other aspects of the anatomy of trees. The reticulated channels in the image below are veins distributing fluids and nutrients within a leaf from a red oak tree. The very largest veins (or ribs) have a treelike arrangement, but the smaller channels form a nested hierarchy of loops within loops. (The pattern reminds me of a map of an ancient city.) Because of the many redundant pathways, an insect taking a chomp out of the middle of this network will not block communication with the rest of the leaf.

The absence of loops in the larger-scale structure of trunk and branches may be a natural consequence of the developmental program that guides the growth of a tree. Aristid Lindenmayer, a Hungarian-Dutch biologist, invented a family of formal languages (now called L-systems) for describing such growth. The languages are rewriting systems: You start with a single symbol (the *axiom*) and replace it with a string of symbols specified by the rules of a grammar. Then the string resulting from this substitution becomes a new input to the same rewriting process, with each of its symbols being replaced by another string formed according to the grammar rules. In the end, the symbols are interpreted as commands for constructing a geometric figure.

Here’s an L-system grammar for drawing cartoonish two-dimensional trees:

f ⟶ f [r f] [l f] l ⟶ l r ⟶ r

The symbols `f`

, `l`

, and `r`

are the basic elements of the language; when interpreted as drawing commands, they stand for *forward*, *left*, and *right*. The first rule of the grammar replaces any occurrence of `f`

with the string `f [l f] [r f]`

; the second and third rules change nothing, replacing `l`

and `r`

with themselves. Square brackets enclose a subprogram. On reaching a left bracket, the system makes note of its current position and orientation in the drawing. Then it executes the instructions inside the brackets, and finally on reaching the right bracket backtracks to the saved position and orientation.

Starting with the axiom `f`

, the grammar yields a succession of ever-more-elaborate command sequences:

Stage 0: f Stage 1: f [r f] [l f] Stage 2: f [r f] [l f] [r f [r f] [l f]] [l f [r f] [l f]]]

When this rewriting process is continued for a few further stages and then converted to graphic output, we see a sapling growing into a young tree, with a shape reminiscent of an elm.*forward* step is reduced by a factor of 0.6. And all turns, both *left* and *right*, are through an angle of 20 degrees.

L-systems like this one can produce a rich variety of branching structures. More elaborate versions of the same program can create realistic images of biological trees. (The Algorithmic Botany website at the University of Calgary has an abundance of examples.) What the L-systems *can’t* do is create closed loops. That would require a fundamentally different kind of grammar, such as a transformation rule that takes two symbols or strings as input and produces a conjoined result. (Note that in the stage 5 diagram above, two branches of the tree appear to overlap, but they are not joined. The graph has no vertex at the intersection point.)

If the biochemical mechanisms governing the growth and development of trees operate with the same constraints as L-systems, we have a tidy explanation for the absence of loops in the branching of bio-trees. But perhaps the explanation is a little too tidy. I’ve been saying that trees don’t do loops, and it’s generally true. But what about the tree pictured below—a crepe myrtle I photographed some years ago on a street in Raleigh, North Carolina? (It reminds me of a sinewy Henry Moore sculpture.)

This plant is a tree in the botanical sense, but it’s certainly not a mathematical tree. A single trunk comes out of the ground and immediately divides. At waist height there are four branches, then three of them recombine. At chest height, there’s another split and yet another merger. This rogue tree is flouting all the canons and customs of treedom.

And the crepe myrtle is not the only offender. Banyan trees, native to India, have their horizontal branches propped up by numerous outrigger supports that drop to the ground. The banyan shown below, in Hilo, Hawaii, has a hollowed-out cavity where the trunk ought to be, surrounded by dozens or hundreds of supporting shoots, with cross-braces overhead. The L-system described above could never create such a network. But if the banyan can do this, why don’t other trees adopt the same trick?

In biology, the question “Why *x*?” is shorthand for “What is the evolutionary advantage of *x*?” or “How does *x* contribute to the survival and reproductive success of the organism?” Answering such questions often calls for a leap of imagination. We look at the mottled brown moth clinging to tree bark and propose that its coloration is camouflage, concealing the insect from predators. We look at a showy butterfly and conclude that its costume is aposematic—a warning that says, “I am toxic; you’ll be sorry if you eat me.”

These explanations risk turning into just-so stories,*les contes des pourquoi*.

And if we have a hard time imagining the experiences of animals, the lives of plants are even further beyond our ken. Does the flower lust for the pollen-laden bee? Does the oak tree grieve when its acorns are eaten by squirrels? How do trees feel about woodpeckers? Confronted with these questions, I can only shrug. I have no idea what plants desire or dread.

Others claim to know much more about vegetable sensibilities. Peter Wohlleben, a German forester, has published a book titled *The Hidden Life of Trees: What They Feel, How They Communicate*. He reports that trees suckle their young, maintain friendships with their neighbors, and protect sick or wounded members of their community. To the extent these ideas have a scientific basis, they draw heavily on work done in the laboratory of Suzanne Simard at the University of British Columbia. Simard, leader of the Mother Tree project, studies communication networks formed by tree roots and their associated soil fungi.

I find Simard’s work interesting. I find the anthropomorphic rhetoric unhelpful and offensive. The aim, I gather, is to make us care more about trees and forests by suggesting they are a lot like us; they have families and communities, friendships, alliances. In my view that’s exactly wrong. What’s most intriguing about trees is that they are aliens among us, living beings whose long, immobile, mute lives bear no resemblance to our own frenetic toing-and-froing. Trees are deeply mysterious all on their own, without any overlay of humanizing sentiment.

Dkhar, Jeremy, and Ashwani Pareek. 2014. What determines a leaf’s shape? *EvoDevo* 5:47.

McMahon, Thomas A. 1975. The mechanical design of trees. *Scientific American* 233(1):93–102.

Osnas, Jeanne L. D., Jeremy W. Lichstein, Peter B. Reich, and Stephen W. Pacala. 2013. Global leaf trait relationships: mass, area, and the leaf economics spectrum. *Science* 340:741–744.

Prusinkiewicz, Przemyslaw, and Aristid Lindenmayer, with James S. Hanan, F. David Fracchia, Deborah Fowler, Martin J. M. de Boer, and Lynn Mercer. 1990. *The Algorithmic Beauty of Plants*. New York: Springer-Verlag. PDF edition available at http://algorithmicbotany.org/papers/.

Runions, Adam, Martin Fuhrer, Brendan Lane, Pavol Federl, Anne-Gaëlle Rolland-Lagan, and Przemyslaw Prusinkiewicz. 2005. Modeling and visualization of leaf venation patterns. *ACM Transactions on Graphics* 24(3):702?711.

Runions, Adam, Miltos Tsiantis, and Przemyslaw Prusinkiewicz. 2017. A common developmental program can produce diverse leaf shapes. *New Phytologist* 216:401–418. Preprint and source code.

Tadrist, Loïc, and Baptiste Darbois Texier. 2016. Are leaves optimally designed for self-support? An investigation on giant monocots. arXiv:1602.03353.

Vogel, Steven. 2012. *The Life of a Leaf*. University of Chicago Press.

Wright, Ian J., Ning Dong, Vincent Maire, I. Colin Prentice, Mark Westoby, Sandra Díaz, Rachael V. Gallagher, Bonnie F. Jacobs, Robert Kooyman, Elizabeth A. Law, Michelle R. Leishman, Ülo Niinemets, Peter B. Reich, Lawren Sack, Rafael Villar, Han Wang, and Peter Wilf. 2017. Global climatic drivers of leaf size. *Science* 357:917–921.

Yamazaki, Kazuo. 2011. Gone with the wind: trembling leaves may deter herbivory. *Biological Journal of the Linnean Society* 104:738–747.

Young, David A. 2010 preprint. Growth-algorithm model of leaf shape. arXiv:1004.4388.

]]>When I follow Frost’s trail, it leads me into an unremarkable patch of Northeastern woodland, wedged between highways and houses and the town dump. It’s nowhere dark and deep enough to escape the sense of human proximity. This is not the forest primeval. Still, it is woodsy enough to bring to mind not only the rhymes of overpopular poets but also some tricky questions about trees and forests—questions I’ve been poking at for years, and that keep poking back. Why are trees so tall? Why aren’t they taller? Why do their leaves come in so many different shapes and sizes? Why are the trees trees (in the graph theoretical sense of that word) rather than some other kind of structure? And then there’s the question I want to discuss today:

Taking a quick census along the Frost trail, I catalog hemlock, sugar maple, at least three kinds of oak (red, white, and pin), beech and birch, shagbark hickory, white pine, and two other trees I can’t identify with certainty, even with the help of a Peterson guidebook and iNaturalist. The stand of woods closest to my home is dominated by hemlock, but on hillsides a few miles down the trail, broadleaf species are more common. The photograph below shows a saddle point (known locally as the Notch) between two peaks of the Holyoke Range, south of Amherst. I took the picture on October 15 last year—in a season when fall colors make it easier to detect the species diversity.

Forests like this one cover much of the eastern half of the United States. The assortment of trees varies with latitude and altitude, but at any one place the forest canopy is likely to include eight or ten species. A few isolated sites are even richer; certain valleys in the southern Appalachians, known as cove forests, have as many as 25 canopy species. And tropical rain forests are populated by 100 or even 200 tall tree species.

From the standpoint of ecological theory, all this diversity is puzzling. You’d think that in any given environment, one species would be slightly better adapted and would therefore outcompete all the others, coming to dominate the landscape. _{2}, water, various mineral nutrients—so the persistence of mixed-species woodlands begs for explanation.

Here’s a little demo of competitive exclusion. Two tree species—let’s call them olive and orange—share the same patch of forest, a square plot that holds 625 trees.

Initially, each site is randomly assigned a tree of one species or the other. When you click the *Start* button (or just tap on the array of trees), you launch a cycle of death and renewal. At each time step, one tree is chosen—entirely at random and without regard to species—to get the axe. Then another tree is chosen as the parent of the replacement, thereby determining its species. This latter choice is not purely random, however; there’s a bias. One of the species is better adapted to its environment, exploiting the available resources more efficiently, and so it has an elevated chance of reproducing and putting its offspring into the vacant site. In the control panel below the array of trees is a slider labeled “fitness bias”; nudging it left favors the orange species, right the olives.

The outcome of this experiment should not come as a surprise. The two species are playing a zero-sum game: Whatever territory olive wins, orange must lose, and vice versa. One site at a time, the fitter species conquers all. If the advantage is very slight, the process may take a while, but in the end the less-efficient organism is always banished. (What if the two species are exactly equal? I’ll return to that question in a moment, but for now let’s just pretend it never happens. And I have deviously jiggered the simulation so that you can’t set the bias to zero.)

Competitive exclusion does not forbid *all* cohabitation. Suppose olive and orange rely on two mineral nutrients in the soil—say, iron and calcium. Assume both of these elements are in short supply, and their availability is what limits growth in the populations of the trees. If olive trees are better at taking up iron and oranges assimilate calcium more effectively, then the two species may be able to reach an accommodation where both survive.

In this model, neither species is driven to extinction. At the default setting of the slider control, where iron and calcium are equally abundant in the environment, olive and orange trees also maintain roughly equal numbers on average. Random fluctuations carry them away from this balance point, but not very far or for very long. The populations are stabilized by a negative feedback loop. If a random perturbation increases the proportion of olive trees, each one of those trees gets a smaller share of the available iron, thereby reducing the species’ potential for further population growth. The orange trees are less affected by an iron deficiency, and so their population rebounds. But if the oranges then overshoot, they will be restrained by overuse of the limited calcium supply.

Moving the slider to the left or right alters the balance of iron and calcium in the environment. A 60:40 proportion favoring iron will shift the equilibrium between the two tree species, allowing the olives to occupy more of the territory. But, as long as the resource ratio is not too extreme, the minority species is in no danger of extinction. The two kinds of trees have a live-and-let-live arrangement.

In the idiom of ecology, the olive and orange species escape the rule of competitive exclusion because they occupy distinct niches, or roles in the ecosystem. They are specialists, preferentially exploiting different resources. The niches do not have to be completely disjoint. In the simulation above they overlap somewhat: The olives need calcium as well as iron, but only 25 percent as much; the oranges have mirror-image requirements.

Will this loophole in the law of competitive exclusion admit more than two species? Yes: *N* competing species can coexist if there are at least *N* independent resources or environmental strictures limiting their growth, and if each species has a different limiting factor. Everybody must have a specialty. It’s like a youth soccer league where every player gets a trophy for some unique, distinguishing talent.

This notion of slicing and dicing an ecosystem into multiple niches is a well-established practice among biologists. It’s how Darwin explained the diversity of finches on the Galapagos islands, where a dozen species distinguish themselves by habitat (ground, shrubs, trees) or diet (insects, seeds and nuts of various sizes). Forest trees might be organized in a similar way, with a number of microenvironments that suit different species. The process of creating such a diverse community is known as niche assembly.

Some niche differentiation is clearly present among forest trees. For example, gums and willows prefer wetter soil. In my local woods, however, I can’t detect any systematic differences in the sites colonized by maples, oaks, hickories and other trees. They are often next-door neighbors, on plots of land with the same slope and elevation, and growing in soil that looks the same to me. Maybe I’m just not attuned to what tickles a tree’s fancy.

Niche assembly is particularly daunting in the tropics, where it requires a hundred or more distinct limiting resources. Each tree species presides over its own little monopoly, claiming first dibs on some environmental factor no one else really cares about. Meanwhile, all the trees are fiercely competing for the most important resources, namely sunlight and water. Every tree is striving to reach an opening in the canopy with a clear view of the sky, where it can spread its leaves and soak up photons all day long. Given the existential importance of winning this contest for light, it seems odd to attribute the distinctive diversity of forest communities to squabbling over other, lesser resources.

Where niche assembly makes every species the winner of its own little race, another theory dispenses with all competition, suggesting the trees are not even trying to outrun their peers. They are just milling about at random. According to this concept, called neutral ecological drift, all the trees are equally well adapted to their environment, and the set of species appearing at any particular place and time is a matter of chance. A site might currently be occupied by an oak, but a maple or a birch would thrive there just as well. Natural selection has nothing to select. When a tree dies and another grows in its place, nature is indifferent to the species of the replacement.

This idea brings us back to a question I sidestepped above: What happens when two competing species are exactly equal in fitness? The answer is the same whether there are two species or ten, so for the sake of visual variety let’s look at a larger community.

If you have run the simulation—and if you’ve been patient enough to wait for it to finish—you are now looking at a monochromatic array of trees. I can’t know what the single color on your screen might be—or in other words which species has taken over the entire forest patch—but I know there’s just one species left. The other nine are extinct. In this case the outcome might be considered at least a little surprising. Earlier we learned that if a species has even a slight advantage over its neighbors, it will take over the entire system. Now we see that no advantage is needed. Even when all the players are exactly equal, one of them will emerge as king of the mountain, and everyone else will be exterminated. Harsh, no?

Here’s a record of one run of the program, showing the abundance of each species as a function of time:

At the outset, all 10 species are present in roughly equal numbers, clustered close to the average abundance of \(625/10\). As the program starts up, the grid seethes with activity as the sites change color rapidly and repeatedly. Within the first 70,000 times steps, however, all but three species have disappeared. The three survivors trade the lead several times, as waves of contrasting colors wash over the array. Then, after about 250,000 steps, the species represented by the bright green line drops to zero population—extinction. The final one-on-one stage of the contest is highly uneven—the orange species is close to total dominance and the crimson one is bumping along near extinction—but nonetheless the tug of war lasts another 100,000 steps. (Once the system reaches a monospecies state, nothing more can ever change, and so the program halts.)

This lopsided result is not to be explained by any sneaky bias hidden in the algorithm. At all times and for all species, the probability of gaining a member is exactly equal to the probability of losing a member. It’s worth pausing to verify this fact. Suppose species \(X\) has population \(x\), which must lie in the range \(0 \le x \le 625\). A tree chosen at random will be of species \(X\) with probability \(x/625\); therefore the probability that the tree comes from some other species must be \((625 - x)/625\). \(X\) gains one member if it is the replacement species but not the victim species, an event with a combined probability of \(x(625 - x)/625\). \(X\) loses one member if it is the victim but not the replacement, which has the same probability.

It’s a fair game. No loaded dice. Nevertheless, somebody wins the jackpot, and the rest of the players lose everything, every time.

The spontaneous decay of species diversity in this simulated patch of forest is caused entirely by random fluctuations. Think of the population \(x\) as a random walker wandering along a line segment with \(0\) at one end and \(625\) at the other. At each time step the walker moves one unit right \((+1)\) or left \((-1)\) with equal probability; on reaching either end of the segment, the game ends. The most fundamental fact about such a walk is that it *does* always end. A walk that meanders forever between the two boundaries is not impossible, but it has probability \(0\); hitting one wall or the other has probability \(1\).

How long should you expect such a random walk to last? In the simplest case, with a single walker, the expected number of steps starting at position \(x\) is \(x(625 - x)\). This expression has a maximum when the walk starts in the middle of the line segment; the maximum length is just under \(100{,}000\) steps. In the forest simulation with ten species the situation is more complicated because the multiple walks are correlated, or rather anti-correlated: When one walker steps to the right, another must go left. Computational experiments suggest that the median time needed for ten species to be whittled down to one is in the neighborhood of \(320{,}000\) steps.

From these computational models it’s hard to see how neutral ecological drift could be the savior of forest diversity. On the contrary, it seems to guarantee that we’ll wind up with a monoculture, where one species has wiped out all others. But this is not the end of the story.

One issue to keep in mind is the timescale of the process. In the simulation, time is measured by counting cycles of death and replacement among forest trees. I’m not sure how to convert that into calendar years, but I’d guess that 320,000 death-and-replacement events in a tract of 625 trees might take 50,000 years or more. Here in New England, that’s a very long time in the life of a forest. This entire landscape was scraped clean by the Laurentide ice sheet just 20,000 years ago. If the local woodlands are losing species to random drift, they would not yet have had time to reach the end game.

The trouble is, this thesis implies that forests start out diverse and evolve toward a monoculture, which is not supported by observation. If anything, diversity seems to *increase* with time. The cove forests of Tennessee, which are much older than the New England woods, have more species, not fewer. And the hyperdiverse ecosystem of the tropical rain forests is thought to be millions of years old.

Despite these conceptual impediments, a number of ecologists have argued strenuously for neutral ecological drift, most notably Stephen P. Hubbell in a 2001 book, *The Unified Neutral Theory of Biodiversity and Biogeography*. The key to Hubbell’s defense of the idea (as I understand it) is that 625 trees do not make a forest, and certainly not a planet-girdling ecosystem.

Hubbell’s theory of neutral drift was inspired by earlier studies of the biogeography of islands, in particular the collaborative work of Robert H. MacArthur and Edward O. Wilson in the 1960s. Suppose our little plot of \(625\) trees is growing on an island at some distance from a continent. For the most part, the island evolves in isolation, but every now and then a bird carries a seed from the much larger forest on the mainland. We can simulate these rare events by adding a facility for immigration to the neutral-drift model. In the panel below, the slider controls the immigration rate. At the default setting of \(1/100\), every \(100\)th replacement tree comes not from the local forest but from a stable reserve where all \(10\) species have an equal probability of being selected.

For the first few thousand cycles, the evolution of the forest looks much like it does in the pure-drift model. There’s a brief period of complete tutti-frutti chaos, then waves of color erupt over the forest as it blushes pink, then deepens to crimson, or fades to a sickly green. What’s different is that none of those expanding species ever succeeds in conquering the entire array. As shown in the timeline graph below, they never grow much beyond 50 percent of the total population before they retreat into the scrum of other species. Later, another tree color makes a bid for empire but meets the same fate. (Because there is no clear endpoint to this process, the simulation is designed to halt after 500,000 cycles. If you haven’t seen enough by then, click *Resume*.)

Immigration, even at a low level, brings a qualitative change to the behavior of the model and the fate of the forest. The big difference is that we can no longer say extinction is forever. A species may well disappear from the 625-tree plot, but eventually it will be reimported from the permanent reserve. Thus the question is not whether a species is living or extinct but whether it is present or absent at a given moment. At an immigration rate of \(1/100\), the average number of species present is about \(9.6\), so none of them disappear for long.

With a higher level of immigration, the 10 species remain thoroughly mixed, and none of them can ever make any progress toward world domination. On the other hand, they have little risk of disappearing, even temporarily. Push the slider control all the way to the left, setting the immigration rate at \(1/10\), and the forest display becomes an array of randomly blinking lights. In the timeline graph below, there’s not a single extinction.

Pushing the slider in the other direction, rarer immigration events allow the species distribution to stray much further from equal abundance. In the trace below, with an immigrant arriving every \(1{,}000\)th cycle, the population is dominated by one or two species for most of the time; other species are often on the brink of extinction—or *over* the brink—but they come back eventually. The average number of living species is about 4.3, and there are moments when only two are present.

Finally, with a rate of \(1/10{,}000\), the effect of immigration is barely noticeable. As in the model without immigration, one species invades all the terrain; in the example recorded below, this takes about \(400{,}000\) steps. After that, occasional immigration events cause a small blip in the curve, but it will be a very long time before another species is able to displace the incumbent.

The island setting of this model makes it easy to appreciate how sporadic, weak connections between communities can have an outsize influence on their development. But islands are not essential to the argument. Trees, being famously immobile, have only occasional long-distance communication, even when there’s no body of water to separate them. (It’s a rare event when Birnam Wood marches off to Dunsinane.) Hubbell formulates a model of ecological drift in which many small patches of forest are organized into a hierarchical metacommunity. Each patch is both an island and part of the larger reservoir of species diversity. If you choose the right patch sizes and the right rates of migration between them, you can maintain multiple species at equilibrium. Hubbell also allows for the emergence of entirely new species, which is also taken to be a random or selection-neutral process.

Niche assembly and neutral ecological drift are theories that elicit mirror-image questions from skeptics. With niche assembly we look at dozens or hundreds of coexisting tree species and ask, “Can every one of them have a unique limiting resource?” With neutral drift we ask, “Can all of those species be exactly equal in fitness?”

Hubbell responds to the latter question by turning it upside down. The very fact that we observe coexistence implies equality:

All species that manage to persist in a community for long periods with other species must exhibit net long-term population growth rates of nearly zero…. If this were not the case, i.e., if some species should manage to achieve a positive growth rate for a considerable length of time, then from our first principle of the biotic saturation of landscapes, it must eventually drive other species from the community. But if all species have the same net population growth rate of zero on local to regional scales, then ipso facto they must have identical or nearly identical per capita relative fitnesses.

Herbert Spencer proclaimed: Survival of the fittest. Here we have a corollary: If they’re all survivors, they must all be equally fit.

Now for something completely different.

Another theory of forest diversity was devised specifically to address the most challenging case—the extravagant variety of trees in tropical ecosystems. In the early 1970s J. H. Connell and Daniel H. Janzen, field biologists working independently in distant parts of the world, almost simultaneously came up with the same idea.

A tropical rain forest is a tough neighborhood. Trees are under frequent attack by marauding gangs of predators, parasites, and pathogens. (Connell lumped these bad guys together under the label “enemies.”) Many of the enemies are specialists, targeting only trees of a single species. The specialization can be explained by competitive exclusion: Each tree species becomes a unique resource supporting one type of enemy.

Suppose a tree is beset by a dense population of host-specific enemies. The swarm of meanies attacks not only the adult tree but also any offspring of the host that have taken root near their parent. Since young trees are more vulnerable than adults, the entire cohort could be wiped out. Seedlings at a greater distance from the parent should have a better chance of remaining undiscovered until they have grown large and robust enough to resist attack. In other words, evolution might favor the rare apple that falls far from the tree. Janzen illustrated this idea with a graphical model something like the one at right. As distance from the parent increases, the probability that a seed will arrive and take root grows smaller *(red curve)*, but the probability that any such seedling will survive to maturity goes up *(blue curve)*. The overall probability of successful reproduction is the product of these two factors *(purple curve)*; it has a peak where the red and blue curves cross.

The Connell-Janzen theory predicts that trees of the same species will be widely dispersed in the forest, leaving plenty of room in between for trees of other species, which will have a similarly scattered distribution. The process leads to anti-clustering: conspecific trees are farther apart on average than they would be in a completely random arrangement. This pattern was noted by Alfred Russel Wallace in 1878, based on his own long experience in the tropics:

If the traveller notices a particular species and wishes to find more like it, he may often turn his eyes in vain in every direction. Trees of varied forms, dimensions, and colours are around him, but he rarely sees any one of them repeated. Time after time he goes towards a tree which looks like the one he seeks, but a closer examination proves it to be distinct. He may at length, perhaps, meet with a second specimen half a mile off, or may fail altogether, till on another occasion he stumbles on one by accident.

My toy model of the social-distancing process implements a simple rule. When a tree dies, it cannot be replaced by another tree of the same species, nor may the replacement match the species of any of the eight nearest neighbors surrounding the vacant site. Thus trees of the same species must have at least one other tree between them. To say the same thing in another way, each tree has an exclusion zone around it, where other trees of the same species cannot grow.

It turns out that social distancing is a remarkably effective way of preserving diversity. When you click *Start*, the model comes to life with frenetic activity, blinking away like the front panel of a 1950s Hollywood computer. Then it just keeps blinking; nothing else ever really happens. There are no spreading tides of color as a successful species gains ground, and there are no extinctions. The variance in population size is even lower than it would be with a completely random and uniform assignment of species to sites. This stability is apparent in the timeline graph below, where the 10 species tightly hug the mean abundance of 62.5:

When I finished writing this program and pressed the button for the first time, long-term survival of all ten species was not what I expected to see. My thoughts were influenced by some pencil-and-paper doodling. I had confirmed that only four colors are needed to create a pattern where no two trees of the same color are adjacent horizontally, vertically, or on either of the diagonals. One such pattern is shown at right. I suspected that the social-distancing protocol might cause the model to condense into such a crystalline state, with the loss of species that don’t appear in the repeated motif. I was wrong. Although four is indeed the minimum number of colors for a socially distanced two-dimensional lattice, there is nothing in the algorithm that encourages the system to seek the minimum.

After seeing the program in action, I was able to figure out what keeps all the species alive. There’s an active feedback process that puts a premium on rarity. Suppose that oaks currently have the lowest frequency in the population at large. As a result, oaks are least likely to be present in the exclusion zone surrounding any vacancy in the forest, which means in turn they are most likely to be acceptable as a replacement. As long as the oaks remain rarer than the average, their population will tend to grow. Symmetrically, individuals of an overabundant species will have a harder time finding an open site for their offspring. All departures from the mean population level are self-correcting.

The initial configuration in this model is completely random, ignoring the restrictions on adjacent conspecifics. Typically there are about 200 violations of the exclusion zone in the starting pattern, but they are all eliminated in the first few thousand time steps. Thereafter the rules are obeyed consistently. Note that with ten species and an exclusion zone consisting of nine sites, there is always at least one species available to fill a vacancy. If you try the experiment with nine or fewer species, some vacancies must be left as gaps in the forest. I should also mention that the model uses toroidal boundary conditions: the right edge of the grid is adjacent to the left edge, and the top wraps around to the bottom. This ensures that all sites in the lattice have exactly eight neighbors.

Connell and Janzen envisioned much larger exclusion zones, and correspondingly larger rosters of species. Implementing such a model calls for a much larger computation. A recent paper by Taal Levi *et al.* reports on such a simulation. They find that the number of surviving species and their spatial distribution remain reasonably stable over long periods (200 billion tree replacements).

Could the Connell-Janzen mechanism also work in temperate-zone forests? As in the tropics, the trees of higher latitudes do have specialized enemies, some of them notorious—the vectors of Dutch elm disease and chestnut blight, the emerald ash borer, the gypsy moth caterpillars that defoliate oaks. The hemlocks in my neighborhood are under heavy attack by the woolly adelgid, a sap-sucking bug. Thus the forces driving diversification and anti-clustering in the Connell-Janzen model would seem to be present here. However, the observed spatial structure of the northern forests is somewhat different. Social distancing hasn’t caught on here. The distribution of trees tends to be a little clumpy, with conspecifics gathering in small groves.

Plague-driven diversification is an intriguing idea, but, like the other theories mentioned above, it has certain plausibility challenges. In the case of niche assembly, we need to find a unique limiting resource for every species. In neutral drift, we have to ensure that selection really is neutral, assigning exactly equal fitness to trees that look quite different. In the Connell-Janzen model we need a specialized pest for every species, one that’s powerful enough to suppress all nearby seedlings. Can it be true that *every* tree has its own deadly nemesis?

*Invade* more than once, since a new arrival may die out before becoming established. Also note that I have slowed down this simulation, lest it all be over in a flash.*Invade* button.

Lacking enemies, the invader can flout the social-distancing rules, occupying any forest vacancy regardless of neighborhood. Once the invader has taken over a majority of the sites, the distancing rules become less onerous, but by then it’s too late for the other species.

One further half-serious thought on the Connell-Janzen theory: In the war between trees and their enemies, humanity has clearly chosen sides. We would wipe out those insects and fungi and other tree-killing pests if we could figure out how to do so. Everyone would like to bring back the elms and the chestnuts, and save the eastern hemlocks before it’s too late. On this point I’m as sentimental as the next treehugger. But if Connell and Janzen are correct, and if their theory applies to temperate-zone forests, eliminating all the enemies would actually cause a devastating collapse of tree diversity. Without pest pressure, competitive exclusion would be unleashed, and we’d be left with one-tree forests everywhere we look.

Species diversity in the forest is now matched by theory diversity in the biology department. The three ideas I have discussed here—niche assembly, neutral drift, and social distancing—all seem to be coexisting in the minds of ecologists. And why not? Each theory is a success in the basic sense that it can overcome competitive exclusion. Each theory also makes distinctive predictions. With niche assembly, every species must have a unique limiting resource. Neutral drift generates unusual population dynamics, with species continually coming and going, although the overall number of species remains stable. Social distancing entails spatial anticlustering.

How can we choose a winner among these theories (and perhaps others)? Scientific tradition says nature should have the last word. We need to conduct some experiments, or at least go out in the field and make some systematic observations, then compare those results with the theoretical predictions.

There have been quite a few experimental tests of competitive exclusion. For example, Thomas Park and his colleagues ran a decade-long series of experiments with two closely related species of flour beetles. One species or the other always prevailed. In 1969 Francisco Ayala reported on a similar experiment with fruit flies, in which he observed coexistence under circumstances that were thought to forbid it. Controversy flared, but in the end the result was not to overturn the theory but to refine the mathematical description of where exclusion applies.

Wouldn’t it be grand to perform such experiments with trees? Unfortunately, they are not so easily grown in glass vials. And conducting multigenerational studies of organisms that live longer than we do is a tough assignment. With flour beetles, Park had time to observe more than 100 generations in a decade. With trees, the equivalent experiment might take 10,000 years. But field workers in biology are a resourceful bunch, and I’m sure they’ll find a way. In the meantime, I want to say a few more words about theoretical, mathematical, and computational approaches to the problem.

Ecology became a seriously mathematical discipline in the 1920s, with the work of Alfred J. Lotka and Vito Volterra. To explain their methods and ideas, one might begin with the familiar fact that organisms reproduce themselves, thereby causing populations to grow. Mathematized, this observation becomes the differential equation

\[\frac{d x}{d t} = \alpha x,\]

which says that the instantaneous rate of change in the population \(x\) is proportional to \(x\) itself—the more there are, the more there will be. The constant of proportionality \(\alpha\) is called the intrinsic reproduction rate; it is the rate observed when nothing constrains or interferes with population growth. The equation has a solution, giving \(x\) as a function of \(t\):

\[x(t) = x_0 e^{\alpha t},\]

where \(x_0\) is the initial population. This is a recipe for unbounded exponential growth (assuming that \(\alpha\) is positive). In a finite world such growth can’t go on forever, but that needn’t worry us here.

\[\begin{align}

\frac{d x}{d t} &= \alpha x -\gamma x y\\

\frac{d y}{d t} &= -\beta y + \delta x y

\end{align}\]

The prey species \(x\) prospers when left to itself, but suffers as the product \(x y\) increases. The situation is just the opposite for the predator \(y\), which can’t get along alone (\(x\) is its only food source) and whose population swells when \(x\) and \(y\) are both abundant.

Competition is a more symmetrical relation: Either species can thrive when alone, and the interaction between them is negative for both parties.

\[\begin{align}

\frac{d x}{d t} &= \alpha x -\gamma x y\\

\frac{d y}{d t} &= \beta y - \delta x y

\end{align}\]

The Lotka-Volterra equations yield some interesting behavior. At any instant \(t\), the state of the two-species system can be represented as a point in the \(x, y\) plane, whose coordinates are the two population levels. For some combinations of the \(\alpha, \beta, \gamma, \delta\) parameters, there’s a point of stable equilibrium. Once the system has reached this point, it stays put, and it returns to the same neighborhood following any small perturbation. Other equilibria are unstable: The slightest departure from the balance point causes a major shift in population levels. And the *really* interesting cases have no stationary point; instead, the state of the system traces out a closed loop in the \(x, y\) plane, continually repeating a cycle of states. The cycles correspond to oscillations in the two population levels. Such oscillations have been observed in many predator-prey systems. Indeed, it was curiosity about the periodic swelling and contraction of populations in the Canadian fur trade and Adriatic fisheries that inspired Lotka and Volterra to work on the problem.

The 1960s and 70s brought more surprises. Studies of equations very similar to the Lotka-Volterra system revealed the phenomenon of “deterministic chaos,” where the point representing the state of the system follows an extremely complex trajectory, though it’s wandering are not random. There ensued a lively debate over complexity and stability in ecosystems. Is chaos to be found in natural populations? Is a community with many species and many links between them more or less stable than a simpler one?

Viewed as abstract mathematics, there’s much beauty in these equations, but it’s sometimes a stretch mapping the math back to the biology. For example, when the Lotka-Volterra equations are applied to species competing for resources, the resources appear nowhere in the model. The mathematical structure describes something more like a predator-predator interaction—two species that eat each other.

Even the organisms themselves are only a ghostly presence in these models. The differential equations are defined over the continuum of real numbers, giving us population levels or densities, but not individual plants or animals—discrete things that we can count with integers. The choice of number type is not of pressing importance as long as the populations are large, but it leads to some weirdness when a population falls to, say, 0.001—a millitree. Using finite-difference equations instead of differential equations avoids this problem, but the mathematics gets messier.

Another issue is that the equations are rigidly deterministic. Given the same inputs, you’ll always get exactly the same outputs—even in a chaotic model. Determinism rules out modeling anything like neutral ecological drift. Again, there’s a remedy: stochastic differential equations, which include a source of noise or uncertainty. With models of this kind, the answers produced are not numbers but probability distributions. You don’t learn the population of \(x\) at time \(t\); you get a probability \(P(x, t)\) in a distribution with a certain mean and variance. Another approach, called Markov Chain Monte Carlo (MCMC), uses a source of randomness to sample from such distributions. But the MCMC method moves us into the realm of computational models rather than mathematical ones.

Computational methods generally allow a direct mapping between the elements of the model and the things being modeled. You can open the lid and look inside to find the trees and the resources, the births and the deaths. These computational objects are not quite tangible, but they’re discrete, and always finite. A population is neither a number nor a probability distribution but a collection of individuals. I find models of this kind intellectually less demanding. Writing a differential equation that captures the dynamics of a biological system requires insight and intuition. Writing a program to implement a few basic events in the life of a forest—a tree dies, another takes its place—is far easier.

The six little models included in this essay serve mainly as visualizations; they expend most of their computational energy painting colored dots on the screen. But larger, more ambitious models are certainly feasible, as in the work of Taal Levi et al. mentioned above.

However, if computational models are easier to create, they can also be harder to interpret. If you run a model once and species \(X\) goes extinct, what can you conclude? Not much. On the next run \(X\) and \(Y\) might coexist. To make reliable inferences, you need to do some statistics over a large ensemble of runs—so once again the answer takes the form of a probability distribution.

The concreteness and explicitness of Monte Carlo models is generally a virtue, but it has a darker flip side. Where a differential equation model might apply to any “large” population, that vague description won’t work in a computational context. You have to name a number, even though the choice is arbitrary. The size of my forest models, 625 trees, was chosen for mere convenience. With a larger grid, say \(100 \times 100\), you’d have to wait millions of time steps to see anything interesting happen. Of course the same issue arises with experiments in the lab or in the field.

Both kinds of model are always open to a charge of oversimplifying. A model is the Marie Kondo version of nature—relentlessly decluttered and tidied up. Sometime important parts get tossed out. In the case of the forest models, it troubles me that trees have no life history. One dies, and another pops up full grown. Also missing from the models are pollination and seed dispersal, and rare events such a hurricanes and fires that can reshape entire forests. Would we learn more if all those aspects of life in the woods had a place in the equations or the algorithms? Perhaps, but where do you stop?

My introduction to models in ecology came through a book of that title by John Maynard Smith, published in 1974. I recently reread it, learning more than I did the first time through. Maynard Smith makes a distinction between simulations, useful for answering questions about specific problems or situations, and models, useful for testing theories. He offers this advice: “Whereas a good simulation should include as much detail as possible, a good model should include as little as possible.”

Ayala, F. J. 1969. Experimental invalidation of the principle of competitive exclusion. *Nature* 224:1076–1079.

Clark, James S. 2010. Individuals and the variation needed for high species diversity in forest trees. *Science* 327:1129–1132.

Connell, J. H. 1971. On the role of natural enemies in preventing competitive exclusion in some marine animals and in rain forest trees. In *Dynamics of Populations*, P. J. Den Boer and G. Gradwell, eds., Wageningen, pp. 298–312.

Gilpin, Michael E., and Keith E. Justice. 1972. Reinterpretation of the invalidation of the principle of competitive exclusion. *Nature* 236:273–301.

Hardin, Garrett. 1960. The competitive exclusion principle. *Science* 131(3409): 1292–1297.

Hubbell, Stephen P. 2001. *The Unified Neutral Theory of Biodiversity and Biogeography*. Princeton, NJ: Princeton University Press.

Hutchinson, G. E. 1959. Homage to Santa Rosalia, or why are there so many kinds of animals? *The American Naturalist* 93:145–159.

Janzen, Daniel H. 1970. Herbivores and the number of tree species in tropical forests. *The American Naturalist* 104(940):501–528.

Kricher, John. C. 1988. *A Field Guide to Eastern Forests, North America.* The Peterson Field Guide Series. Illustrated by Gordon Morrison. Boston: Houghton Mifflin.

Levi, Taal, Michael Barfield, Shane Barrantes, Christopher Sullivan, Robert D. Holt, and John Terborgh. 2019. Tropical forests can maintain hyperdiversity because of enemies. *Proceedings of the National Academy of Sciences of the USA* 116(2):581–586.

Levin, Simon A. 1970. Community equilibria and stability, and an extension of the competitive exclusion principle. *The American Naturalist*, 104(939):413–423.

MacArthur, R. H., and E. O. Wilson. 1967. *The Theory of Island Biogeography.* Monographs in Population Biology. Princeton University Press, Princeton, NJ.

May, Robert M. 1973. Qualitative stability in model ecosystems. *Ecology*, 54(3):638–641.

Maynard Smith, J. 1974. *Models in Ecology.* Cambridge: Cambridge University Press.

Richards, Paul W. 1973. The tropical rain forest. *Scientific American* 229(6):58–67.

Schupp, Eugene W. 1992. The Janzen-Connell model for tropical tree diversity: population implications and the importance of spatial scale. *The American Naturalist* 140(3):526–530.

Strobeck, Curtis. 1973. *N* species competition. *Ecology*, 54(3):650–654.

Tilman, David. 2004. Niche tradeoffs, neutrality, and community structure: A stochastic theory of resource competition, invasion, and community assembly. *Proceedings of the National Academy of Sciences of the USA* 101(30):10854–10861.

Wallace, Alfred R. 1878. *Tropical Nature, and Other Essays.* London: Macmillan and Company.

The economy’s swan dive is truly breathtaking. In response to the coronavirus threat we have shut down entire commercial sectors: most retail stores, restaurants, sports and entertainment. Travel and tourism are moribund. Manufacturing is threatened too, not only by concerns about workplace contagion but also by softening demand and disrupted supply chains. All of the automakers have closed their assembly plants in the U.S., and Boeing has stopped production at its plants near Seattle, which employ 70,000. Thus it comes as no great surprise—though it’s still a shock—that 3,283,000 Americans filed claims for unemployment compensation last week. That’s by far the highest weekly tally since the program was created in the 1930s. It’s almost five times the previous record from 1982, and 15 times the average for the first 10 weeks of this year. The graph is a dramatic hockey stick:

Here’s the same graph, updated to include new unemployment claims for the weeks ending 28 March and 4 April. The four-week total of new claims is over 16 million, which is roughly 10 percent of the American workforce. [Edited 2020-04-02 and 2020-04-09.]

I’ve been brooding about the economic collapse for a couple of weeks. I worry that the consequences of unemployment and business failures could be even more dire than the direct harm caused by the virus. Recovering from a deep recession can take years, and those who suffer most are the poor and the young. I don’t want to see millions of lives blighted and the dreams of a generation thwarted. But Covid-19 is still rampant. Relaxing our defenses could swamp the hospitals and elevate the death rate. No one is eager to take that risk (except perhaps Donald Trump, who dreams of an Easter resurrection).

The other day I was squabbling about these economic perils with the person I shelter-in-place with. Yes, she said, we’re facing a steep decline, but what makes you so sure it’s going to last for years? Why can’t the economy bounce back? I patiently mansplained about the irreversibility of events like bankruptcy and eviction and foreclosure, which are almost as hard to undo as death. That argument didn’t settle the matter, but we let the subject drop. (We’re hunkered down 24/7 here; we need to get along.)

In the middle of the night, the question came back to me. Why *won’t* it bounce back? Why can’t we just pause the economy like a video, then a month or two later press the play button to resume where we left off?

One problem with pausing the economy is that people can’t survive in suspended animation. They need a continuous supply of air, water, food, shelter, TV shows, and toilet paper. You’ve got to keep that stuff coming, no matter what. But people are only part of the economy. There are also companies, corporations, unions, partnerships, non-profit associations—all the entities we create to organize the great game of getting and spending. A company, considered as an abstraction, has no need for uninterrupted life support. It doesn’t eat or breathe or get bored. So maybe companies could be put in the deep freeze and then thawed when conditions improve.

Lying awake in the dark, I told myself a story:

Clare owns a little café at the corner of Main and Maple in a New England college town. In the middle of March, when the college sent the students home, she lost half her customers. Then, as the epidemic spread, the governor ordered all restaurants to close. Clare called up Rory the Roaster to cancel her order for coffee beans, pulled her ad from the local newspaper, and taped a “C U Soon” sign to the door. Then she sat down with her only employee, Barry the Barista, to talk about the bad news.

Barry was distraught. “I have rent coming due, and my student loan, and a car payment.”

“I wish I could be more help,” Clare replied. “But the rent on the café is also due. If I don’t pay it, we could lose the lease, and you won’t have a job to come back to. We’ll both be on the street.” They sat glumly in the empty shop, six feet apart. Seeing the lost-puppy look in Barry’s eyes, Clare added: “Let me call up Larry the Landlord and see if we can work something out.”

Larry was sympathetic. He’d been hearing from lots of tenants, and he genuinely wanted to help. But he told Clare what he’d told the rest: “The building has a mortgage. If I don’t pay the bank, I’ll lose the place, and we’ll all be on the street.”

You can guess what Betty the Banker said. “I have obligations to my depositors. Accounts earn interest every month. People are redeeming CDs. If I don’t maintain my cash reserves, the FDIC will come in and seize our assets. We’ll all be on the street.”

Everyone in this little fable wants to do the right thing. No one wants to put Clare out of business or leave Barry without an income. And yet my nocturnal meditations come to a dark end, in which the failure of Clare’s corner coffee shop triggers a worldwide recession. Barry gets evicted, Larry defaults on his loan, Betty’s bank goes bottom up. Rory the Roaster also goes under, and the Colombian farm that supplies the beans lays off all its workers. With Clare’s place now an empty storefront, there are fewer shoppers on Main Street, and the bookstore a few doors away folds up. The newspaper where Clare used to advertise ceases publication. The town’s population dwindles. The college closes.

At this point I feel like Ebenezer Scrooge pleading with the Ghost of Christmas Future to save Tiny Tim, or George Bailey desperate to escape the mean streets of Potterville and get back to the human warmth of Bedford Falls. Surely there must be some way to avert this catastrophe.

Here’s my idea. The rent and loan payments that cause all this economic mayhem are different from the transactions that Clare handles at her cash register. In her shopkeeper economy, money comes in only when coffee goes out; the two events are causally connected and simultaneous. And if she’s not selling any coffee, she can stop buying beans. The payment of her rent, on the other hand, is triggered by nothing but the ticking of the clock. She is literally buying time. Now the remedy is obvious: Stop the clock, or reset it. This is easier than you might think. We just go skipping down the Yellow Brick Road and petition the wizard to issue a proclamation. The wizard’s decree says this:

In the year 2020, April 30 shall be followed by April 1.

*Redux* is Latin for “a thing brought back or restored.” The word was introduced—or brought back—into the modern American vocabulary by John Updike’s 1971 novel *Rabbit Redux*, having been used earlier in titles of works by Dryden and Trollope. It’s one of those words I’ve always avoided saying aloud because of doubt about the pronunciation. The OED says it’s *re-ducks*.

How does this fiddling with the calendar help Clare? Consider what happens when the calendar flips from April 30 to April 1 Redux. It’s the first of the month, and the rent is due. But wait! No it’s not. She already paid the rent for April, a month ago. It won’t be due again until May 1, and that’s a month away. It’s the same with Larry’s mortgage payment, and Barry’s car loan. Of course stopping the clock cuts both ways. If you get a monthly pension or Social Security payment, that won’t be coming in April Redux, nor will the bank pay you interest on your deposits.

By means of this sly maneuver we have broken a vicious cycle. Larry doesn’t get a rent check from Clare, but he also doesn’t have to write a mortgage-loan check to Betty, who doesn’t have to make payments to her depositors and creditors. Each of them gets a month’s reprieve. With this extra slack, maybe Clare can keep Barry on the payroll and still have a viable business when her customers finally come out of hiding.

But isn’t this just a sneaky scheme to deprive the creditor class of money they are legally entitled to receive under the terms of contracts that both parties willingly signed? Yes it is, and a clever one at that. It is also a way to more equitably distribute the risks and costs of the present crisis. At the moment the burden falls heavily on Clare and Barry, who are forbidden to sell me a cup of coffee; but Larry and Betty are free to go on collecting their rents and loan payments. In addition to spreading around the financial pain, the scheme might also reduce the likelihood of a major, lasting economic contraction, which none of these characters would enjoy.

In spite of these appeals to the greater good of society as a whole, you may still feel there’s something dishonest about April Redux. If so, we can have the wizard issue a second decree:

In the 30 months from May 2020 through November 2022,

every month shall have one day fewer than the usual number.

every month shall have one day fewer than the usual number.

During this period every scheduled payment will come due a day sooner than usual. At the end, lenders and borrowers are even-steven.

The last time anybody tinkered with the calendar in the English-speaking world was 1752, when the British isles and their colonies finally adopted the Gregorian calendar (introduced elsewhere as early as 1592). *Past & Present*, no. 149, 1995, pp. 95–139. JSTOR (paywall).*was* concern and controversy about the proper calculation of wages, rents, and interest in the abbreviated September.

Riots in the streets are clearly a no-no in this period of social distancing, so presumably we won’t have to worry about mob action when April repeats itself. Besides, who’s going to complain about having 30 days *added* to their lifespan? I suppose there may be some grumbling from people with April birthdays, who think they are suddenly two years older. And back-to-back April Fool days could test the nation’s patience.

Although my plan for an April do-over is presented in the spirit of the season, I do think it illuminates a serious issue—an aspect of modern commerce that makes the current situation especially dangerous. Our problem is not that we have shut down the whole economy. The problem is that we’ve shut down only *half* the economy. The other half carries on with business as usual, creating imbalances that leave the whole edifice teetering on the brink of collapse.

The $2 trillion rescue package enacted last week addresses some of these issues. The cash handout for individual taxpayers, and a sweetening of unemployment benefits, should help Barry muddle through and pay his bills. A program of loans for small businesses could keep Clare afloat, and the loan would be forgiven if she keeps Barry on the payroll. These are thoughtful and useful measures, and a refreshing change from earlier bailout practices. We are not sending all the funds directly to investment banks and insurance companies. But a big share will wind up there anyway, since we are effectively subsidizing the rent and mortgage payments of individuals and small businesses. I wonder if it wouldn’t be fairer, more effective, and less expensive to curtail some of those payments. I’m not suggesting that we shut down the banks along with the shops; that would make matters worse. But we might require financial institutions to defer or forgo certain payments from distressed small businesses and the employs they lay off.

Voluntary efforts along these lines promise to soften the impact for at least a few lucky workers and businesses that have lost their revenue stream. In my New England college town, some of the banks are offering to defer monthly payments on mortgage loans, and there’s social pressure on landlords to do defer rents.

But don’t count on everyone to follow that program. On March 31, following announcements of layoffs and furloughs by Macy’s, Kohl’s, and other large retailers, the *New York Times* reported: “Last week, Taubman, a large owner of shopping malls, sent a letter to its tenants saying that the company expected them to keep paying their rent amid the crisis. Taubman, which oversees well-known properties like the Mall at Short Hills in New Jersey, reminded its tenants that it also had obligations to meet, and was counting on the rent to pay lenders and utilities.” [Added 2020-03-31.]

The coronavirus crisis is being treated as a unique event (and I certainly hope we’ll never see the like of it again). The associated economic crisis is also unique, at least within my memory. Most panics and recessions have their roots in the financial markets. At some point investors realize that tech stocks with an infinite price-to-earnings ratio are not such a bargain after all, or that bundling together thousands of risky mortgages doesn’t actually make them less risky. When the bubble bursts, the first casualties are on Wall Street; only later do the ripple effects reach Clare’s café. Now, we are seeing a rare disturbance that travels in the opposite direction. Do we know how to fix it?

]]>`img`

tag. The process was cumbersome and the product was ugly. In 2009 I wrote an `e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots`

and it would appear on your screen as:

\[e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots\]

All the work of parsing the TeX code and typesetting the math was done by a JavaScript program downloaded into your browser along with the rest of the web page.

Cervone’s jsMath soon evolved into MathJax, an open-source project initially supported by the AMS and SIAM. There are now about two dozen sponsors, and the project is under the aegis of NumFOCUS.

MathJax has made a big difference in my working life, transforming a problem into a pleasure. Putting math on the web is fun! Sometimes I do it just to show off. Furthermore, the software has served as an inspiration as well as a helpful tool. Until I saw MathJax in action, it simply never occurred to me that interesting computations could be done within the JavaScript environment of a web browser, which I had thought was there mainly to make things blink and jiggle. With the example of MathJax in front of me, I realized that I could not only display mathematical ideas but also explore and animate them within a web page.

Last fall I began hearing rumors about MathJax 3.0, “a complete rewrite of MathJax from the ground up using modern techniques.” It’s the kind of announcement that inspires both excitement and foreboding. What will the new version add? What will it take away? What will it fix? What will it break?

Before committing all of bit-player to the new version, I thought I would try a small-scale experiment. I have a standalone web page that makes particularly tricky use of MathJax. The page is a repository of the Dotster programs extracted from a recent bit-player post, My God, it’s full of dots. In January I got the Dotster page running with MathJax 3.

Most math in web documents is static content: An equation needs to be formatted once, when the page is first displayed, and it never changes after that. The initial typesetting is handled automatically by MathJax, in both the old and the new versions. As soon as the page is downloaded from the server, MathJax makes a pass through the entire text, identifying elements flagged as TeX code and replacing them with typeset math. Once that job is done, MathJax can go to sleep.

The Dotster programs are a little different; they include equations that change dynamically in response to user input. Here’s an example:

The slider on the left sets a numerical value that gets plugged into the two equation on the right. Each time the slider is moved, the equations need to be updated and reformatted. Thus with each change to the slider setting, MathJax has to wake up from its slumbers and run again to typeset the altered content.

The MathJax program running in the little demo above is the older version, 2.7. Cosmetically, the result is not ideal. With each change in the slider value, the two equations contract a bit, as if pinched between somebody’s fingers, and then snap back to their original size. They seem to wink at us.

The winking effect is caused by a MathJax feature called Fast Preview. The system does a quick-and-dirty rendering of the math content without calculating the correct final sizes for the various typographic elements. (Evidently that calculation takes a little time). You can turn off Fast Preview by right-clicking or control-clicking one of the equations and then navigating through the submenus shown at right. However, you’ll probably judge the result to be worse rather than better. Without Fast Preview, you’ll get a glimpse of the raw TeX commands. Instead of winking, the equations do jumping jacks.

I am delighted to report that all of this visual noise has been eliminated in the new MathJax. On changing a slider setting, the equations are updated in place, with no unnecessary visual fuss. And there’s no need for a progress indication, because the change is so quick it appears to be instantaneous. See for yourself:

Thus version 3 looks like a big win. There’s a caveat: Getting it to work did not go quite as smoothly as I had hoped. Nevertheless, this is a story with a happy ending.

If you have only static math content in your documents, making the switch to MathJax 3 is easy. In your HTML file you change a URL to load the new MathJax version, and convert any configuration options to a new format. As it happens, all the default options work for me, so I had nothing to convert. What’s most important about the upgrade path is what you *don’t* need to do. In most cases you should not have to alter any of the TeX commands present in the HTML files being processed by MathJax. (There are a few small exceptions.)

With dynamic content, further steps are needed. Here is the JavaScript statement I used to reawaken the typesetting engine in MathJax version 2.7:

```
MathJax.Hub.Queue(["Typeset", MathJax.Hub, mathjax_demo_box]);
```

The statement enters a `Typeset`

command into a queue of pending tasks. When the command reaches the front of the queue, MathJax will typeset any math found inside the HTML element designated by the identifier `mathjax_demo_box`

, ignoring the rest of the document.

In MathJax 3, the documentation suggested I could simply replace this command with a slightly different and more direct one:

```
MathJax.typeset([mathjax_demo_box]);
```

I did that. It didn’t work. When I moved the slider, the displayed math reverted to raw TeX form, and I found an error message in the JavaScript console:

What has gone wrong here? JavaScript’s `appendChild`

method adds a new node to the treelike structure of an HTML document. It’s like hanging an ornament from some specified branch of a Christmas tree. The error reported here indicates that the specified branch does not exist; it is `null`

.

Let’s not tarry over my various false starts and wrong turns as I puzzled over the source of this bug. I eventually found the cause and the solution in the “issues” section of the MathJax repository on GitHub. Back in September of last year Mihai Borobocea had reported a similar problem, along with the interesting observation that the error occurs only when an existing TeX expression is being replaced in a document, not when a new expression is being added. Borobocea had also discovered that invoking the procedure `MathJax.typesetClear()`

before `MathJax.typeset()`

would prevent the error.

A comment by Cervone explains much of what’s going on:

You are correct that you should use

`MathJax.typesetClear()`

if you have removed previously typeset math from the page. (In version 3, there is information stored about the math in a list of typeset expressions, and if you remove typeset math from the page and replace it with new math, that list will hold pointers to math that no longer exists in the page. That is what is causing the error you are seeing . . . )

I found that adding `MathJax.typesetClear()`

did indeed eliminate the error. As a practical matter, that solved my problem. But Borobocea pointed out a remaining loose end. Whereas `MathJax.typeset([mathjax_demo_box])`

operates only on the math inside a specific container, `MathJax.typesetClear()`

destroys the list of math objects for the entire document, an act that might later have unwanted consequences. Thus it seemed best to reformat all the math in the document whenever any one expression changes. This is inefficient, but with the 20-some equations in the Dotster web page the typesetting is so fast there’s no perceptible delay.

In January a fix for this problem was merged into MathJax 3.0.1, which is now the shipping version. Cervone’s comment on this change says that it “prevents the error message,” which left me with the impression that it might suppress the message without curing the error itself. But as far as I can tell the entire issue has been cleared up. There’s no longer any need to invoke `MathJax.typesetClear()`

.

In my first experiments with version 3.0 I stumbled onto another bit of weirdness, but it turned out to be a quirk of my own code, not something amiss in MathJax.

I was seeing occasional size variations in typeset math that seemed reminiscent of the winking problem in version 2.7. Sometimes the initial, automatic typesetting would leave the equations in a slightly smaller size; they would grow back to normal as soon as `MathJax.typeset()`

was applied. In the image at right I have superimposed the two states, with the correct, larger image colored red. It looks like Fast Preview has come back to haunt us, but that can’t be right, because Fast Preview has been removed entirely from version 3.

My efforts to solve this mystery turned into quite a debugging debacle. I got a promising clue from an exchange on the MathJax wiki, discussing size anomalies when math is composed inside an HTML element temporarily flagged `display: none`

, a style rule that makes the math invisible. In that circumstance MathJax has no information about the surrounding text, and so it leaves the typeset math in a default state. The same mechanism might account for what I was seeing—except that my page has no elements with a `display: none`

style.

I first observed this problem in the Chrome browser, where it is intermittent; when I repeatedly reloaded the page, the small type would appear about one time out of five. What fun! It takes multiple trials just to know whether an attempted fix has had any effect. Thus I was pleased to discover that in Firefox the shrunken type appears consistently, every time the page is loaded. Testing became a great deal easier.

I soon found a cure, though not a diagnosis. While browsing again in the MathJax issues archive and in a MathJax user forum, I came across suggestions to try a different form of output, with mathematical expressions constructed not from text elements in HTML and style rules in CSS but from paths drawn in Scalable Vector Graphics, or SVG. I found that the SVG expressions were stable and consistent in size, and in other respects indistinguishable from their HTML siblings. Again my problem was solved, but I still wanted to know the underlying cause.

Here’s where the troubleshooting report gets a little embarrassing. Thinking I might have a new bug to report, I set out to build a minimal exemplar—the smallest and simplest program that would trigger the bug. I failed. I was starting from a blank page and adding more and more elements of the original program—`div`

s nested inside `div`

s in the HTML, various stylesheet rules in the CSS, bigger collections of more complex equations—but none of these additions produced the slightest glitch in typesetting. So I tried working in the other direction, starting with the complex misbehaving program and stripping away elements until the problem disappeared. But it didn’t disappear, even when I reduced the page to a single equation in a plain white box.

As often happens, I found the answer not by banging my head against the problem but by going for a walk. Out in the fresh air, I finally noticed the one oddity that distinguished the failing program from all of the correctly working ones. Because the Dotster program began life embedded in a WordPress blog post, I could not include a link to the CSS stylesheet in the `head`

section of the HTML file. Instead, a JavaScript function constructed the link and inserted it into the `head`

. That happened *after* MathJax made its initial pass over the text. At the time of typesetting, the elements in which the equations were placed had no styles applied, and so MathJax had no way of determining appropriate sizes.

When Don Knuth unveiled TeX, circa 1980, I was amazed. Back then, typewriter-style word processing was impressive enough. TeX did much more: real typesetting, with multiple fonts (which Knuth also had to create from scratch), automatic hyphenation and justification, and beautiful mathematics.

Thirty years later, when Cervone created MathJax, I was amazed again—though perhaps not for the right reasons. I had supposed that the major programming challenge would be capturing all the finicky rules and heuristics for building up math expressions—placing and sizing superscripts, adjusting the height and width of parentheses or a radical sign to match the dimensions of the expression enclosed, spacing and aligning the elements of a matrix. Those are indeed nontrivial tasks, but they are just the beginning. My recent adventures have helped me see that another major challenge is making TeX work in an alien environment.

In classic TeX, the module that typesets equations has direct access to everything it might ever need to know about the surrounding text—type sizes, line spacing, column width, the amount of interword “glue” needed to justify a line of type. Sharing this information is easy because all the formatting is done by the same program. MathJax faces a different situation. Formatting duties are split, with MathJax handling mathematical content but the browser’s layout engine doing everything else. Indeed, the document is written in two different languages, TeX for the math and HTML/CSS for the rest. Coordinating actions in the two realms is not straightforward.

There are other complications of importing TeX into a web page. The classic TeX system runs in batch mode. It takes some inputs, produces its output, and then quits. Batch processing would not offer a pleasant experience in a web browser. The entire user interface (such as the buttons and sliders in my Dotster programs) would be frozen for the duration. To avoid this kind of rudeness to the user, MathJax is never allowed to monopolize JavaScript’s single thread of execution for more than a fraction of a second. To ensure this cooperative behavior, earlier versions relied on a hand-built scheme of queues (where procedures wait their turn to execute) and callbacks (which signal when a task is complete). Version 3 takes advantage of a new JavaScript construct called a *promise*. When a procedure cannot compute a result immediately, it hands out a promise, which it then redeems when the result becomes available.

Wait, there’s more! MathJax is not just a TeX system. It also accepts input written in MathML, a dialect of XML specialized for mathematical notation. Indeed, the internal language of MathJax is based on MathML. And MathJax can also be configured to handle AsciiMath, a cute markup language that aims to make even the raw form of an expression readable. Think of it as math with emoticons: Type ``oo``

and you’ll get \(\infty\), or ``:-``

for \(\div\).

MathJax also provides an extensive suite of tools for accessibility. Visually impaired readers can have an equation read aloud. As I learned at the January Joint Math Meetings, there are even provisions for generating Braille output—but that’s a subject that deserves a post of its own.

When I first encountered MathJax, I saw it as a marvel, but I also considered it a workaround or stopgap. Reading a short document that includes a single equation entails downloading the entire MathJax program, which can be much larger than the document itself. And you need to download it all again for every other mathy document (unless your browser cache hangs onto a copy). What an appalling waste of bandwidth.

Several alternatives seemed more promising as a long-term solution. The best approach, it seemed to me then, was to have support for mathematical notation built into the browser. Modern browsers handle images, audio, video, SVG, animations—why not math? But it hasn’t happened. Firefox and Safari have limited support for MathML; none of the browsers I know are equipped to deal with TeX.

Another strategy that once seemed promising was the browser plugin. A plugin could offer the same capabilities as MathJax, but you would download and install it only once. This sounds like a good deal for readers, but it’s not so attractive for the author of web content. If there are multiple plugins in circulation, they are sure to have quirks, and you need to accommodate all of them. Furthermore, you need some sort of fallback plan for those who have not installed a plugin.

Still another option is to run MathJax on the server, rather than sending the whole program to the browser. The document arrives with TeX or MathML already converted to HTML/CSS or SVG for display. This is the preferred modus operandi for several large websites, most notably Wikipedia. I’ve considered it for bit-player, but it has a drawback: Running on the server, MathJax cannot provide the kind of on-demand typesetting seen in the demos above.

As the years go by, I am coming around to the view that MathJax is not just a useful stopgap while we wait for the right thing to come along; it’s quite a good approximation to the right thing. As the author of a web page, I get to write mathematics in a familiar and well-tested notation, and I can expect that any reader with an up-to-date browser will see output that’s much like what I see on my own screen. At the same time, the reader also has control over how the math is rendered, via the context menu. And the program offers accessibility features that I could never match on my own.

To top it off, the software is open-source—freely available to everyone. That is not just an economic advantage but also a social one. The project has a community that stands ready to fix bugs, listen to suggestions and complaints, offer help and advice. Without that resource, I would still be struggling with the hitches and hiccups described above.

]]>Wandering around in these cavernous spaces always leaves me feeling a little disoriented and dislocated. It’s not just that I’m lost, although often enough I am—searching for Lobby D, or Meeting Room 407, or a toilet. I’m also dumbfounded by the very existence of these huge empty boxes, monuments to the human urge to congregate. If you build it, we will come.

It seems every city needs such a place, commensurate with its civic stature or ambitions. It’s no mystery why the cities make the investment. The JMM attracted more than 5,500 mathematicians (plus a few interlopers like me). I would guess we each spent on the order of $1,000 in payments to hotels, restaurants, taxis, and such, and perhaps as much again on airfare and registration fees. The revenue flowing to the city and its businesses and citizens must be well above $5 million. Furthermore, from the city’s point of view it’s all free money; the visitors do not send their children to the local schools or add to the burden on other city services, and they don’t vote in Denver.

However, this calculation tells only half the story. Although visitors to the Colorado Convention Center leave wads of cash in Denver, at the same time Denver residents are flying off to meetings elsewhere, withdrawing funds from the local economy and spreading the money around in Phoenix, Seattle, or Boston. If the convention-going traffic is symmetrical, the exchange will come out even for everyone. So why don’t we all save ourselves a lot of bother—not to mention millions of dollars—and just stay home? From inside the convention center, you may not be able to tell what city you’re in anyway.

While I was in Denver, I looked at the schedule of upcoming events for the convention center. A boat show was getting underway even as the mathematicians were still roaming the corridors, and tickets were also on sale for some sort of motorcycling event. The drillers and frackers were coming to town a few weeks later, and then in March the American Physical Society would hold its biggest annual gathering, with about twice as many participants as the JMM. The APS meeting was scheduled for this week, Monday through Friday (March 2–6). But late last Saturday night the organizers decided to cancel the entire conference because of the coronavirus threat. Some attendees were already in Denver or on their way.

I was taken aback by this decision, which is not to say I believe it was wrong. A year from now, if the world is still recovering from an epidemic that killed many thousands, the decisionmakers at the APS will be seen as prescient, prudent, and public-spirited. On the other hand, if Covid-19 sputters out in a few weeks, they may well be mocked as alarmists who succumbed to panic. But the latter judgment would be a little unfair. After all, the virus might be halted precisely *because* those 11,000 physicists stayed home.

I have not yet heard of other large scientific conferences shutting down, but a number of meetings in the tech industry have been called off, postponed, or gone virtual, along with some sports and entertainment events. The American Chemical Society is “monitoring developments” in advance of their big annual meeting, scheduled for later this month in Philadelphia. [Update: On March 9 the ACS announced "we are cancelling (terminating) the ACS Spring 2020 National Meeting & Expo."] Even if the events go on, some prospective participants will not be able to attend. I’ve just received an email from Harvard with stern warnings and restrictions on university-related travel.

Presumably, the Covid-19 threat will run its course and dissipate, and life will return to something called normal. But it’s also possible we have a new normal, that we have crossed some sort of demographic or epidemiological threshold, and novel pathogens will be showing up more frequently. Furthermore, the biohazard is not the only reason to question the future of megameetings; the ecohazard may be even more compelling.

All in all, it seems an apt moment to reflect on the human urge to come together in these large, temporary encampments, where we share ideas, opinions, news, gossip—and perhaps viruses—before packing up and going home until next year. Can the custom be sustained? If not, what might replace it?

Mathematicians and physicists have not always formed roving hordes to plunder defenseless cities. Until the 20th century there weren’t enough of them to make a respectable motorcycle gang. Furthermore, they had no motorcycles, or any other way to travel long distances in a reasonable time.

Before the airplane and the railroad, meetings between scientists were generally one-on-one. Consider the sad story of Neils Henrik Abel, a young Norwegian mathematician in the 1820s. Feeling cut off from his European colleagues, he undertook a two-year-long trek from Oslo to Berlin and Paris, traveling almost entirely on foot. In Paris he visited Lagrange and Cauchy, who received him coolly and did not read his proof of the unsolvability of quintic equations. So Abel walked home again. Somewhere along the way he picked up a case of tuberculosis and died two years later, at age 27, impoverished and probably unaware that his work was finally beginning to be noticed. I like to think the outcome would have been happier if he’d been able to present his results in a contributed-paper session at the JMM.

For Abel, the take-a-hike model of scholarly communication proved ineffective; perhaps more important, it doesn’t scale well. If everyone must make individual *tête-à-tête* visits, then forming connections between \(n\) scientists would require \(n(n - 1) / 2\) trips. Having everyone converge at a central point reduces the number to \(n\). From this point of view, the modern mass meeting looks not like a travel extravagance but like a strategy for minimizing total air miles. Still, staying home would be even more frugal, whether the cost is measured in dollars, kelvins, or epidemiological risk.

Most of the big disciplinary conferences got their start toward the end of the 19th century, and by the 1930s and 40s had hundreds of participants. Writing about mathematical life in that era, Ralph Boas notes: “One reason for going to meetings was that photocopying hadn’t been invented; it was at meetings that one found out what was going on.” But now photocopying *has* been invented—and superseded. There’s no need for a cross-country trip to find out what’s new; on any weekday morning you can just check the arXiv. Yet attendance at these meetings is up by another order of magnitude.

Even in a world with faster channels of communication, there are still moments of high excitement in the big convention halls. At the 1987 March meeting of the APS, the recent discovery of high-temperature superconductivity in cuprate ceramics was presented and discussed in a lively session that lasted past 3 a.m. The event is known as the Woodstock of Physics. I missed it—as well as the original Woodstock. But I was at the JMM in 2014 when progress toward confirming the twin prime conjecture caused a big stir. The conjecture (still unproved) says there are infinitely many pairs of prime numbers, such as 11 and 13, separated by exactly 2. Yiting Zhang had just proved there are infinitely many primes separated by no more than 70 million. Several talks discussed this finding and followup work by others, and Zhang himself spoke to a packed room.

Boas emphasized the motive of *hearing* what’s new, but one must not ignore the equally important impulse to *tell* what’s new. At the recent JMM, with its 5,500 visitors, the book of abstracts listed 2,529 presentations. In other words, almost half the visitors came to *deliver* a talk, which is probably a stronger motivation than hearing what others have to say. (When I first saw those numbers, I had the thought: “So, on average every presentation had one speaker and one listener.” The truth is not quite as bad as that, but it’s still worth keeping in mind that a meeting of this kind is not like a rock concert or a football game, with only a dozen or so performers and thousands in the audience.)

At some gatherings, the aim is not so much to talk about math and science but to *do* it. Groups of three or four huddle around blackboards or whiteboards, collaborating. But this activity is commoner at small, narrowly focused meetings—maybe at Aspen for the physicists or Banff for the mathematicians. No doubt such things also happen at the bigger meetings, but they are not a major item on the agenda for most attendees.

For one subpopulation of meeting-goers the main motivation is very practical: getting a job. Again this is a matter of efficiency. Someone looking for a postdoc position can arrange a dozen interviews at a single meeting.

There are many reasons to make the pilgrimage to the Colorado Convention Center, but I think the most important factor is yet to be stated. Dennis Flanagan, who was my employer, friend, and mentor many years ago at *Scientific American*, wrote that “science is intensely social.”

In an active scientific discipline everyone knows everyone else, if not in person, then by their writings and reputation. Scientists attend at least as many meetings and conventions as salesmen. Flanagan’s Version, 1988, p. 15.

You might interpret this comment as saying that scientists—like salesmen—are a bunch of genial, gregarious party animals who like to go out on the town, drink to excess, and misbehave. But I’m pretty sure that’s not what Dennis had in mind. He was arguing that social interactions are essential to the *process* of science. Becoming a mathematician or a physicist is tantamount to joining a club, and you can’t do that in isolation. You have to absorb the customs, the tastes, the values of the culture. For example, you need to internalize the community standard for deciding what is true. (It’s rather different in physics and mathematics.) Even subtler is the standard for deciding what is *interesting*—what ideas are worth pursuing, what problems are worth solving.

Meetings and conferences are not the only way of inculcating culture; the apprenticeship system known as graduate school is clearly more imporant overall. Still, discipline-wide gatherings have a role. By their very nature they are more cosmopolitan than any one university department. They acquaint you with the norms of the population but also with the range of variance, and thereby improve the probability that you’ll figure out where you fit in.

The quintessential big-meeting event is running into someone in the hallway whom you see only once a year. You stop and shake hands, or even hug. (In future we’ll bump elbows.) You’re both in a hurry. If you chat too long, you’ll miss the opening sentences of the next talk, which may be the only sentences you’ll understand. So the exchange of words is brief and unlikely to be deep. As I and my cohort grow older, it often amounts to little more than, “Wow. I’m still alive and so are you!” But sometimes it’s worth traveling a thousand miles to get that human validation.

If we have to dispense with such gatherings, science and math will muddle through somehow. We’ll meet more in the sanitary realm of bits and pixels, less in this fraught environment of atoms. We’ll become more hierarchical, with greater emphasis on local meetings and less on national and international ones. The alternatives can be made to work, and the next generation will view them as perfectly natural, if not inevitable. But I’m going to miss the ugly carpet, the uncomfortable folding/stacking chairs, and the ballrooms where nobody dances.

]]>In mathematics abstraction serves as a kind of stairway to heaven—as well as a test of stamina for those who want to get there.

Some years later you reach higher ground. The symbols representing particular numbers give way to the \(x\)s and \(y\)s that stand for quantities yet to be determined. They are symbols for symbols. Later still you come to realize that this algebra business is not just about “solving for \(x\),” for finding a specific number that corresponds to a specific letter. It’s a magical device that allows you to make blanket statements encompassing *all* numbers: \(x^2 - 1 = (x + 1)(x - 1)\) is true for any value of \(x\).

Continuing onward and upward, you learn to manipulate symbolic expressions in various other ways, such as differentiating and integrating them, or constructing functions of functions of functions. Keep climbing the stairs and eventually you’ll be introduced to areas of mathematics that openly boast of their abstractness. There’s *abstract algebra*, where you build your own collections of numberlike things: groups, fields, rings, vector spaces. *category theory*, where you’ll find a collection of ideas with the disarming label *abstract nonsense*.

Not everyone is filled with admiration for this Jenga tower of abstractions teetering atop more abstractions. Consider Andrew Wiles’s proof of Fermat’s last theorem, and its reception by the public. The theorem, first stated by Pierre de Fermat in the 1630s, makes a simple claim about powers of integers: If \(x, y, z, n\) are all integers greater than \(0\), then \(x^n + y^n = z^n\) has solutions only if \(n \le 2\). The proof of this claim, published in the 1990s, is not nearly so simple. Wiles (with contributions from Richard Taylor) went on a scavenger hunt through much of modern mathematics, collecting a truckload of tools and spare parts needed to make the proof work: elliptic curves, modular forms, Galois groups, functions on the complex plane, *L*-series. It is truly a *tour de force*.

*E* with certain properties. But the properties deduced on the left and right branches of the diagram turn out to be inconsistent, implying that *E* does not exist, nor does the counterexample that gave rise to it.

Is all that heavy machinery really needed to prove such an innocent-looking statement? Many people yearn for a simpler and more direct proof, ideally based on methods that would have been available to Fermat himself. *Parade* columnist, takes an even more extreme position, arguing that Wiles strayed so far from the subject matter of the theorem as to make his proof invalid. (For a critique of her critique, see Boston and Granville.)

Almost all of this grumbling about illegimate methods and excess complexity comes from outside the community of research mathematicians. Insiders see the Wiles proof differently. For them, the wide-ranging nature of the proof is actually what’s most important. The main accomplishment, in this view, was cementing a connection between those far-flung areas of mathematics; resolving FLT was just a bonus.

Yet even mathematicians can have misgivings about the intricacy of mathematical arguments and the ever-taller skyscrapers of abstraction. Jeremy Gray, a historian of mathematics, believes anxiety over abstraction was already rising in the 19th century, when mathematics seemed to be “moving away from reality, into worlds of arbitrary dimension, for example, and into the habit of supplanting intuitive concepts (curves that touch, neighboring points, velocity) with an opaque language of mathematical analysis that bought rigor at a high cost in intelligibility.”

*MAA Focus* by Adriana Salerno. The thesis was to be published in book form last fall by Birkhäuser, but the book doesn’t seem to be available yet.

I like to imagine abstraction (abstractly ha ha ha) as pulling the strings on a marionette. The marionette, being “real life,” is easily accessible. Everyone understands the marionette whether it’s walking or dancing or fighting. We can see it and it makes sense. But watch instead the hands of the puppeteers. Can you look at the hand movements of the puppeteers and know what the marionette is doing?… Imagine it gets worse. Much, much worse. Imagine that the marionettes we see are controlled by marionettoids we don’t see which are in turn controlled by pre-puppeteers which are finally controlled by actual puppeteers.

Keep all those puppetoids in mind. I’ll be coming back to them, but first I want to shift my attention to computer science, where the towers of abstraction are just as tall and teetery, but somehow less scary.

Suppose your computer is about to add two numbers…. No, wait, there’s no need to suppose or imagine. In the orange panel below, type some numbers into the \(a\) and \(b\) boxes, then press the “+” button to get the sum in box \(c\). Now, please describe what’s happening inside the machine as that computation is performed.

a

b

c

You can probably guess that somewhere behind the curtains there’s a fragment of code that looks like `c = a + b`

. And, indeed, that statement appears verbatim in the JavaScript program that’s triggered when you click on the plus button. But if you were to go poking around among the circuit boards under the keyboard of your laptop, you wouldn’t find anything resembling that sequence of symbols. The program statement is a high-level abstraction. If you really want to know what’s going on inside the computing engine, you need to dig deeper—down to something as tangible as a jelly bean.

How about an electron? `c = a + b`

by tracing the motions of all the electrons (perhaps \(10^{23}\) of them) through all the transistors (perhaps \(10^{11}\)).

To understand how electrons are persuaded to do arithmetic for us, we need to introduce a whole sequence of abstractions.

- First, step back from the focus on individual electrons, and reformulate the problem in terms of continuous quantities: voltage, current, capacitance, inductance.
- Replace the physical transistors, in which voltages and currents change smoothly, with idealized devices that instantly switch from totally off to fully on.
- Interpret the two states of a transistor as logical values (
*true*and*false*) or as numerical values (\(1\) and \(0\)). - Organize groups of transistors into “gates” that carry out basic functions of Boolean logic, such as and, or, and not.
- Assemble the gates into larger functional units, including adders, multipliers, comparators, and other components for doing base-\(2\) arithmetic.
- Build higher-level modules that allow the adders and such to be operated under the control of a program. This is the conceptual level of the instruction-set architecture, defining the basic operation codes (
*add, shift, jump*, etc.) recognized by the computer hardware. - Graduating from hardware to software, design an operating system, a collection of services and interfaces for abstract objects such as files, input and output channels, and concurrent processes.
- Create a compiler or interpreter that knows how to translate programming language statements such as
`c = a + b`

into sequences of machine instructions and operating-system requests.

From the point of view of most programmers, the abstractions listed above represent computational *infrastructure*: They lie beneath the level where you do most of your thinking—the level where you describe the algorithms and data structures that solve your problem. But computational abstractions are also a tool for building *superstructure*, for creating new functions beyond what the operating system and the programming language provide. For example, if your programming language handles only numbers drawn from the real number line, you can write procedures for doing arithmetic with complex numbers, such as \(3 + 5i\). (Go ahead, try it in the orange box above.) And, in analogy with the mathematical practice of defining functions of functions, we can build compiler compilers and schemes for metaprogramming—programs that act on other programs.

In both mathematics and computation, rising through the various levels of abstraction gives you a more elevated view of the landscape, with wider scope but less detail. Even if the process is essentially the same in the two fields, however, it doesn’t feel that way, at least to me. In mathematics, abstraction can be a source of anxiety; in computing, it is nothing to be afraid of. In math, you must take care not to tangle the puppet strings; in computing, abstractions are a defense against such confusion. For the mathematician, abstraction is an intellectual challenge; for the programmer, it is an aid to clear thinking.

Why the difference? How can abstraction have such a friendly face in computation and such a stern mien in math? One possible answer is that computation is just plain easier than mathematics.

Another possible explanation is that computer systems are engineered artifacts; we can build them to our own specifications. If a concept is just too hairy for the human mind to master, we can break it down into simpler pieces. Math is not so complaisant—not even for those who hold that mathematical objects are invented rather than discovered. We can’t just design number theory so that the Riemann hypothesis will be true.

But I think the crucial distinction between math abstractions and computer abstractions lies elsewhere. It’s not in the abstractions themselves but in the boundaries between them.

*abstraction barrier* in Abelson and Sussman’s Structure and Interpretation of Computer Programs, circa 1986. The underlying idea is surely older; it’s implicit in the “structured programming” literature of the 1960s and 70s. But *SICP* still offers the clearest and most compelling introduction.*information hiding* is considered a virtue, not an impeachable offense. If a design has a layered structure, with abstractions piled one atop the other, the layers are separated by *abstraction barriers*. A high-level module can reach across the barrier to make use of procedures from lower levels, but it won’t know anything about the implementation of those procedures. When you are writing programs in Lisp or Python, you shouldn’t need to think about how the operating system carries out its chores; and when you’re writing routines for the operating system, you needn’t think about the physics of electrons meandering through the crystal lattice of a semiconductor. Each level of the hierarchy can be treated (almost) independently.

Mathematics also has its abstraction barriers, although I’ve never actually heard the term used by mathematicians. A notable example comes from Giuseppe Peano’s formulation of the foundations of arithmetic, circa 1900. Peano posits the existence of a number \(0\), and a function called *successor*, \(S(n)\), which takes a number \(n\) and returns the next number in the counting sequence. Thus the natural numbers begin \(0, S(0), S(S(0)), S(S(S(0)))\), and so on. Peano deliberately refrains from saying anything more about what these numbers look like or how they work. They might be implemented as sets, with \(0\) being the empty set and successor the operation of adjoining an element to a set. Or they could be unary lists: (), (|), (||), (|||), . . . The most direct approach is to use Church numerals, in which the successor function itself serves as a counting token, and the number \(n\) is represented by \(n\) nested applications of \(S\).

From these minimalist axioms we can define the rest of arithmetic, starting with addition. In calculating \(a + b\), if \(b\) happens to be \(0\), the problem is solved: \(a + 0 = a\). If \(b\) is *not* \(0\), then it must be the successor of some number, which we can call \(c\). Then \(a + S(c) = S(a + c)\). Notice that this definition doesn’t depend in any way on how the number \(0\) and the successor function are represented or implemented. Under the hood, we might be working with sets or lists or abacus beads; it makes no difference. An abstraction barrier separates the levels. From addition you can go on to define multiplication, and then exponentiation, and again abstraction barriers protect you from the lower-level details. There’s never any need to think about how the successor function works, just as the computer programmer doesn’t think about the flow of electrons.

The importance of not thinking was stated eloquently by Alfred North Whitehead, more than a century ago:

Alfred North Whitehead, It is a profoundly erroneous truism, repeated by all copybooks and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilisation advances by extending the number of important operations which we can perform without thinking about them. Operations of thought are like cavalry charges in a battle—they are strictly limited in number, they require fresh horses, and must only be made at decisive moments.An Introduction of Mathematics, 1911, pp. 45–46.

If all of mathematics were like the Peano axioms, we would have a watertight structure, compartmentalized by lots of leakproof abstraction barriers. And abstraction would probably not be considered “the hardest part about math.” But, of course, Peano described only the tiniest corner of mathematics. We also have the puppet strings.

In Piper Harron’s unsettling vision, the puppeteers high above the stage pull strings that control the pre-puppeteers, who in turn operate the marionettoids, who animate the marionettes. Each of these agents can be taken as representing a level of abstraction. The problem is, we want to follow the action at both the top and the bottom of the hierarchy, and possibly at the middle levels as well. The commands coming down from the puppeteers on high embody the abstract ideas that are needed to build theorems and proofs, but the propositions to be proved lie at the level of the marionettes. There’s no separating these levels; the puppet strings tie them together.

In the case of Fermat’s Last Theorem, you might choose to view the Wiles proof as nothing more than an elevated statement about elliptic curves and modular forms, but the proof is famous for something else—for what it tells us about the elementary equation \(x^n + y^n = z^n\). Thus the master puppeteers work at the level of algebraic geometry, but our eyes are on the dancing marionettes of simple number theory. What I’m suggesting, in other words, is that abstraction barriers in mathematics sometimes fail because events on both sides of the barrier make simultaneous claims on our interest.

In computer science, the programmer can ignore the trajectories of the electrons because those details really are of no consequence. Indeed, the electronic guts of the computing machinery could be ripped out and replaced by fluidic devices or fiber optics or hamsters in exercise wheels, and that brain transplant would have no effect on the outcome of the computation. Few areas of mathematics can be so cleanly floated away and rebuilt on a new foundation.

Can this notion of leaky abstraction barriers actually explain why higher mathematics looks so intimidating to most of the human population? It’s surely not the whole story, but maybe it has a role.

In closing I would like to point out an analogy with a few other areas of science, where problems that cross abstraction barriers seem to be particularly difficult. Physics, for example, deals with a vast range of spatial scales. At one end of the spectrum are the quarks and leptons, which rattle around comfortably inside a particle with a radius of \(10^{-15}\) meter; at the other end are galaxy clusters spanning \(10^{24}\) meters. In most cases, effective abstraction barriers separate these levels. When you’re studying celestial mechanics, you don’t have to think about the atomic composition of the planets. Conversely, if you are looking at the interactions of elementary particles, you are allowed to assume they will behave the same way anywhere in the universe. But there are a few areas where the barriers break down. For example, near a critical point where liquid and gas phases merge into an undifferentiated fluid, forces at all scales from molecular to macroscopic become equally important. Turbulent flow is similar, with whirls upon whirls upon whirls. It’s not a coincidence that critical phenomena and turbulence are notoriously difficult to describe.

Biology also covers a wide swath of territory, from molecules and single cells to whole organisms and ecosystems on a planetary scale. Again, abstraction barriers usually allow the biologist to focus on one realm at a time. To understand a predator-prey system you don’t need to know about the structure of cytochrome *c*. But the barriers don’t always hold. Evolution spans all these levels. It depends on molecular events (mutations in DNA), and determines the shape and fate of the entire tree of life. We can’t fully grasp what’s going on in the biosphere without keeping all these levels in mind at once.