In psychology and literature, this kind of mental rambling is called *stream of consciousness*, a metaphor we owe to William James. It’s not the metaphor I would have chosen. My own consciousness, as I experience it, does not flow smoothly from one topic to the next but seems to flit across a landscape of ideas, more like a butterfly than a river, sometimes alighting daintily on one flower and then the next, sometimes carried away by gusts of wind, sometimes revisiting favorite spots over and over.

As a way of probing the architecture of my own memory, I have tried a more deliberate experiment in free association. I began with the same herbal recipe—parsley, sage, rosemary, and thyme—but for this exercise I wasn’t strolling through the garden spots of the Berkeley hills; I was sitting at a desk taking notes. The diagram below is my best effort at reconstructing the complete train of thought.

Scrolling through the chart from top to bottom reveals the items in the order my brain presented them to me, but the linkages between nodes do not form a single linear sequence. Instead the structure is treelike, with short chains of sequential associations ending with an abrupt return to an earlier node, as if I were being snapped back by a rubber band. These interruptions are marked in the diagram by green upward arrows; the red “X” at the bottom is where I decided to end the experiment.

My apologies to the half of humanity born since 1990, who will doubtless find many of the items mentioned in the diagram antiquated or obscure. You can hover over the labels for pop-up explanations, although I doubt they will make the associations any more meaningful. Memories, after all, are personal; they live inside your head. If you want a collection of ideas that resonate with your own experience, you’ll just have to create your own free-association diagram. I highly recommend it: You may discover something you didn’t know you knew.

The destination of my daily walk down the hill in Berkeley is the Simons Institute for the Theory of Computing, where I am immersed in a semester-long program on the Brain and Computation. It’s an environment that inspires thoughts about thoughts. I begin to wonder: What would it take to build a computational model of the free-association process? Among the various challenges proposed for artificial intelligence, this one looks easy. There’s no need for deep ratiocination; what we are asked to simulate is just woolgathering or daydreaming—what the mind does when it’s out of gear and the engine is idling. It ought to be effortless, no?

For the design of such a computational model, the first idea that comes to mind (at least to *my* mind) is a random walk on a mathematical graph, or network. The nodes of the network are things stored in memory (ideas, facts, events), and the links are various kinds of associations between them. For example, a node labeled *butterfly* might have links to *moth, caterpillar, monarch,* and *fritillary,* as well as the translations mentioned in the diagram above, and perhaps some less obvious connections, such as *Australian crawl, shrimp, Muhammad Ali, pellagra, throttle valve,* and *stage fright*. The data structure for a node of the network would include a list of pointers to all of these associated nodes. The pointers could be numbered from \(1\) to \(n\); the program would generate a pseudorandom number in this range, and jump to the corresponding node, where the whole procedure would start afresh.
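Here is a minimal sketch of that uniform random walk. The tiny network is invented for illustration (a handful of the butterfly links named above, closed up so that every node has a way out), and the names `links` and `random_walk` are my own assumptions, not part of any existing model.

```julia
# Hypothetical toy network: every idea maps to the list of ideas it links to.
links = Dict(
    "butterfly"    => ["moth", "caterpillar", "monarch", "Muhammad Ali", "stage fright"],
    "moth"         => ["butterfly", "caterpillar"],
    "caterpillar"  => ["butterfly", "moth"],
    "monarch"      => ["butterfly"],
    "Muhammad Ali" => ["butterfly"],
    "stage fright" => ["butterfly"],
)

# Take `nsteps` uniform random steps through the network, starting from `start`.
function random_walk(links, start, nsteps)
    path = [start]
    node = start
    for _ in 1:nsteps
        neighbors = links[node]
        # Pointers are numbered 1..n; pick one of them with equal probability.
        node = neighbors[rand(1:length(neighbors))]
        push!(path, node)
    end
    return path
end

println(join(random_walk(links, "butterfly", 8), " → "))
```

Every run prints a different ramble, but each step always follows one of the listed links.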

This algorithm captures a few basic features of free association, but it also misses quite a lot. The model assumes that all destination nodes are equally likely, which is implausible. To accommodate differences in probability, we could give each link \(i\) a weight \(w_i\), then make the probabilities proportional to the weights.
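Spelled out, and assuming all the weights are nonnegative with at least one of them positive, the probability of following link \(i\) out of a node with \(n\) links would be

\[ p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}, \qquad i = 1, \dots, n. \]

With two links of weight \(3\) and \(1\), for example, the probabilities come out to \(3/4\) and \(1/4\).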

A further complication is that the weights depend on context—on one’s recent history of mental activity. If it weren’t for the combination of Mrs. Robinson and Jackie Robinson, would I have thought of Joe DiMaggio? And now, as I write this, Joltin’ Joe brings to mind Marilyn Monroe, and then Arthur Miller, and I am helpless to stop another whole train of associations. Reproducing this effect in a computer model would require some mechanism for dynamically adjusting the probabilities of entire categories of nodes, depending on which other nodes have been visited lately.

Recency effects of another kind should also be taken into account. The rubber band that repeatedly yanks me back to Simon and Garfunkel and Mrs. Robinson needs to have a place in the model. Perhaps each recently visited node should be added to the list of candidate destinations even if it is not otherwise linked to the current node. On the other hand, habituation is also a possibility: Ideas revisited too often become tiresome, and so they need to be suppressed in the model.

One final challenge: Some memories are not isolated facts or ideas but parts of a story. They have a narrative structure, with events unfolding in chronological order. Nodes for such episodic memories require a *next* link, and maybe a *previous* link, too. That chain of links holds your whole life together, to the extent you remember it.

Could a computational model like this one reproduce my mental meanderings? Gathering data for the model would be quite a chore, but that’s no surprise, since it has taken me a lifetime to fill my cranium with that jumble of herbs, Herbs, Simons, Robinsons, and Hoffmans. More worrisome than the volume of data is the fiddly nature of the graph-walking algorithm. It’s easy to say, “Pick a node according to a set of weighted probabilities,” but when I look at the gory details of how it’s done, I have a hard time imagining anything like that happening in the brain.

Here’s the simplest algorithm I know for random weighted selection.

In code—specifically in the Julia programming language—the node selection procedure looks like this:

```
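function select_next(links, weights)
    # Cumulative weights, normalized so that the final entry is (nominally) 1.
    total = sum(weights)
    cum_weights = cumsum(weights)
    probabilities = cum_weights / total
    # Draw x uniformly from [0, 1) and return the index of the first
    # cumulative probability that reaches it.
    x = rand()
    for i in 1:length(probabilities)
        if probabilities[i] >= x
            return i
        end
    end
    # Guard against round-off leaving the last entry a hair below x.
    return length(probabilities)
end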
```

I have slogged through these tedious details of cumulative sums and pseudorandom numbers as a way of emphasizing that the graph-walking algorithm is not as simple as it seems on first glance. And we still haven’t dealt with the matter of adjusting the probabilities on the fly, as attention drifts from topic to topic.

Even harder to fathom is the process of learning—adding new nodes and links to the network. I ended my session of free associating when I came to a question I couldn’t answer: “What’s the Russian for butterfly?” But I *can* answer it now. The next time I play this game, I’ll add *babochka* to my list of butterfly terms. In the computational model, inserting a node for *babochka* is easy enough, but the new node also needs to be linked to all the other butterfly nodes already present. Furthermore, *babochka* would introduce additional links of its own. It’s phonetically close to *babushka* (grandmother), one of the few Russian words in my vocabulary. The *-ochka* suffix is a diminutive, so it needs to be associated with French *-ette* and Italian *-ini*. The literal meaning of *babochka* is “little soul,” which suggests still more associations. Ultimately, learning a single new word might require a full reindexing of an entire tree of knowledge.

Let’s try a different model. Forget about the random walk on a network, with its spaghetti tangle of pointers to nodes. Instead, let’s just try to keep all similar things in the same neighborhood. In the memory banks of a digital computer, that means similar things have to be stored at nearby addresses. Here’s a hypothetical segment of memory centered on the concept *dog*. The nearby slots are occupied by other words, things, and categories that are likely to be evoked by the thought of *dog*: the obvious *cat* and *puppy*, various breeds of dogs and a few individual dogs (Skippy was the family pet when I was a kid), and some quirkier possibilities. Each item has a numeric address. The address has no intrinsic meaning, but it’s important that all the memory cells are numbered sequentially.

| address | content |
|---|---|
| 19216805 | god |
| 19216806 | the dog that didn’t bark in the night |
| 19216807 | Skippy |
| 19216808 | Lassie |
| 19216809 | canine |
| 19216810 | cat |
| 19216811 | dog |
| 19216812 | puppy |
| 19216813 | wolf |
| 19216814 | cave canem |
| 19216815 | Basset Hound |
| 19216816 | Weimaraner |
| 19216817 | dogmatic |

A program for idly exploring this memory array could be quite simple. It would execute a random walk over the memory addresses, but with a bias in favor of small steps. For example, the next address to be visited might be determined by sampling from a normal distribution centered on the present location. Here’s the Julia code. (The function `randn()` returns a random real number drawn from the normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\).)

```
function gaussian_ramble(addr, 𝜎)
    r = randn() * 𝜎
    return addr + round(Int, r)
end
```
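To see the ramble in action, here is a small sketch that walks over the thirteen *dog* slots listed in the table. The names `memory`, `base`, `bounded_ramble`, and `ramble` are assumptions of mine, and so is the decision to clamp each step at the ends of the array; a real memory would have neighbors on both sides.

```julia
# The thirteen slots from the table, in address order.
memory = ["god", "the dog that didn't bark in the night", "Skippy", "Lassie",
          "canine", "cat", "dog", "puppy", "wolf", "cave canem",
          "Basset Hound", "Weimaraner", "dogmatic"]
base = 19216805                        # address of the first slot

# One Gaussian step, kept inside the address range lo..hi.
function bounded_ramble(addr, σ, lo, hi)
    step = round(Int, randn() * σ)
    return clamp(addr + step, lo, hi)
end

# Take `nsteps` steps from `start` and report the contents visited along the way.
function ramble(start, σ, nsteps)
    lo, hi = base, base + length(memory) - 1
    addr = start
    visited = String[]
    for _ in 1:nsteps
        addr = bounded_ramble(addr, σ, lo, hi)
        push!(visited, memory[addr - base + 1])
    end
    return visited
end

println(join(ramble(19216811, 2.0, 10), ", "))   # start at "dog"
```

A small 𝜎 keeps the walk close to home; a larger one lets it jump across the whole neighborhood in a single step.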

The scheme has some attractive features. There’s no need to tabulate all the possible destinations as a preliminary to choosing one of them. Probabilities are not stored as numbers but are encoded by position within the array, and further modulated by the parameter 𝜎, which determines how far afield the procedure is willing to reach in the array. Although the program is still doing some arithmetic in order to sample from a normal distribution, that function could probably be implemented in a simpler way.

But the procedure also has a dreadful defect. In surrounding *dog* with all of its immediate associates, we leave no room for *their* associates. The doggy terms are fine in their own context, but what about the *cat* in the list? Where do we put *kitten* and *tiger* and *nine lives* and *Felix*? In a one-dimensional array there’s no hope of embedding every memory within its own proper neighborhood.

So let’s shift to two dimensions! By splitting the addresses into two components, we can set up two orthogonal axes. The first half of each address becomes a \(y\) coordinate and the second half an \(x\) coordinate. Now *dog* and *cat* are still close neighbors, but they also have private spaces where they can play with their own friends.
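As a sketch of the address splitting, suppose the addresses are eight-digit decimal numbers like those in the table, so that the first four digits give \(y\) and the last four give \(x\). The function names and the clamping at the edges of the grid are assumptions made for illustration.

```julia
# Assumed layout: first four decimal digits are y, last four are x.
split_address(addr) = divrem(addr, 10_000)     # returns (y, x)
join_address(y, x) = 10_000 * y + x

# One Gaussian step in each coordinate, clamped so it stays on the grid.
function gaussian_ramble_2d(addr, σ)
    y, x = split_address(addr)
    y = clamp(y + round(Int, randn() * σ), 0, 9_999)
    x = clamp(x + round(Int, randn() * σ), 0, 9_999)
    return join_address(y, x)
end

split_address(19216811)    # (1921, 6811): the grid position of "dog"
split_address(19216810)    # (1921, 6810): "cat", one step away along x
```

Under this layout the old one-dimensional neighbors sit side by side along the \(x\) axis, and the \(y\) axis is free for each item's other associates.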

However, two dimensions aren’t enough, either. If we try to fill in all the correlatives of *The Cat in the Hat*, they will inevitably collide and conflict with those of *the dog that didn’t bark in the night*. Evidently we need more dimensions—a lot more.

Now would be a good moment for me to acknowledge that I am not the first person ever to think about how memories could be organized in the brain. A list of my predecessors might start with Plato, who compared memory to an aviary; we recognize our memories by their plumage, but sometimes we have trouble retrieving them as they flutter about in the cranial cage. The 16th-century Jesuit Matteo Ricci wrote of a “memory palace,” where we stroll through various rooms and corridors in search of treasures from the past. Modern theories of memory tend to be less colorful than these but more detailed, aiming to move beyond metaphor to mechanism. My personal favorite is a mathematical model devised in the 1980s by Pentti Kanerva, who is now at the Redwood Center for Theoretical Neuroscience here in Berkeley. He calls the idea sparse distributed memory, which I’m going to abbreviate as SDM. It makes clever use of the peculiar geometry of high-dimensional spaces.

Think of a cube in three dimensions. If the side length is taken as one unit, then the eight vertices can be labeled by vectors of three binary digits, starting with \(000\) and continuing through \(111\). At any vertex, changing a single bit of the vector takes you to a nearest-neighbor vertex. Changing two bits moves you to a next-nearest-neighbor, and flipping all three bits leads to the opposite corner of the cube—the most distant vertex.

The four-dimensional cube works the same way, with \(16\) vertices labeled by vectors that include all patterns of binary digits from \(0000\) through \(1111\). And indeed the description generalizes to \(N\) dimensions, where each vertex has an \(N\)-bit vector of coordinates. If we measure distance by the Manhattan metric, always moving along the edges of the cube and never taking shortcuts across a diagonal, the distance between any two vertices is simply the number of positions where the two coordinate vectors differ (also known as the Hamming distance). Calculating this distance is a job for the binary exclusive OR operator (XOR), which yields \(1\) for a pair of bits that differ and \(0\) for a pair that are the same.

The usual symbol for XOR is ⊕, sometimes called a *bun*. It reflects the interpretation of the XOR operation as binary addition modulo 2. Kanerva prefers ∗ or ⊗, on the grounds that the role of XOR in high-dimensional computing is more like multiplication than addition. I have decided to duck this controversy by adopting the symbol ⊻, an alternative notation for XOR common among logicians. It’s a modification of ∨, the symbol for inclusive OR. Conveniently, it’s also the XOR symbol in Julia programs.

```
0 ⊻ 0 = 0
0 ⊻ 1 = 1
1 ⊻ 0 = 1
1 ⊻ 1 = 0
```

A Julia function for measuring the distance between vertices applies the XOR function to the two coordinate vectors and counts the \(1\)s in the result.

```
function distance(u, v)
    w = u ⊻ v
    return count_ones(w)
end
```
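A few sample calls may help. The `distance` function is restated here only so the block runs on its own; it works when the coordinate vectors are packed into integers. For vectors too long to fit in a machine integer, one possible representation (my assumption, not something the model prescribes) is Julia's `BitVector`, with the hypothetical helper `hamming` doing the same job.

```julia
distance(u, v) = count_ones(u ⊻ v)    # restated from above

# Corners of the three-dimensional cube, written as binary integer literals.
distance(0b000, 0b001)   # 1: nearest neighbors
distance(0b000, 0b011)   # 2: next-nearest neighbors
distance(0b000, 0b111)   # 3: opposite corners

# The same measurement for bit vectors of any length.
hamming(u::BitVector, v::BitVector) = count(u .⊻ v)

u = BitVector([0, 1, 1, 0, 1])
v = BitVector([1, 1, 0, 0, 1])
hamming(u, v)            # 2: the vectors differ in the first and third positions
```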

As \(N\) grows large, some curious properties of the \(N\)-cube come into view. Consider the \(1{,}000\)-dimensional cube, which has \(2^{1000}\) vertices. If you choose two of those vertices at random, what is the expected distance between them? Even though this is a question about distance, we can answer it without delving into any geometric details; it’s simply a matter of tallying the positions where the two binary vectors differ. For random vectors, each bit is \(0\) or \(1\) with equal probability, and so the vectors can be expected to differ at half of the bit positions. In the case of a \(1{,}000\)-bit vector, the typical distance is \(500\) bits. This outcome is not a great surprise. What *does* seem noteworthy is the way all the vertex-to-vertex distances cluster tightly around the mean value of 500.

For \(1{,}000\)-bit vectors, almost all randomly chosen pairs lie at a distance between \(450\) and \(550\) bits. In a sample of \(100\) million random pairs *(see graph above)* none were closer than \(400\) bits or farther apart than \(600\) bits. Nothing about our life in low-dimensional space prepares us for this condensation of probability in the middle distance. Here on Earth, you might be able to find a place to stand where you’re all alone, and almost everyone else is several thousand miles away; however, there’s no way to arrange the planet’s population so that *everyone* has this experience simultaneously. But that’s the situation in \(1{,}000\)-dimensional space.
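The clustering is easy to check on a smaller scale. Each bit position differs independently with probability \(1/2\), so the distance between two random vectors follows a binomial distribution with mean \(N/2 = 500\) and standard deviation \(\sqrt{N}/2 \approx 15.8\). The sketch below draws \(10{,}000\) random pairs, far fewer than the \(100\) million mentioned above; the function name is an assumption of mine.

```julia
using Random, Statistics

# Hamming distances between `npairs` pairs of random `nbits`-bit vectors.
random_pair_distances(nbits, npairs) =
    [count(bitrand(nbits) .⊻ bitrand(nbits)) for _ in 1:npairs]

d = random_pair_distances(1000, 10_000)
println("mean = ", mean(d), ", std = ", std(d), ", range = ", extrema(d))
```

The range from \(450\) to \(550\) is a little more than three standard deviations on either side of the mean, which is why so few pairs fall outside it.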

Needless to say, it’s hard to visualize a \(1{,}000\)-dimensional cube, but it’s possible to get a little intuition about the geometry from as few as five dimensions. Tabulated below are all the vertex coordinates of a five-dimensional unit cube, arranged according to their Hamming distance from the origin \(00000\). A majority of the vertices (20 out of 32) are at the middle distances of either two or three bits. The table would have the same shape if any other vertex were taken as the origin.
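The tabulation can be regenerated in a few lines. This sketch treats each vertex as an integer from \(0\) to \(31\), so that `count_ones` gives its Hamming distance from \(00000\); the name `shells` is my own.

```julia
# Group the 32 vertices of the five-dimensional cube by distance from 00000.
shells = [[string(v, base = 2, pad = 5) for v in 0:31 if count_ones(v) == d] for d in 0:5]

# Print one row per distance: the distance, the number of vertices, the vertices.
for (d, shell) in zip(0:5, shells)
    println(d, "  ", length(shell), "  ", join(shell, " "))
end
```

The row sizes are the binomial coefficients \(1, 5, 10, 10, 5, 1\), and the two middle rows account for the 20 vertices at distances two and three.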

A serious objection to all this talk of \(1{,}000\)-dimensional cubes is that we’ll never build one; there aren’t enough atoms in the universe for a structure with \(2^{1000}\) parts. But Kanerva points out that we need storage locations only for the items that we actually want to store. We could construct hardware for a random sample of, say, \(10^8\) vertices (each with a \(1{,}000\)-bit address) and leave the rest of the cube as a ghostly, unbuilt infrastructure. Kanerva calls the subset of vertices that exist in hardware *hard locations*. A set of \(10^8\) random hard locations would still exhibit the same squeezed distribution of distances as the full cube; indeed, this is precisely what the graph above shows.

The relative isolation of each vertex in the high-dimensional cube hints at one possible advantage of sparse distributed memory: A stored item has plenty of elbow room, and can spread out over a wide area without disturbing the neighbors. This is indeed one distinguishing feature of SDM, but there’s more to it.

Conventional computer memory enforces a one-to-one mapping between addresses and stored data items. The addresses are consecutive integers in a fixed range, such as \([0, 2^{64})\). Every integer in this range refers to a single, distinct location in the memory, and every location is associated with exactly one address. Also, each location holds just one value at a time; writing a new value wipes out the old one.

SDM breaks all of these rules. It has a huge address space—at least \(2^{1000}\)—but only a tiny, random fraction of those locations exist as physical entities; this is why the memory is said to be *sparse*. A given item of information is not stored in just one memory location; multiple copies are spread throughout a region—hence *distributed*. Furthermore, each individual address can hold multiple data items simultaneously. Thus information is both smeared out over a broad area and smushed together at the same site. The architecture also blurs the distinction between memory addresses and memory content; in many cases, the pattern of bits to be stored acts as its own address. Finally, the memory can respond to a partial or approximate address and find the correct item with high probability. Where the conventional memory is an “exact match machine,” SDM is a “best match machine,” retrieving the item most similar to the requested one.

In his 1988 book Kanerva gives a detailed quantitative analysis of a sparse distributed memory with \(1{,}000\) dimensions and \(1{,}000{,}000\) hard locations. The hard locations are chosen randomly from the full space of \(2^{1000}\) possible address vectors. Each hard location has room to store multiple \(1{,}000\)-bit vectors. The memory as a whole is designed to hold at least \(10{,}000\) distinct patterns. In what follows I’m going to consider this the canonical SDM model, although it is small by mammalian standards, and in his more recent work Kanerva has emphasized vectors with at least \(10{,}000\) dimensions.

Here’s how the memory works, in a simple computer implementation. The command `store(X)` writes the vector \(X\) into the memory, treating it as both address and content. The value \(X\) is stored in all the hard locations that lie within a certain distance of the address \(X\). For the canonical model this distance is 451 bits. It defines an “access circle” designed to encompass about \(1{,}000\) hard locations; in other words, each vector is stored in about \(1/1{,}000\)th of the million hard locations.
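The 451-bit radius can be checked with a short calculation. Because the hard locations are random, the share of them inside an access circle should match the share of *all* \(2^{1000}\) addresses lying within 451 bits of the center, which is a sum of binomial coefficients. The sketch below (the function name is an assumption) uses arbitrary-precision integers, since the numbers involved overflow ordinary machine arithmetic.

```julia
# Fraction of all `nbits`-bit addresses within `radius` bits of a given address:
# the sum of binomial(nbits, k) for k = 0..radius, divided by 2^nbits.
function circle_fraction(nbits, radius)
    inside = sum(binomial(big(nbits), k) for k in 0:radius)
    return Float64(inside / big(2)^nbits)
end

p = circle_fraction(1000, 451)
println(p)                 # roughly one in a thousand
println(p * 1_000_000)     # expected number of hard locations in the circle
```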

It’s important to note that the stored item \(X\) does not have to be chosen from among the \(1{,}000{,}000\) binary vectors that are addresses of hard locations. On the contrary, \(X\) can be any of the \(2^{1000}\) possible binary patterns.

Suppose a thousand copies of \(X\) have already been written into the SDM when a new item \(Y\) comes along, to be stored in its own set of a thousand hard locations. There might be some overlap between the two sets of locations—sites where both \(X\) and \(Y\) are stored. The later-arriving value does not overwrite or replace the earlier one; both values are retained. When the memory has been filled to its capacity of \(10{,}000\) vectors, each of them stored \(1{,}000\) times, a typical hard location will hold copies of \(10\) distinct patterns.

Now the question is: How can we make sense of this memory mélange? In particular, how can we retrieve the correct value of \(X\) without interference from \(Y\) and all the other items jumbled together in the same storage locations?

The readout algorithm makes essential use of the curious distance distribution in a high-dimensional space. Even if \(X\) and \(Y\) are nearest neighbors among the \(10{,}000\) stored patterns, they are likely to differ by 420 or 430 bits; as a result, the number of hard locations where both values are stored is quite small—typically four, five, or six. The same is true of all the other patterns overlapping \(X\). There are thousands of them, but no one interfering pattern is present in more than a handful of copies inside the access circle of \(X\).

The command `fetch(X)` should return the value that was earlier written by `store(X)`. The first step in reconstructing the value is to gather up all information stored within the 451-bit access circle centered on \(X\). Because \(X\) was previously written into all of these locations, we can be sure of getting back \(1{,}000\) copies of it. We’ll also receive about \(10{,}000\) copies of *other* vectors, stored in locations whose access circles overlap that of \(X\). But because the overlaps are small, each of these vectors is present in only a few copies. In the aggregate, then, each of their \(1{,}000\) bits is equally likely to be a \(0\) or a \(1\). If we apply a majority-rule function to all the data gathered at each bit position, the result will be dominated by the \(1{,}000\) copies of \(X\). The probability of getting any result other than \(X\) is about \(10^{-19}\).

The bitwise majority-rule procedure is shown in more detail below, for a toy example of five data vectors of 20 bits each. The output is another vector where each bit reflects the majority of the corresponding bits in the data vectors. (If the number of data vectors is even, ties are broken by choosing \(0\) or \(1\) at random.) An alternative writing-and-reading scheme, also illustrated below, forgoes storing all the patterns individually and instead keeps a tally of the number of \(0\) and \(1\) bits at each position. A hard location has \(1{,}000\) counters, one for each bit position, all initialized to \(0\). When a pattern is written into the location, each counter is incremented if the corresponding bit is a \(1\) or decremented if it is a \(0\). The readout algorithm simply examines the sign of each counter, returning \(1\) for positive, \(0\) for negative, and a random value when the counter is \(0\).

The two storage schemes give identical results.
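The equivalence is easy to see in code: a signed counter holds the number of \(1\)s minus the number of \(0\)s, so it is positive exactly when the \(1\)s are in the majority. Here is a sketch of both schemes for five random 20-bit vectors; all the function names are assumptions made for illustration.

```julia
using Random

# Scheme 1: keep every written vector; take a bitwise majority vote at readout.
function read_by_majority(stored)
    k = length(stored)
    ones_per_bit = sum(Int.(v) for v in stored)
    # A bit is 1 when more than half the votes are 1; ties are broken at random.
    return [2c > k || (2c == k && rand(Bool)) for c in ones_per_bit]
end

# Scheme 2: keep one signed counter per bit position: +1 for a 1, -1 for a 0.
function write_counters!(counters, v)
    counters .+= ifelse.(v, 1, -1)
    return counters
end

# Read each bit from the sign of its counter; a zero counter is a tie.
read_by_sign(counters) = [c > 0 || (c == 0 && rand(Bool)) for c in counters]

vectors = [bitrand(20) for _ in 1:5]     # five random 20-bit data vectors
counters = zeros(Int, 20)
for v in vectors
    write_counters!(counters, v)
end

# With an odd number of vectors there are no ties, so this is always true.
read_by_majority(vectors) == read_by_sign(counters)
```

The counter scheme needs the same amount of storage no matter how many patterns are written, which is its main attraction.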

From a computer-engineering point of view, this version of sparse distributed memory looks like an elaborately contrived joke. To remember \(10{,}000\) items we need a million hard locations, in which we store a thousand redundant copies of every pattern. Then, in order to retrieve just one item from memory, we harvest data on \(11{,}000\) stored patterns and apply a subtle majority-rule mechanism to unscramble them. And all we accomplish through these acrobatic maneuvers is to retrieve a vector we already had. Conventional memory works with much less fuss: Both writing and reading access a single location.

But an SDM can do things the conventional memory can’t. In particular, it can retrieve information based on a partial or approximate cue. Suppose a vector \(Z\) is a corrupted version of \(X\), where \(100\) of the \(1{,}000\) bits have been altered. Because the two vectors are similar, the command `fetch(Z)` will probe many of the same sites where \(X\) is stored. At a Hamming distance of 100, \(X\) and \(Z\) can be expected to share about 300 hard locations. Because of this extensive overlap, the vector returned by `fetch(Z)` (call it \(Z^{\prime}\)) will be closer to \(X\) than \(Z\) is. Now we can repeat the process with the command `fetch(Z′)`, which will return a result \(Z^{\prime\prime}\) even closer to \(X\). After only a few iterations the procedure reaches \(X\) itself.

Kanerva shows that this converging sequence of recursive read operations will succeed with near certainty as long as the starting pattern is not too far from the target. In other words, there is a critical radius: Any probe of the memory starting at a location inside the critical circle will almost surely converge to the center, and do so rather quickly. An attempt to recover the stored item from outside the critical circle fails, as the recursive recall process wanders away into the middle distance. Kanerva’s analysis yields a critical radius of 209 bits for the canonical SDM. In other words, if you know roughly 80 percent of the bits, you can reconstruct the whole pattern.
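To make the recursive recall concrete, here is a scaled-down sketch of the whole memory, using the counter scheme. Everything about it is an assumption for illustration: the parameters (256-bit vectors, 5,000 hard locations, a 112-bit access radius, 20 stored patterns) are far smaller than the canonical ones, and the names `ToySDM`, `sdm_store!`, `sdm_fetch`, `recall`, and `corrupt` are mine. (The operations are not called plain `store` and `fetch` because Julia already has a built-in `fetch`.)

```julia
using Random

struct ToySDM
    addresses::Vector{BitVector}   # addresses of the hard locations
    counters::Matrix{Int}          # one row of signed bit counters per hard location
    radius::Int                    # radius of the access circle, in bits
end

make_sdm(nbits, nlocs, radius, rng) =
    ToySDM([bitrand(rng, nbits) for _ in 1:nlocs], zeros(Int, nlocs, nbits), radius)

hamming(u, v) = count(u .⊻ v)

# Indices of the hard locations inside the access circle centered on x.
active(m, x) = findall(a -> hamming(a, x) <= m.radius, m.addresses)

# Write x, as both address and content, into every location in its access circle.
function sdm_store!(m, x)
    for i in active(m, x)
        m.counters[i, :] .+= ifelse.(x, 1, -1)
    end
    return m
end

# Pool the counters in the cue's access circle and read each bit by majority.
function sdm_fetch(m, cue, rng)
    totals = vec(sum(m.counters[active(m, cue), :], dims = 1))
    return BitVector([t > 0 || (t == 0 && rand(rng, Bool)) for t in totals])
end

# Feed each result back in as the next cue until the output stops changing.
function recall(m, cue, rng; maxiter = 10)
    z = cue
    for _ in 1:maxiter
        znext = sdm_fetch(m, z, rng)
        znext == z && return z
        z = znext
    end
    return z
end

# Flip k randomly chosen bits of x.
function corrupt(x, k, rng)
    z = copy(x)
    idx = randperm(rng, length(x))[1:k]
    z[idx] .= .!z[idx]
    return z
end

rng = MersenneTwister(1)
m = make_sdm(256, 5_000, 112, rng)
patterns = [bitrand(rng, 256) for _ in 1:20]
foreach(p -> sdm_store!(m, p), patterns)

x = patterns[1]
z = corrupt(x, 20, rng)
println("distance before recall: ", hamming(z, x))
println("distance after recall:  ", hamming(recall(m, z, rng), x))
```

With so few patterns stored, a cue 20 bits away from its target should be pulled back almost at once. Raising the number of flipped bits is a simple way to go looking for this toy memory's own critical radius, which need not resemble the canonical 209 bits.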

The illustration below traces the evolution of recursive-recall sequences using initial cues that differ from a target \(X\) by \(0, 5, 10, 15 \dots 1{,}000\) bits. In this experiment all sequences starting at a distance of \(205\) or less converged to \(X\) in fewer than \(10\) iterations *(blue trails)*. All sequences starting at a greater initial distance wandered aimlessly through the huge open spaces of the \(1{,}000\)-dimensional cube, staying roughly 500 bits from anywhere.

The transition from convergent to divergent trajectories is not perfectly sharp, as shown in the bad-hair-day graphic below. Here we have zoomed in to look at the fate of trajectories beginning at displacements of \(175, 176, 177, \dots 225\) bits. All trails whose starting point is within 209 bits of the target are colored blue; those starting at a greater distance are red. Most of the blue trajectories converge, quickly going to zero distance, and most of the red ones don’t. Near the critical distance, however, there are lots of exceptions.

The graph below offers yet another view of how initial distance from the target affects the likelihood of eventually converging on the correct memory address. At a distance of \(170\) bits almost all trials succeed; at \(240\) bits almost none do. The crossover point (where success and failure are equally likely) seems to lie at about \(203\) bits, a little lower than Kanerva’s result of \(209\).

The ability to reconstruct memories from partial information is a familiar element of human experience. You notice an actor in a television show, and you realize you’ve seen him before, but you don’t remember where. After a few minutes it comes to you: He’s Mr. Bates from *Downton Abbey*, but without his butler suit. Then there’s the high school reunion challenge: Looking at the stout, balding gentleman across the room, can you recognize the friend you last knew as a lanky teenager in track shorts? Sometimes, filling in the blanks requires a prolonged struggle. I have written before about my own inexplicable memory blind spot for the flowering vine wisteria, which I can name only after patiently working my way through a catalogue of false scents: hydrangea, verbena, forsythia.

Could our knack for recovering memories from incomplete or noisy inputs work something like the recursive recall process with high-dimensional vectors? It’s an attractive hypothesis, but there are also reasons for caution. For one thing, the brain seems to be able to tease meaning out of much skimpier clues. I don’t need to hear four-fifths of the Fifth Symphony before I recognize it; the first four notes will do. A flash of color moving through the trees instantly brings to mind the appropriate species—cardinal, bluejay, goldfinch. A mere whiff of chalkdust transports me back to the drowsy, overheated classroom where I doodled on the desktop all afternoon. These memories are evoked by a tiny fraction of the information they represent, far less than 80 percent.

Kanerva cites another quirk of human memory that might be modeled by an SDM: the tip-of-the-tongue phenomenon, whose essence is that you know you know something, even though you can’t immediately name it. This feeling is a bit mysterious: If you can’t find what you’re looking for, how do you know it’s there? The recursive recall process of the SDM offers a possible answer. When the successive patterns retrieved from memory are getting steadily closer together, you can be reasonably sure they will converge on a target, even before they get there.

In the struggle to retrieve a stubborn fact from memory, many people find that banging on the same door repeatedly is not a wise strategy. Rather than demanding immediate answers—getting bossy with your brain—it’s often better to set the problem aside, go for a walk, maybe even take a nap; the answer may then come to you, seemingly unbidden. Can this observation be explained by the SDM model? Perhaps, at least in part. If a sequence of recalled patterns is not converging, pursuing it further is probably fruitless. Starting over from a nearby point in the memory space might lead to a better outcome. But there’s a conundrum here: How do you find a new point of departure with better prospects? You might think you could just randomly flip a few bits in the input pattern in the hope that you’ll wind up closer to the target, but this is unlikely to work. If a vector is \(250\) bits from the target, then \(750\) bits are already correct (but you don’t know *which* \(750\) bits); any random change has a \(3/4\) chance of moving farther away rather than closer. To make progress you need to know which way to turn, and that’s a tricky question in \(1{,}000\)-dimensional space.

One aspect of the SDM architecture that seems to match human experience is the effect of repetition or rehearsal on memory. If you repeatedly recite a poem or practice playing a piece of music, you expect to remember it more easily in the future. A computational model of memory ought to exhibit the same training effect. Conventional computer memory certainly does not: There’s no benefit to writing the same value multiple times at the same address. With an SDM, in contrast, each repetition of a pattern adds another copy to all the hard locations within the pattern’s access circle. As a result, there’s less interference from overlapping patterns, and the critical radius for recall is enlarged. The effect is dramatic: When a single extra copy of a pattern is written into the memory, the critical radius grows from about \(200\) bits to more than \(300\).

By the same token, increasing the representation of one pattern can make others harder to recover. This is a form of forgetting, as the heavily imprinted pattern crowds out its neighbors and takes over part of their territory. This effect is also dramatic in the SDM—unrealistically so. A vector stored eight or ten times seems to monopolize most of the memory; it becomes an obsession, the answer to all questions.

A notable advantage of sparse distributed memory is its resilience in the face of hardware failures or errors. I would be unhappy with my own brain if the loss of a single neuron could leave a hole in my memory, so that I could no longer recognize the letter *g* or remember how to tie my shoelaces. SDM does not suffer from such fragility. With a thousand copies of every stored pattern, no one site is essential. Indeed, it’s possible to wipe out all information stored in \(60\) percent of the hard locations and still get perfect recall of \(10{,}000\) stored items, assuming you supply the exact address as the cue. With partial cues, the critical radius contracts as more sites are lost. After destroying \(60\) percent of the sites, the critical radius shrinks from \(200+\) bits to about \(150\) bits. With \(80\) percent of the sites gone, memory is seriously degraded but not extinguished.

And what about woolgathering? Can we traipse idly through the meadows of sparse distributed memory, serendipitously leaping from one stored pattern to the next? I’ll return to this question.

Most of the narrative above was written several weeks ago. At the time, I was reading about various competing theories of memory, and discussing their merits with my colleagues at the Simons Institute. I wrote up my thoughts on the subject, but I held off publishing because of nagging doubts about whether I truly understood the mathematics of sparse distributed memory. I’m glad I waited.

The Brain and Computation program ended in May. The participants have scattered; I am back in New England, where sage and rosemary are small potted plants rather than burgeoning shrubs spilling over the sidewalk. My morning strolls to the Berkeley campus, a daily occasion for musing about the nature of memory and learning, have themselves become “engrams” stored somewhere in my head (though I still don’t know where to look for them).

I have not given up the quest. Since I left Berkeley I’ve continued reading on theories of memory. I’ve also been writing programs to explore Pentti Kanerva’s sparse distributed memory and his broader ideas on “hyperdimensional computing.” Even if this project fails to reveal the secrets of human memory, it is certainly teaching me something about the mathematical and computational art of navigating high-dimensional spaces.

The diagram below represents the “right” way to implement SDM, as I understand it. The central element is a crossbar matrix in which the rows correspond to the memory’s hard locations and the columns carry signals representing the individual bits of an input vector. The canonical memory has a million rows, each with a randomly assigned \(1{,}000\)-bit address, and \(1{,}000\) columns; this toy version has 20 rows and 8 columns.

The process illustrated in the diagram is the storage of a single input vector in an otherwise empty memory. The eight input bits are compared simultaneously with all \(20\) hard-location addresses. Wherever an input bit and an address bit match (\(0\) with \(0\) or \(1\) with \(1\)), we place a dot at the intersection of the column and the row. Then we count the number of dots in each row, and if the count meets or exceeds a threshold, we write the input vector into the register associated with that row *(blue boxes)*. In the example shown, the threshold is \(5\), and \(8\) of the \(20\) addresses have at least \(5\) matches. In the \(1{,}000\)-bit memory, the threshold would be \(549\) matching bits (equivalently, a Hamming distance of at most \(451\)), and only about a thousandth of the registers would be selected.
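The selection step can be sketched in a few lines of Julia. This is only an illustration, not the program behind the diagram: the names (`addresses`, `input_bits`, `match_counts`) and the random seed are my own assumptions. Counting matching bits is the same as subtracting the Hamming distance from the number of columns.

```julia
using Random

# Toy crossbar: 20 hard locations, 8 address bits, threshold of 5 matches.
# The seed and all variable names are illustrative assumptions.
rng = MersenneTwister(1)
n_rows, n_cols, threshold = 20, 8, 5

addresses  = bitrand(rng, n_rows, n_cols)   # one random address per row
input_bits = bitrand(rng, n_cols)           # the vector to be stored

# A "dot" marks each position where the input bit equals the address bit.
dots = addresses .== permutedims(input_bits)   # 20×8 matrix of Bool
match_counts = vec(sum(dots, dims=2))          # number of dots in each row

# Rows whose count meets the threshold would receive a copy of the input.
selected = findall(>=(threshold), match_counts)
println("match counts:  ", match_counts)
println("selected rows: ", selected)
```

On a conventional processor the broadcast comparison is still carried out one bit at a time under the hood; only the notation is parallel.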

The magic in this design is that all of the bit comparisons—a billion of them in the canonical model—happen concurrently. As a result, the access time for both reading and writing is independent of the number of hard locations, and can be very fast. Circuitry of this general type, known as an associative memory or content-addressable memory, has a role in certain specialized computing applications, such as triggering the particle detectors at the Large Hadron Collider and steering packets through the routers of the internet backbone. And the circuit diagram might also be plausibly mapped onto certain structures in the brain. Kanerva points out that the cerebellum looks a lot like such a matrix. The rows are flat, fanlike Purkinje cells, arranged like the pages of a book; the columns are parallel fibers threaded through the whole population of Purkinje cells. (However, the cerebellum is not the region of the mammalian brain where cognitive memory is thought to reside.)

It would be wonderful to build an SDM simulation based on this crossbar design; unfortunately, I don’t know how to do that with any computer hardware I can lay my hands on. A conventional processor offers no way to compare all the input bits with all the hard-location bits simultaneously. Instead I have to scan through a million hard locations one by one, and at each location compare a thousand pairs of bits. That’s a billion bit comparisons for every item stored into or retrieved from the memory. Add to that the time needed to write or read a million bits (a thousand copies of a \(1{,}000\)-bit vector), and we’re talking about quite a lumbering process. Here’s the code for storing a vector:

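```julia
# Store vector v in every hard location whose address lies within Hamming
# distance r of v. SDM (the collection of hard locations), r (the access
# radius, 451 in the canonical model), hamming_distance, and
# write_to_register! are defined elsewhere in the program.
function store(v::BitVector)
    for loc in SDM
        if hamming_distance(v, loc.address) <= r
            write_to_register!(loc.register, v)
        end
    end
end
```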

This implementation needs almost an hour to stock the memory with \(10{,}000\) remembered patterns. (The complete program, in the form of a Jupyter notebook, is available on GitHub.)
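For readers who want something they can run without the notebook, here is a scaled-down, self-contained sketch of the same serial scan. It is not the program mentioned above. The sizes (`N_BITS`, `N_LOCS`, `RADIUS`), the `HardLocation` type, and the helper definitions are all assumptions of mine, and I am assuming that each register is a bank of up/down counters read out by majority vote.

```julia
using Random

# Scaled-down sketch: 256-bit vectors and 2,000 hard locations. These sizes,
# the radius, and every name below are illustrative assumptions, not the
# canonical model (1,000 bits, a million locations, r = 451).
const N_BITS = 256
const N_LOCS = 2_000
const RADIUS = 112

struct HardLocation
    address::BitVector
    register::Vector{Int}      # one signed counter per bit position
end

hamming_distance(u::BitVector, v::BitVector) = count(u .⊻ v)

# Increment the counter for each 1 bit of v, decrement it for each 0 bit.
function write_to_register!(register::Vector{Int}, v::BitVector)
    for i in eachindex(v)
        register[i] += v[i] ? 1 : -1
    end
end

# Serial scan: visit every hard location and write to those in the access circle.
function store!(memory::Vector{HardLocation}, v::BitVector)
    for loc in memory
        if hamming_distance(v, loc.address) <= RADIUS
            write_to_register!(loc.register, v)
        end
    end
end

# Sum the counters of all locations near the cue, then keep the positive totals.
function recall(memory::Vector{HardLocation}, cue::BitVector)
    totals = zeros(Int, N_BITS)
    for loc in memory
        if hamming_distance(cue, loc.address) <= RADIUS
            totals .+= loc.register
        end
    end
    return totals .> 0
end

rng = MersenneTwister(42)
memory = [HardLocation(bitrand(rng, N_BITS), zeros(Int, N_BITS)) for _ in 1:N_LOCS]
pattern = bitrand(rng, N_BITS)
store!(memory, pattern)
println("bits wrong after recall: ", hamming_distance(recall(memory, pattern), pattern))
```

Both `store!` and `recall` touch every hard location on every call, which is exactly the cost described above.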

Is there a better algorithm for simulating the SDM on conventional hardware? One possible strategy avoids repeatedly searching for the set of hard locations within the access circle of a given vector; instead, when the vector is first written into the memory, the program keeps a pointer to each of the thousand-or-so locations where it is stored. On any future reference to the same vector, the program can just follow the \(1{,}000\) saved pointers rather than scanning the entire array of a million hard locations. The cost of this caching scheme is the need to store all those pointers—\(10\) million of them for the canonical SDM. Doing so is feasible, and it might be worthwhile if you only wanted to store and retrieve exact, known values. But think about what happens in response to an approximate memory probe, with the recursive recall of \(Z^{\prime}\) and \(Z^{\prime\prime}\) and \(Z^{\prime\prime\prime}\), and so on. None of those intermediate values will be found in the cache, and so the full scan of all hard locations is still needed.
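A minimal sketch of the caching idea, under assumed names (`access_cache`, `locations_for`) and a scaled-down memory, might look like the following. It illustrates both the benefit and the limitation: a repeated exact query is answered from the cache, while a probe that differs by even one bit misses and triggers the full scan.

```julia
using Random

hamming_distance(u::BitVector, v::BitVector) = count(u .⊻ v)

# Hypothetical small memory; only the addresses matter for this sketch.
rng = MersenneTwister(7)
addresses = [bitrand(rng, 256) for _ in 1:2_000]
radius = 112

# Cache mapping a vector to the indices of the hard locations in its access circle.
access_cache = Dict{BitVector, Vector{Int}}()

function locations_for(v::BitVector)
    haskey(access_cache, v) && return access_cache[v]      # cache hit: no scan
    hits = findall(a -> hamming_distance(v, a) <= radius, addresses)
    access_cache[copy(v)] = hits                            # copy guards against later mutation of v
    return hits
end

v = bitrand(rng, 256)
first_lookup = locations_for(v)      # full scan, result saved
second_lookup = locations_for(v)     # answered from the cache

probe = copy(v)
probe[1] = !probe[1]                 # an approximate cue, one bit off
println("probe already cached? ", haskey(access_cache, probe))
```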

Perhaps there’s a cleverer shortcut. A recent review article on “Approximate Nearest Neighbor Search in High Dimensions,” by Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn, mentions an intriguing technique called locality-sensitive hashing, but I can’t quite see how to adapt it to the SDM problem.

The ability to reconstruct memories from partial cues is a tantalizingly lifelike trait in a computational model. Perhaps it might be extended to yield a plausible mechanism for wandering idly through the chambers of memory, letting one idea lead to the next.

At first I thought I knew how this might work. A pattern \(X\) stored in the SDM creates a basin of attraction around itself, where any recursive probe of the memory starting within a critical radius will converge to \(X\). Given \(10{,}000\) such attractors, I can imagine them partitioning the memory space into a matrix of separate compartments, like a high-dimensional foam of soap bubbles. The basin for each stored item occupies a distinct volume, surrounded on all sides by other basins and bumping up against them, with sharp boundaries between adjacent domains. In support of this notion, I would note that the average radius of a basin of attraction shrinks when more content is poured into the memory, as if the bubbles were being compressed by overcrowding.

This vision of what’s going on inside the SDM suggests a simple way to drift from one domain to the next: Randomly flip enough bits in a vector to take it outside the present basin of attraction and into an adjacent one, then apply the recursive recall algorithm. Repeating this procedure will generate a random walk through the set of topics stored in the memory.

The only trouble is, it doesn’t work. If you try it, you will indeed wander aimlessly in the \(1{,}000\)-dimensional lattice, but you will never find anything stored there. The entire plan is based on a faulty intuition about the geometry of the SDM. The stored vectors with their basins of attraction are *not* tightly packed like soap bubbles; on the contrary, they are isolated galaxies floating in a vast and vacant universe, with huge tracts of empty space between them. A few calculations show the true nature of the situation. In the canonical model the critical radius defining the basin of attraction is about \(200\). The volume of a single basin—measured as the number of vectors inside it—is

$$\sum_{k = 0}^{200} \binom{1000}{k},$$

which works out to roughly \(10^{216}\). Thus all \(10{,}000\) basins occupy a combined volume of about \(10^{220}\). That’s a big number, but it’s still a tiny fraction of the \(1{,}000\)-dimensional cube, which has \(2^{1000} \approx 10^{301}\) vertices. Among all the vertices of the cube, only about \(1\) out of \(10^{81}\) lies within \(200\) bits of a stored pattern. You could wander forever without stumbling into one of those basins.
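These magnitudes are easy to check with exact big-integer arithmetic. The snippet below is a sketch with variable names of my own choosing; it treats the basins as non-overlapping, so the combined volume is an upper bound.

```julia
# Count the hypercube vertices that lie within `radius` bits of a stored pattern.
# Variable names are illustrative assumptions.
n, radius, n_patterns = 1000, 200, 10_000

basin      = sum(binomial(big(n), k) for k in 0:radius)   # volume of one basin
all_basins = n_patterns * basin                           # upper bound: ignores any overlap
cube       = big(2)^n                                     # every vertex of the hypercube

# Report each quantity as a power of ten.
exponent(x) = round(Float64(log10(x)), digits=1)
println("one basin:       10^", exponent(basin))
println("all basins:      10^", exponent(all_basins))
println("whole cube:      10^", exponent(cube))
println("cube per basins: 10^", round(Float64(log10(cube) - log10(all_basins)), digits=1))
```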

(Forever? Oh, all right, maybe not forever. Because the hypercube is a finite structure, any path through it must eventually become recurrent, either hitting a fixed point from which it never escapes or falling into a repeating cycle. The stored vectors are fixed points, and there are also many other fixed points that don’t correspond to any meaningful pattern. For what it’s worth, in all my experiments with SDM programs, I have yet to run into a stored pattern “by accident.”)

Hoping to salvage this failed idea, I tried a few more experiments. In one case I deliberately stored a bunch of related concepts at nearby addresses (“nearby” meaning within 200 or 300 bits). Within this cluster, perhaps I could skip blithely from point to point. But in fact the entire cluster congealed into one big basin of attraction for the central pattern, which thus became a black hole swallowing up all its companions. I also tried fiddling with the value of \(r\), the radius of the access circle for all reading and writing operations. In the canonical model \(r = 451\). I thought that writing to a slightly smaller circle or reading from a slightly larger one might allow some wiggle room for randomness in the results, but this hope was also disappointed.

All of these efforts were based on a misunderstanding of high-dimensional vector spaces. Trying to find clusters of nearby values in the hypercube is hopeless; the stored patterns are sprinkled much too sparsely throughout the volume. And deliberately creating dense clusters is pointless, because it destroys the very property that makes the system interesting—the ability to converge on a stored item from any point in the surrounding basin of attraction. If we’re going to create a daydreaming algorithm for the SDM, it will have to work some other way.

In casting about for an alternative daydreaming mechanism, we might consider smuggling some graph theory into the world of sparse distributed memory. Then we could take a step back toward the original idea of mental rambling as a random walk on a graph or network. The key to building such graphs in the SDM turns out to be a familiar tool: the exclusive OR operator.

As discussed above, the Hamming distance between two vectors is calculated by taking their bitwise XOR and then counting the \(1\)s in the result. But the XOR operation provides more information than just the distance between two vectors; it also reveals the orientation or direction of the line that joins them. Specifically, the operation \(u \veebar v\) yields a vector that lists the bits that need to be changed to transform \(u\) into \(v\) or vice versa. You might also think of the \(1\)s and \(0\)s in the XOR vector as a sequence of directions to be followed to trace a path from \(u\) to \(v\).

XOR has always been my personal favorite among the Boolean functions. It is a difference operator, but unlike subtraction, XOR is symmetric: \(u \veebar v = v \veebar u\). Furthermore, XOR is its own inverse. This concept is easy to understand with functions of a single argument: \(f(x)\) is its own inverse if \(f(f(x)) = x\), so that applying the function twice gets you back to where you started. For a two-argument function such as XOR the situation is more complicated, but it’s still true that doing the same thing twice restores the original state. Specifically, if \(u \veebar v = w\), then \(u \veebar w = v\) and \(v \veebar w = u\). The three vectors \(u\), \(v\), and \(w\) form a tiny, closed universe. You can apply the XOR operator to any pair of them and you’ll get back the third element of the set. Below is my attempt to illustrate this idea. Each square represents a \(10{,}000\)-bit vector arranged as a \(100\)-by-\(100\) tableau of light and dark pixels. The three patterns appear to be random and independent, but hovering with the mouse pointer will show that each panel is in fact the XOR of the other two. For example, in the leftmost square, each red pixel matches either a green pixel or a blue pixel, but not both.
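The same identities can be checked on a pair of random vectors. This is a small sketch; the seed and the \(1{,}000\)-bit length are arbitrary choices of mine.

```julia
using Random

rng = MersenneTwister(3)
u = bitrand(rng, 1000)
v = bitrand(rng, 1000)

w = u .⊻ v                  # 1 wherever u and v differ: the "directions" from u to v
distance = count(w)         # Hamming distance is the number of 1s in the XOR

println("distance from u to v: ", distance)
println("symmetric:     ", w == v .⊻ u)
println("u xor w is v:  ", u .⊻ w == v)
println("v xor w is u:  ", v .⊻ w == u)
```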

The self-inverse property suggests a new way of organizing information in the SDM. Suppose the word *butterfly* and its French equivalent *papillon* are stored as arbitrary, random vectors. They will not be close together; the distance between them is likely to be about 500 bits. Now we compute the XOR of these vectors, *butterfly* ⊻ *papillon*; the result is another vector that can also be stored in the SDM. This new vector encodes the relation *English-French*. Now we are equipped to translate. Given the vector for *butterfly*, we XOR it with the *English-French* vector and get *papillon*. The same trick works in the other direction.

This pair of words and the relation between them forms the nucleus of a semantic network. Let’s grow it a little. We can store the word *caterpillar* at an arbitrary address, then compute *butterfly* ⊻ *caterpillar* and call this new relation *adult-juvenile*. What’s the French for *caterpillar*? It’s *chenille*. We add this fact to the network by storing *chenille* at the address *caterpillar* ⊻ *English-French*. Now some magic happens: If we take *papillon* ⊻ *chenille*, we’ll learn that these words are connected by the relation *adult-juvenile*, even though we did not explicitly state that fact. It is a constraint imposed by the geometry of the construction.
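Here is the four-node construction as a sketch in Julia. The random vectors stand in for whatever encoding the words would really have, and the variable names are mine.

```julia
using Random

rng = MersenneTwister(11)
random_vector() = bitrand(rng, 1000)

# Three words stored at arbitrary random addresses.
butterfly   = random_vector()
papillon    = random_vector()
caterpillar = random_vector()

# Relations are just the XOR of the two vectors they connect.
english_french = butterfly .⊻ papillon
adult_juvenile = butterfly .⊻ caterpillar

# chenille is not random: it goes wherever the English-French relation puts it.
chenille = caterpillar .⊻ english_french

println(butterfly .⊻ english_french == papillon)    # translate English to French
println(papillon .⊻ english_french == butterfly)    # and back again
println(papillon .⊻ chenille == adult_juvenile)     # the relation we never stated
```

The last line works because \(papillon \veebar chenille\) expands to \(papillon \veebar caterpillar \veebar butterfly \veebar papillon\), and the two copies of *papillon* cancel.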

The graph could be extended further by adding more English-French translation pairs (*dog-chien, horse-cheval*) or more adult-juvenile pairs (*dog-puppy, tree-sapling*). And there are plenty of other relations to be explored: synonyms, antonyms, siblings, cause-effect, predator-prey, and so on. There’s also a sweet way of linking a set of events into a chronological sequence, just by XORing the addresses of a node’s predecessor and successor.

The XOR method of linking concepts is a hybrid of geometry and graph theory. In ordinary mathematical graph theory, distances and directions are irrelevant; all that matters is the presence or absence of connecting edges between nodes. In the SDM, on the other hand, the edge representing a relation between nodes is a vector of definite length and orientation within the \(1{,}000\)-dimensional space. Given a node and a relation, the XOR operation “binds” that node to a specific position elsewhere in the hypercube. The resulting structure is completely rigid; you can’t move a node without changing all the relations it participates in. In the case of the butterflies and caterpillars, the configuration of four nodes is necessarily a parallelogram, with pairs of opposite sides that have the same length and orientation.

Another distinctive feature of the XOR-linked graph is that the nodes and the edges have exactly the same representation. In most computer implementations of graph-theoretical ideas, these two entities are quite different; a node might be a list of attributes, and an edge would be a pair of pointers to the nodes it connects. In the SDM, both nodes and edges are simply high-dimensional vectors. Both can be stored in the same format.

As a model of human memory, XOR binding offers the prospect of connecting any two concepts through any relation we can invent. But the scheme also has some deficiencies. Many real-world relations are asymmetric; they don’t share the self-inverse property of XOR. An XOR vector can declare that Edward and Victoria are parent and child, but it can’t tell you which is which. Worse, the XOR vector connects exactly two nodes, never more, so a parent of multiple children faces an awkward predicament. Another challenge is keeping all the branches of a large graph consistent with one another. You can’t just add nodes and edges willy-nilly; they must be joined to the graph in the right order. Inserting a pupal stage between the butterfly and the caterpillar would require rewiring most of the diagram, moving several nodes to new locations within the hypercube and recalculating the relation vectors that connect them, all the while taking care that each change on the English side is mirrored correctly on the French side.

Some of these issues are addressed in another XOR-based technique that Kanerva calls bundling. The idea is to create a kind of database by storing attribute-value pairs. An entry for a book might have attributes such as *author*, *title*, and *publisher*, each of which is paired with a corresponding value. The first step in bundling the data is to separately XOR each attribute-value pair. Then the vectors resulting from these operations are combined to form a single sum vector, using the same algorithm described above for storing multiple vectors in a hard location of the SDM. Taking the XOR of an attribute name with this combined vector will extract an approximation to the corresponding value, close enough to identify it by the recursive recall method. In experiments with the canonical model I found that a single \(1{,}000\)-bit vector could hold six or seven attribute-value pairs without much risk of confusion.
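A rough sketch of bundling follows, with two simplifications that are my own assumptions: the sum vector is formed by a plain bitwise majority vote over an odd number of pairs (so there are no ties to break), and the cleanup step is a nearest-neighbor search over the known values rather than recursive recall in a full SDM. The attribute names are hypothetical.

```julia
using Random

rng = MersenneTwister(5)
random_vector() = bitrand(rng, 1000)
hamming_distance(u, v) = count(u .⊻ v)

# Hypothetical record with five attribute-value pairs, all random vectors.
attribute_names = ["author", "title", "publisher", "year", "city"]
attribute = Dict(name => random_vector() for name in attribute_names)
value     = Dict(name => random_vector() for name in attribute_names)

# Bind each pair with XOR, then bundle with a bitwise majority vote.
counts = zeros(Int, 1000)
for name in attribute_names
    counts .+= attribute[name] .⊻ value[name]
end
bundle = counts .> length(attribute_names) / 2

# XORing an attribute with the bundle yields a noisy copy of its value.
noisy = attribute["title"] .⊻ bundle

# Cleanup: pick the known value closest to the noisy result.
distances = [hamming_distance(noisy, value[name]) for name in attribute_names]
best = attribute_names[argmin(distances)]
println("distances: ", distances)
println("closest value belongs to: ", best)
```

The noisy result should sit noticeably closer to the right value than to any of the others, which hover around the 500-bit distance expected of unrelated vectors.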

Binding and bundling are not mentioned in Kanerva’s 1988 book, but he discusses them in detail in several more recent papers. (See Further Reading, below.) He points out that with these two operations the set of high-dimensional vectors acquires the structure of an algebraic field—or at least an approximation to a field. The canonical example of a field is the set of real numbers together with the operations of addition and multiplication and their inverses. The reals form a closed set under these operations: Adding, subtracting, multiplying or dividing any two real numbers yields another real number (except for division by zero, which is always the joker in the pack). Likewise a set of binary vectors is closed under binding and bundling, except that sometimes the result extracted from a bundled vector has to be “cleaned up” by the recursive recall process in order to recover a member of the set.

Can binding and bundling offer any help when we try to devise a woolgathering algorithm? They provide some basic tools for navigating through a semantic graph, including the possibility of performing a random walk. Starting from any node of an XOR-linked graph, a random-walk algorithm chooses from among all the relations available at that node. Selecting a relation vector at random and XORing it with the address of the node leads to a different node, where the procedure can be repeated. Similarly, in bundled attribute-value pairs, a randomly selected attribute calls forth the corresponding value, which becomes the next node to explore.
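A random walk of this kind can be sketched on the four-node butterfly graph. The sketch leans on one large assumption: the list of relations and the table of node names are handed to the algorithm from outside the memory.

```julia
using Random

rng = MersenneTwister(8)
random_vector() = bitrand(rng, 1000)

# The four-node graph: three random vectors plus one placed by binding.
butterfly, papillon, caterpillar = random_vector(), random_vector(), random_vector()
english_french = butterfly .⊻ papillon
adult_juvenile = butterfly .⊻ caterpillar
chenille = caterpillar .⊻ english_french

# Assumption: node names and the list of relations live outside the memory.
node_names = Dict(butterfly => "butterfly", papillon => "papillon",
                  caterpillar => "caterpillar", chenille => "chenille")
relations = [english_french, adult_juvenile]

function random_walk(start::BitVector, steps::Int)
    path = [node_names[start]]
    node = start
    for _ in 1:steps
        node = node .⊻ rand(rng, relations)   # follow a randomly chosen relation
        push!(path, node_names[node])
    end
    return path
end

path = random_walk(butterfly, 6)
println(join(path, " -> "))
```

Every step lands on one of the four nodes because the graph is closed under both relations; in a larger graph some node-relation combinations would lead nowhere and would need a cleanup or rejection step.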

But how does the algorithm know which relations or which attributes are available for choosing? The relations and attributes are represented as vectors and stored in the memory just like any other objects, but there is no obvious means of retrieving those vectors unless you already know what they are. You can’t say to the memory, “Show me all the relations.” You can only present a pattern and ask, “Is this vector present? Have you seen it or something like it?”

With a conventional computer memory, you can do a core dump: Step through all the addresses and print out the value found at each location. There’s no such procedure for a distributed memory. I learned this troubling fact the hard way. While building a computational model of the SDM, I got the pieces working well enough that I could store a few thousand randomly generated patterns in the memory. But I could not retrieve them, because I didn’t know what to ask for. The solution was to maintain a separate list, outside the SDM itself, keeping a record of everything I stored. But it seems farfetched to suppose that the brain would maintain both a memory and an index to that memory. Why not just use the index, which is so much simpler?

In view of this limitation, it seems that sparse distributed memory is equipped to serve the senses but not the imagination. It can recognize familiar patterns and store novel ones, which will then be recognized when next encountered, even from partial or corrupted cues. With binding or bundling, the memory can also keep track of relations between pairs of stored items. But whatever is put into the memory can be gotten out only by supplying a suitable cue.

When I look at the publicity poster for *The Graduate*, I see Dustin Hoffman, more leery than leering, regarding the stockinged leg of Anne Bancroft, who plays Mrs. Robinson. This visual stimulus excites several subsets of neurons in my cerebral cortex, corresponding to my memories of the actors, the characters, the story, the soundtrack, the year 1967. All of this brain activity might be explained by the SDM memory architecture, if we grant that subsets of neurons can be represented in some abstract way by long, random binary vectors. What’s not so readily explained is how I can summon to mind all the same sensations without having the image in front of me. How do I draw those particular long, random sequences out of the great tangle of vectors without already knowing where they are?

So ends my long ramble, on a note of doubt and disappointment. It’s hardly surprising that I have failed to get to the bottom of it all. These are deep waters.

On the very first day of the Simons brain-and-computation program, Jeff Lichtman, who is laboring to trace the wiring diagram of the mouse brain, asked whether neuroscience has yet had its Watson-Crick moment. In molecular genetics we have reached the point where we can extract a strand of DNA from a living cell and read many of its messages. We can even write our own messages and put them back into an organism. The equivalent capability in neuroscience would be to examine a hunk of brain tissue and read out the information stored there: the knowledge, the memories, the world view. Maybe we could also write information directly into the brain.

Science is not even close to achieving this feat—to the great relief of many. That includes me: I don’t look forward to having my thoughts sucked out of my head through electrodes or pipettes, to be replaced with #fakenews. However, I really *do* want to know how the brain works.

The Simons program left me dazzled by recent progress in neuroscience, but it also revealed that some of the biggest questions remain wide open. The connectomics projects of Lichtman and others are producing a detailed map of millions of neurons and their interconnections. New recording techniques allow us to listen in on the signals emitted by individual nerve cells and to follow waves of excitation across broad regions of the brain. We have a pretty comprehensive catalogue of neuron types, and we know a lot about their physiology and biochemistry. All this is impressive, but so are the mysteries. We can record neural signals, but for the most part we don’t know what they mean. We don’t know how information is encoded or stored in the brain. It’s rather like trying to understand the circuitry of a digital computer without knowing anything of binary arithmetic or Boolean logic.

Pentti Kanerva’s sparse distributed memory is an attempt to fill in some of these gaps. It is not the only such attempt. A better-known alternative is John Hopfield’s conception of a neural network as a dynamical system settling into an energy-minimizing attractor. The two ideas have some basic principles in common: Information is scattered across large numbers of neurons, and it is encoded in a way that would not be readily understood by an outside observer, even one with access to all the neurons and the signals passing between them. Schemes of this kind, essentially mathematical and computational, occupy a conceptual middle ground between high-level psychology and low-level neural engineering. It’s the layer where the meaning is.

Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. *Proceedings of the National Academy of Sciences* 79(8):2554–2558.

Kanerva, Pentti. 1988. *Sparse Distributed Memory*. Cambridge, Mass.: MIT Press.

Kanerva, Pentti. 1996. Binary spatter-coding of ordered *K*-tuples. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen and B. Sendhoff, eds. *Artificial Neural Networks—ICANN 96 Proceedings*, pp. 869–873. Berlin: Springer.

Kanerva, Pentti. 2000. Large patterns make great symbols: An example of learning from example. In S. Wermter and R. Sun, eds. *Hybrid Neural Systems*, pp. 194–203. Heidelberg: Springer.

Kanerva, Pentti. 2009. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. *Cognitive Computation* 1(2):139–159.

Kanerva, Pentti. 2010. What we mean when we say “What’s the Dollar of Mexico?”: Prototypes and mapping in concept space. Report FS-10-08-006, AAAI Fall Symposium on Quantum Informatics for Cognitive, Social, and Semantic Processes.

Kanerva, Pentti. 2014. Computing with 10,000-bit words. Fifty-second Annual Allerton Conference, University of Illinois at Urbana-Champaign, October 2014.

Plate, Tony. 1995. Holographic reduced representations. *IEEE Transactions on Neural Networks* 6(3):623–641.

Plate, Tony A. 2003. *Holographic Reduced Representation: Distributed Representation of Cognitive Structure*. Stanford, CA: CSLI Publications.

Rahimi, Abbas, Sohum Datta, Denis Kleyko, E. Paxon Frady, Bruno Olshausen, Pentti Kanerva, and Jan M. Rabaey. 2017. High-dimensional computing as a nanoscalable paradigm. *IEEE Transactions on Circuits and Systems* 64(9):2508–2521.

- Donald O. Hebb’s *The Organization of Behavior: A Neuropsychological Theory.* This is the book that introduced a fundamental hypothesis about learning and memory, captured in the slogan “Neurons that fire together get wired together.”
- Norbert Wiener’s *Cybernetics: or, Control and Communication in the Animal and the Machine*, an eccentric and wide-ranging masterpiece with a crucial chapter on “Computing Machines and the Nervous System.”
- Claude Shannon’s *The Mathematical Theory of Communication*, the foundational document of information theory. (Shannon’s part of this work had appeared a year earlier in the *Bell System Technical Journal*; the book version includes an interpretive essay by Warren Weaver.)

When I got the three volumes home, I made a surprising discovery: They were all published at roughly the same time, in 1948 and 1949. What are the odds of that? Perhaps it means nothing—just the long arm of coincidence reaching out to tap me on the shoulder. On the other hand, maybe there was something in the air circa 1950, something that made the period unusually fertile for studies of information, communication, and computation in brains and machines.

I have done a little digging in library catalogues and Wikipedia, as well as in my own files, looking for other titles that might belong on this list of distinguished midcentury milestones.

It turns out that George Kingsley Zipf’s *Human Behavior and the Principle of Least Effort* was also published in 1949. (This is the one about the curious power-law distribution seen in rankings of word frequencies, city sizes, and so on.)

Gilbert Ryle’s *The Concept of Mind* is another 1949 title, though I’ve never read it. Also from 1949: Nicholas Metropolis and Stanislaw Ulam published the first open account of the Monte Carlo method.

Drifting forward into 1950, we find another cluster of notables. There is John Nash’s one-page paper introducing what we now call the Nash equilibrium. Elsewhere in game theory, 1950 was the debut year for the prisoner’s dilemma, although Merrill Flood’s paper describing it did not appear until two years later. Richard Hamming published “Error Detecting and Error Correcting Codes” in 1950. (It’s another paper from the *Bell System Technical Journal*.) Finally, there’s Alan M. Turing’s famous essay on “Computing Machinery and Intelligence.”

Does the density of high-octane publications really make 1948–50 an exceptional season of intellectual history? I can’t offer any solid statistical support for that notion. In the first place, my criteria for inclusion on the list are way too vague. (“Subjects I find interesting” may be closest to the truth.) In the second place, I can’t offer any evidence that other intervals were not equally productive. As a matter of fact, in my bibliographic rummaging I came across a nexus of brilliance five years earlier:

- Warren S. McCulloch and Walter H. Pitts, “A logical calculus of the ideas immanent in nervous activity,” 1943.
- John von Neumann and Oskar Morgenstern, *Theory of Games and Economic Behavior*, 1944.
- Erwin Schrödinger, *What Is Life? The Physical Aspect of the Living Cell*, 1944.
- Vannevar Bush, “As We May Think,” 1945.
- John von Neumann, “First Draft of a Report on the EDVAC,” 1945.

I acknowledge a further reason for caution when I cite 1949 as a year of special distinction. It’s *my* year, the year of my birth.

In 1994 a document called the QED Manifesto made the rounds of certain mathematical mailing lists and Usenet groups.

> QED is the very tentative title of a project to build a computer system that effectively represents all important mathematical knowledge and techniques. The QED system will conform to the highest standards of mathematical rigor, including the use of strict formality in the internal representation of knowledge and the use of mechanical methods to check proofs of the correctness of all entries in the system.

The ambitions of the QED project—and its eventual failure—were front and center in a talk by Thomas Hales (University of Pittsburgh) on Formal Abstracts in Mathematics. Hales is proposing another such undertaking: A comprehensive database of theorems and other mathematical propositions, along with the axioms, assumptions, and definitions on which the theorems depend, all represented in a formal notation readable by both humans and machines. Unlike QED, however, these “formal abstracts” would *not* include proofs of the theorems. Excluding proofs is a huge retreat from the aims of the QED group, but Hales argues that it’s necessary to make the project feasible with current technology.

Hales has plenty of experience in this field. In 1998 he announced a proof of the Kepler conjecture—the assertion that the grocer’s stack of oranges embodies the densest possible arrangement of equal-size spheres in three-dimensional space. Hales’s proof was long and complex, so much so that it stymied the efforts of journal referees to fully check it. Hales and 21 collaborators then spent a dozen years constructing a formal, computer-mediated verification of the proof.

What’s the use of a database of mathematical assertions if it doesn’t include proofs? Hales held out several potential benefits, two of which I found particularly appealing. First, the database could answer global questions about the mathematical literature; one could ask, “How many theorems depend on the Riemann hypothesis?” Second, the formal abstracts would capture the meaning of mathematical statements, not just their surface form. A search for all mentions of the equation \(x^m - y^n = 1\) would find instances that use symbols other than \(x, y, m, n,\) or that take slightly different forms, such as \(x^m - 1 = y^n\).

Hales’s formal abstracts sound intriguing, but I have to confess to a certain level of disappointment and bafflement. All around us, triumphant machines are conquering one domain after another—chess, go, poker, Jeopardy, the driver’s seat. But not proofs, apparently.

Am I the last person in the whole republic of numbers to learn that Sperner’s lemma is a discrete version of the Brouwer fixed-point theorem? Francis Su and John Stillwell clued me in.

The lemma—first stated in 1928 by the German mathematician Emanuel Sperner—seems rather narrow and specialized, but it turns up everywhere. It concerns a triangle whose vertices are assigned three distinct colors:

Divide the triangle into smaller triangles, constrained by two rules. First, no edge or segment of an edge can be part of more than two triangles. Second, if a vertex of a new small triangle lies on an edge of the original main triangle, the new vertex must be given one of the two colors found at the end points of that main edge. For example, a vertex along the red-green edge on the left side of the main triangle must be either red or green. Vertices strictly inside the main triangle can be given any of the three colors, without restriction.

The lemma states that at least one interior triangle must have a full complement of red, green, and blue vertices. Actually, the lemma’s claim is slightly stronger: The number of trichromatic inner triangles must be odd. In the augmented diagram below, adding a single new red vertex has created two more RGB triangles, for a total of three.

Su gave a quick proof of the lemma. Consider the set of all edge segments that have one red and one green endpoint. On the exterior boundary of the large triangle, such segments can appear only along the red-green edge, and there must be an odd number of them. Now draw a path that enters the large triangle from the outside, that crosses only red-green segments, and that crosses each such segment at most once.

One possible fate of this RG path is to enter through one red-green segment and exit through another. But since the number of red-green segments on the boundary is odd, there must be at least one path that enters the large triangle and never exits. The only way it can become trapped is to enter a red-green-blue triangle. (There’s nothing special about red-green segments, so this argument also holds for paths crossing red-blue and blue-green segments.)
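The parity claim is easy to test by brute force. The Python sketch below is an illustration added here, not something from Su’s talk. It assumes the simplest possible subdivision, a regular grid that cuts each side of the main triangle into `n` equal segments, and it colors the vertices at random subject to the boundary rule. The function names `sperner_coloring` and `count_trichromatic` are hypothetical, and the colors 0, 1, and 2 stand in for red, green, and blue.

```python
import random

def sperner_coloring(n, rng):
    """Randomly color the vertices of a regular n-subdivided triangle.

    A vertex has integer barycentric coordinates (a, b, c) with a + b + c = n.
    It may take color i only if coordinate i is nonzero. That forces the three
    corners to have distinct colors and restricts a vertex on a main edge to
    the two colors at the ends of that edge; interior vertices are free.
    """
    colors = {}
    for a in range(n + 1):
        for b in range(n + 1 - a):
            coords = (a, b, n - a - b)
            allowed = [i for i, x in enumerate(coords) if x > 0]
            colors[(a, b)] = rng.choice(allowed)
    return colors

def count_trichromatic(n, colors):
    """Count the small triangles whose three vertices show all three colors."""
    count = 0
    for a in range(n):
        for b in range(n - a):
            # Small triangle pointing the same way as the main triangle.
            up = {colors[(a, b)], colors[(a + 1, b)], colors[(a, b + 1)]}
            if len(up) == 3:
                count += 1
            # Inverted small triangle sharing the edge (a+1, b)-(a, b+1).
            if a + b <= n - 2:
                down = {colors[(a + 1, b)], colors[(a, b + 1)],
                        colors[(a + 1, b + 1)]}
                if len(down) == 3:
                    count += 1
    return count

if __name__ == "__main__":
    rng = random.Random()
    for n in (2, 5, 10, 20):
        k = count_trichromatic(n, sperner_coloring(n, rng))
        print(f"n = {n:2d}: {k} trichromatic triangles (odd: {k % 2 == 1})")
```

Whatever the random seed, the count should come out odd. A regular grid is only one of the subdivisions the lemma covers, so this is a spot check rather than a proof.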

So much for Sperner’s lemma. What do these nested triangles have to do with the Brouwer fixed-point theorem? That theorem operates in a continuous domain, which seems remote from the discrete network of Sperner’s triangulated triangle.

As the story goes (I can’t vouch for its provenance), L. E. J. Brouwer formulated his theorem at the breakfast table. Stirring his coffee, he noticed that there always seemed to be at least one stationary point on the surface of the moving liquid. He was able to prove this fact not just for the interior of a coffee cup but for any bounded, closed, and convex region, and not just for circular motion but for any continuous function that maps points within such a region to points in the same region. For each such function \(f\), there is a point \(p\) such that \(f(p) = p\).

Brouwer’s fixed-point theorem was a landmark in the development of topology, and yet Brouwer himself later renounced the theorem—or at least his proof of it, because the proof was nonconstructive: It gave no procedure for finding or identifying the fixed point. John Stillwell argues that a proof based on Sperner’s lemma comes as close as possible to a constructive proof, though it would still have left Brouwer unsatisfied.

The proof relies on the same kind of paths represented by yellow arrows in the diagram above. At least one such path comes to an end inside a tri-colored triangle, which Sperner’s lemma shows must exist in any properly colored triangulated network. If we continue subdividing the triangles under the Sperner rules, and proceed to the limit where the edge lengths go to zero, then the path ends at a single, stationary point. (It’s the “proceed to the limit” step that Brouwer would not have liked.)
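In one dimension the same strategy shrinks to a few lines of code. The sketch below is a one-dimensional analogue offered for illustration, not Stillwell’s two-dimensional argument. It assumes a continuous function `f` that maps the interval [0, 1] into itself, and it labels each point according to whether `f` moves it rightward (`f(x) >= x`) or leftward. The left endpoint can only carry the first label and the right endpoint can only carry the second (or be fixed outright), so some subinterval must have differently labeled ends; halving repeatedly is the “proceed to the limit” step. The name `approx_fixed_point` and the tolerance are assumptions.

```python
import math

def approx_fixed_point(f, lo=0.0, hi=1.0, tol=1e-9):
    """Approximate a fixed point of a continuous f mapping [lo, hi] into itself.

    Invariant: f(lo) >= lo and f(hi) <= hi, so the two ends of the current
    interval always carry opposite labels (or one of them is already fixed).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) >= mid:
            lo = mid   # mid is pushed rightward (or fixed): keep the right half
        else:
            hi = mid   # mid is pushed leftward: keep the left half
    return (lo + hi) / 2

if __name__ == "__main__":
    # cos maps [0, 1] into [cos 1, 1], which lies inside [0, 1].
    p = approx_fixed_point(math.cos)
    print(p, math.cos(p))
```

Note that the loop never delivers the fixed point itself, only an interval of width `tol` that must contain one, which is roughly the gap that would have bothered Brouwer.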

You have five muffins to share among three students; let’s call the students April, May, and June. One solution is to give each student one whole muffin, then divide the remaining two muffins into pieces of size one-third and two-thirds. Then the portions are divvied up as follows:

This allotment is quantitatively fair, in that each student receives five-thirds of a muffin, but June complains that her two small pieces are less appetizing than the others’ larger ones. She feels she’s been given leftover crumbs. Hence the division is not envy-free.

There are surely many ways of addressing this complaint. You might cut *all* the muffins into pieces of size one-third, and give each student five equal pieces. Or you might give each student a muffin and a half, then eat the leftover half yourself. These are practical and sensible strategies, but they are not what Bill Gasarch was seeking when he gave a talk on the problem Saturday afternoon. Gasarch asked a specific question: What is the maximum size of the minimum piece? Can we do better than one-third?

The answer is yes. Here is a division that cuts one muffin in half and divides each of the other four muffins into portions of size seven-twelfths and five-twelfths. April and May each get \(\frac{1}{2} + \frac{7}{12} + \frac{7}{12}\); June gets \(4 \times \frac{5}{12}\).

Five-twelfths is larger than one-third, and thus should seem less crumby. Indeed, Gasarch and his colleagues have proved five-twelfths is the best result possible: It is the maximum of the minimum. (Nevertheless, I worry that June may still be unhappy. Her portion is cut up into four pieces, whereas the others get three pieces each; furthermore, all of June’s pieces are smaller than April’s and May’s. Again, however, these concerns lie outside the scope of the mathematical problem.)
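The arithmetic of this division is easy to confirm with exact fractions. The short Python check below is only a bookkeeping sketch (the variable names are mine, not Gasarch’s): it verifies that the pieces reassemble into five whole muffins, that every student receives five-thirds, and that the smallest piece is five-twelfths. It does not address the hard part, which is proving that no other division does better.

```python
from fractions import Fraction as F

half, big, small = F(1, 2), F(7, 12), F(5, 12)

# One muffin is cut into two halves; each of the other four into 7/12 + 5/12.
shares = {
    "April": [half, big, big],
    "May": [half, big, big],
    "June": [small, small, small, small],
}

all_pieces = [piece for pieces in shares.values() for piece in pieces]
total = sum(all_pieces)          # should be exactly 5 muffins
smallest = min(all_pieces)       # the quantity the muffin problem maximizes

for name, pieces in shares.items():
    print(name, sum(pieces))
print("total:", total, "smallest piece:", smallest)
```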

A key observation is that the smallest piece can never be larger than one-half. This is thunderously obvious once you know it, but I failed to see it when I first started thinking about the problem.

Fair-division problems have a long history (going back at least as far as the Talmud), and cake-cutting versions have been proliferating for decades. A 1961 article by L. E. Dubins and E. H. Spanier (*American Mathematical Monthly* 68:1–17) inspired much further work. There are even connections with Sperner’s lemma. Nevertheless, the genre is not exhausted yet; the muffin problem seems to be a new wrinkle. Gasarch and six co-authors (three of them high school students) have prepared a 166-page manuscript describing a year’s worth of labor on the problem, with optimal results for all instances with up to six students (and any number of muffins), as well as upper and lower bounds on solutions to larger instances, and various conjectures on open problems.

Long-time readers of bit-player may remember that Gasarch has been mentioned here before. Back in 2009 he offered (and eventually paid) \($17^2\) for a four-coloring of a 17-by-17 lattice such that no four lattice points forming a rectangle all have the same color. That problem attracted considerable attention both here and on Gasarch’s own Computational Complexity blog (conducted jointly with Lance Fortnow).

Note: In the comments Jim Propp points out that the muffin problem was invented by Alan Frank. The omission of this fact is my fault; Gasarch mentions it in his paper. The problem’s first appearance in print seems to be in a *New York Times* Numberplay column by Gary Antonick. Frank’s priority is acknowledged only in a footnote, which seems unfair. I apologize for again giving him credit only as an afterthought.

Last week I spent five days in the driver’s seat, crossing the country from east to west, mostly on Interstate 80. I’ve made the trip before, though never on this route. In particular, the 900-mile stretch from Lincoln, Nebraska, across the southern tier of Wyoming, and down to Salt Lake City was new to me.

Driving is a task that engages only a part of one’s neural network, so the rest of the mind is free to wander. On this occasion my thoughts took a political turn. After all, I was boring through the bright red heart of America. Especially in Wyoming.

Based on the party affiliations of registered voters, Wyoming is far and away the most Republican state in the union, with the party claiming the allegiance of two-thirds of the electorate. The Democrats have 18 percent. A 2013 Gallup poll identified Wyoming as the most “conservative” state, with just over half those surveyed preferring that label to “moderate” or “liberal.”

The other singular distinction of Wyoming is that it has the smallest population of all the states, estimated at 579,000. The entire state has fewer people than many U.S. cities, including Albuquerque, Milwaukee, and Baltimore. The population density is a little under six people per square mile.

I looked up these numbers while staying the night in Laramie, the state’s college town, and I was mulling them over as I continued west the next morning, climbing through miles of rolling grassland and sagebrush with scarcely any sign of human habitation. A mischievous thought came upon me. What would it take to flip Wyoming? If we could somehow induce 125,000 liberal voters to take up legal residence here, the state would change sides. We’d have two more Democrats in the Senate, and one more in the House. Berkeley, California, my destination on this road trip, has a population of about 120,000. Maybe we could persuade everyone in Berkeley to give up Chez Panisse and Moe’s Books, and build a new People’s Republic somewhere on Wyoming’s Medicine Bow River.

Let me quickly interject: This is a daydream, or maybe a nightmare, and not a serious proposal. Colonizing Wyoming for political purposes would not be a happy experience for either the immigrants or the natives. The scheme belongs in the same category as a plan announced by a former Mormon bishop to build a new city of a million people in Vermont. (Vermont has a population of about 624,000, the second smallest among U.S. states.)

Rather than trying to flip Wyoming, maybe one should try to fix it. *Why* is it the least populated state, and the most Republican? Why is so much of the landscape vacant? Why aren’t entrepreneurs with dreams of cryptocurrency fortunes flocking to Cheyenne or Casper with their plans for startup companies?

The experience of driving through the state on I-80 suggests some answers to these questions. I found myself wondering how even the existing population of a few hundred thousand manages to sustain itself. Wikipedia says there’s some agriculture in the state (beef, hay, sugar beets), but I saw little evidence of it. There’s tourism, but that’s mostly in the northwest corner, focused on Yellowstone and Grand Teton national parks and the cowboy-chic enclave of Jackson Hole. The only conspicuous economic activity along the I-80 corridor is connected with the mining and energy industries. My very first experience of Wyoming was olfactory: Coming downhill from Pine Bluffs, Nebraska, I caught a whiff of the Frontier oil refinery in Cheyenne; as I got closer to town, I watched the sun set behind a low-hanging purple haze that might also be refinery-related. The next day, halfway across the state, the Sinclair refinery announced itself in a similar way.

Still farther west, coal takes over where oil leaves off. The Jim Bridger power plant, whose stacks and cooling-tower plumes are visible from the highway, burns locally mined coal and exports the electricity.

As the author of a book celebrating industrial artifacts, I’m hardly the one to gripe about the presence of such infrastructure. On the other hand, oil and coal are not much of a foundation for a modern economy. Even with all the wells, the pipelines, the refineries, the mines, and the power plants, Wyoming employment in the “extractive” sector is only about 24,000 (or 7 percent of the state’s workforce), down sharply from a peak of 39,000 in 2008. If this is the industry that will build the state’s future, then the future looks bleak.

Economists going all the way back to Adam Smith have puzzled over the question: Why do some places prosper while others languish? Why, for example, are Denver and Boulder so much livelier than Cheyenne and Laramie? The Colorado cities and the Wyoming ones are only about 100 miles apart, and they share similar histories and physical environments. But Denver is booming, with a diverse and growing economy and a population approaching 700,000—greater than the entire state of Wyoming. Cheyenne remains a tenth the size of Denver, and in Cheyenne you don’t have to fight off hordes of hipsters to book a table for dinner. What makes the difference? I suspect the answer lies in a Yogi Berra phenomenon. Everybody wants to go to Denver because everyone is there already. Nobody wants to be in Cheyenne because it’s so lonely. If this guess is correct, maybe we’d be doing Wyoming a favor by bringing in that invasion of 125,000 sandal-and-hoodie–clad bicoastals.

One more Wyoming story. At the midpoint of my journey across the state, near milepost 205 on I-80, I passed the sign shown at left. I am an aficionado of continental divide crossings, and so I took particular note. Then, 50 miles farther along, I passed another sign, shown at right. On seeing this second crossing, I put myself on high alert for a *third* such sign. This is a matter of simple topology, or so I thought. If a line—perhaps a very wiggly one—divides an area into two regions, then if you start in one region and end up in the other, you must have crossed the line an odd number of times. Shown below are some possible configurations. In each case the red line is the path of the continental divide, and the dashed blue line is the road’s trajectory across it. At far left the situation is simple: The road intersects the divide in a single point. The middle diagram shows three crossings; it’s easy to see how further elaboration of the meandering path could yield five or seven or any odd number of crossings. An arrangement that might seem to generate just two crossings is shown at right. One of the “crossings” is not a crossing at all but a point of tangency. Depending on your taste in such matters, the tangent intersection could be counted as crossing the divide twice or not at all; in either case, the total number of crossings remains odd.

In the remainder of my trip I never saw a sign marking a third crossing of the divide. The explanation has nothing to do with points of tangency. I should have known that, because I’ve actually written about this peculiarity of Wyoming topography before. Can you guess what’s happening? Wikipedia tells all.

Twenty years ago, Kimberly-Clark, the Kleenex company, introduced a line of toilet paper embossed with the kite-and-dart aperiodic tiling discovered by Roger Penrose. When I first heard about this, I thought: How clever. Because the pattern never repeats, the creases in successive layers of a roll would never line up over any extended region, and so the sheets would be less likely to stick together.

Sir Roger Penrose had a different response. Apparently he believes the pattern is subject to copyright protection, and he also managed to get a patent issued in 1979, although that would have expired about the time of the toilet paper scandal. Penrose assigned his rights to a British company called Pentaplex Ltd. An article in the *Times* of London quoted a representative of Pentaplex:

So often we read of very large companies riding roughshod over small businesses or individuals, but when it comes to the population of Great Britain being invited by a multinational [company] to wipe their bottoms on what appears to be the work of a knight of the realm without his permission, then a last stand must be made.

Sir Roger sued. I haven’t been able to find a documented account of how the legal action was resolved, but it seems Kimberly-Clark quickly withdrew the product.

Some years ago I was given a small sample of the infamous Penrose toilet paper. It came to me from Phil and Phylis Morrison; a note from Phylis indicates that they acquired it from Marion Walter. Now I would like to pass this treasure on to a new custodian. The specimen is unused though not pristine, roughly a foot long, and accompanied by a photocopy of the abovementioned *Times* news item. In the photograph below I have boosted the contrast to make the raised ridges more visible; in real life the pattern is subtle.

Are you interested in artifacts with unusual symmetries? Would you like to add this object to your collection? Send a note with a U.S. mailing address to brian@bit-player.org. If I get multiple requests, I’ll figure out some Solomonic procedure for choosing the recipient(s). If there are no takers, I guess I’ll use it for its intended purpose.

I must also note that my hypothesis about the special non-nesting property of the embossed paper is totally bogus. In the first place, a roll of toilet paper is an Archimedean spiral, so that the circumference increases from one layer to the next; even a perfectly regular pattern will come into coincidence with itself only when the circumference equals an integer multiple of the pattern period. Second, the texture imprinted on the toilet paper is surely not a real aperiodic tiling. The manufacturing process would have involved passing the sheet between a pair of steel crimping cylinders bearing the incised network of kites and darts. Those cylinders are necessarily of finite diameter, and so the pattern must in fact repeat. If Kimberly-Clark had contested the lawsuit, they might have used that point in their defense.

A year later, after the Mosaic browser came on the scene, my eyes were opened. I wrote a gushing article on the marvels of the WWW.

There have long been protocols for transferring various kinds of information over the Internet, but the Web offers the first seamless interface to the entire network . . . The Web promotes the illusion that all resources are at your fingertips; the universe of information is inside the little box that sits on your desk.

I was still missing half the story. Yes, the web (which has since lost its capital *W*) opened up an amazing portal onto humanity’s accumulated storehouse of knowledge. But it did something else as well: It empowered all of us to put our own stories and ideas before the public. Economic and technological barriers were swept away; we could all become creators as well as consumers. Perhaps for the first time since Gutenberg, public communication became a reasonably symmetrical, two-way social process.

The miracle of the web is not just that the technology exists, but that it’s accessible to much of the world’s population. The entire software infrastructure is freely available, including the HTTP protocol that started it all, the languages for markup, styling, and scripting (HTML, CSS, JavaScript), server software (Apache, Nginx), content-management systems such as WordPress, and also editors, debuggers, and other development tools. Thanks to this community effort, I get to have my own little broadcasting station, my personal media empire.

But can it last?

In the U.S., the immediate threat to the web is the repeal of net-neutrality regulations. Under the new rules (or non-rules), Internet service providers will be allowed to set up toll booths and roadblocks, fast lanes and slow lanes. They will be able to expedite content from favored sources (perhaps their own affiliates) and impede or block other kinds of traffic. They could charge consumers extra fees for access to some sites, or collect back-channel payments from publishers who want preferential treatment. For a glimpse of what might be in store, a New York *Times* article looks at some recent developments in Europe. (The European Union has its own net-neutrality law, but apparently it’s not being consistently enforced.)

The loss of net neutrality has elicited much wringing of hands and gnashing of teeth. I’m as annoyed as the next netizen. But I also think it’s important to keep in mind that the web (along with the internet more generally) has always lived at the edge of the precipice. Losing net neutrality will further erode the foundations, but it is not the only threat, and probably not the worst one.

Need I point out that the internet lost its innocence a long time ago? In the early years, when the network was entirely funded by the federal government, most commercial activity was forbidden. That began to change circa 1990, when crosslinks with private-enterprise networks were put in place, and the general public found ways to get online through dial-up links. The broadening of access did not please everyone. Internet insiders recoiled at the onslaught of clueless newbies (like me); commercial network operators such as CompuServe and America Online feared that their customers would be lured away by a heavily subsidized competitor. Both sides were right about the outcome.

As late as 1994, hucksterism on the internet was still a social transgression if not a legal one. Advertising, in particular, was punished by vigorous and vocal vigilante action. But the cause was already lost. The insular, nerdy community of internet adepts was soon overwhelmed by the dot-com boom. Advertising, of course, is now the engine that drives most of the largest websites.

Commerce also intruded at a deeper level in the stack of internet technologies. When the internet first became *inter*—a network of networks—bits moved freely from one system to another through an arrangement called peering, in which no money changed hands. By the late 1990s, however, peering was reserved for true peers—for networks of roughly the same size. Smaller carriers, such as local ISPs, had to pay to connect to the network backbone. These pay-to-play arrangements were never affected by network neutrality rules.

Express lanes and tolls are also not a novelty on the internet. Netflix, for example, pays to place disk farms full of videos at strategic internet nodes around the world, reducing both transit time and network congestion. And Google has built its own private data highways, laying thousands of miles of fiber optic cable to bypass the major backbone carriers. If you’re not Netflix or Google, and you can’t quite afford to build your own global distribution system, you can hire a content delivery network (CDN) such as Akamai or Cloudflare to do it for you. What you get for your money: speedier delivery, caching of static content near the destination, and some protection against malicious traffic. Again the network neutrality rules do not apply to CDNs, even when they are owned and run by companies that also act as telecommunications carriers and ISPs, such as AT&T.

In pointing out that there’s already a lot of money grubbing in the temple of the internet, I don’t mean to suggest that the repeal of net neutrality doesn’t matter or won’t make a difference. It’s a stupid decision. As a consumer, I dread the prospect of buying internet service the way one buys bundles of cable TV channels. As a creator of websites, I fear losing affordable access to readers. As a citizen, I denounce the reckless endangerment of a valuable civic asset. This is nothing but muddy boots trampling a cultural treasure.

Still and all, it could be worse. Most likely it *will* be. Here are three developments that make me uneasy about the future of the web.

**Dominance**. In round numbers, the web has something like a billion sites and four billion users—an extraordinarily close match of producers to consumers. For any other modern medium—television stations and their viewers, newspapers and their readers—the ratio is surely orders of magnitude larger. Yet the ratio for the web is also misleading. Three fourths of those billion web sites have no content and no audience (they are “parked” domain names), and almost all the rest are tiny. Meanwhile, Facebook gets the attention of roughly half of the four billion web users. Google and Facebook together, along with their subsidiaries such as YouTube, account for 70 percent of all internet traffic. The wealth distribution of the web is even more skewed than that of the world economy.

It’s not just the scale of the few large sites that I find intimidating. Facebook in particular seems eager not just to dominate the web but to supplant it. They make an offer to the consumer: We’ll give you a *better* internet, a curated experience; we’ll show you what you want to see and filter out the crap. And they make an offer to the publisher and advertiser: This is where the people are. If you want to reach them, buy a ticket and join the party.

If everyone follows the same trail to the same few destinations, net neutrality is meaningless.

**Fragmentation**. The web is built on open standards and a philosophy of sharing and cooperation. If I put up a public website, anyone can visit without asking my permission; they can use whatever software they please when they read my pages; they can publish links to what I’ve written, which any other web user can then follow. This crosslinked body of literature is now being shattered by the rise of *apps*. Facebook and Twitter and Google and other large internet properties would really prefer that you visit them not on the open web but via their own proprietary software. And no wonder: They can hold you captive in an environment where you can’t wander away to other sites; they can prevent you from blocking advertising or otherwise fiddling with what they feed you; and they can gather more information about you than they could from a generic web browser. The trouble is, when every website requires its own app, there’s no longer a web, just a sheaf of disconnected threads.

This battle seems to be lost already on mobile platforms.

**Suppression**. All of the challenges to the future of the web that I have mentioned so far are driven by the mere pursuit of money. Far scarier are forms of manipulation and discrimination based on noneconomic motives.

Governments have ultimate control over virtually all communications media—radio and TV, newspapers, books, movies, the telephone system, the postal service, and certainly the internet. Nations that we like to think of as enlightened have not hesitated to use that power to shape public discourse or to suppress unpopular or inconvenient opinions, particularly in times of stress. With internet technology, surveillance and censorship are far easier and more efficient than they ever were with earlier media. A number of countries (most notoriously China) have taken full advantage of those capabilities. Others could follow their example. Controls might be introduced overtly through legislation or imposed surreptitiously through hacking or by coercing service providers.

Still another avenue of suppression is inciting popular sentiment—burning down websites with tiki torches. I can’t say I’m sorry to see the Nazi site *Daily Stormer* hounded from the web by public outcry; no one, it seems, will register their domain name or host their content. Historically, however, this kind of intimidation has weighed most heavily on the other end of the political spectrum. It is the labor movement, racial and ethnic and religious minorities, socialists and communists and anarchists, feminists, and the LGBT community who have most often had their speech suppressed. Considering who wields power in Washington just now, a crackdown on “fake news” on the internet is hardly an outlandish possibility.

In spite of all these forebodings, I remain strangely optimistic about the web’s prospects for survival. The internet is a resilient structure, not just in its technological underpinnings but also in its social organization. Over the past 20 years, for many of us, the net has wormed its way into every aspect of daily life. It’s too big to fail now. Even if some basement command center in the White House had a big red switch that shuts down the whole network, no one would dare to throw it.

One of the quirks of life with Dennis was that he didn’t hear well, as a result of childhood ear infections. In an unpublished memoir he lists his deafness as a major influence on his path through life. It was a hardship in school, because he missed much of what his teachers were saying. On the other hand, it kept him out of the military in World War II.

Later in life, hearing aids helped considerably, but only on one side. When we went to lunch, I learned to sit to his right, so that I could speak to the better ear. When we took someone out to lunch, the guest got the favored chair. In our monthly editorial meetings, however, he turned his deaf ear to Gerard Piel, the magazine’s co-founder and publisher. (They didn’t always get along.) In Dennis’s last years, after both of us had left the magazine, we would take long walks through Lower Manhattan, with stops in coffee shops and sojourns on park benches, and again I made sure I was the right-hand man. Dennis died in 2005. I miss him all the time.

Although I was always aware of Dennis’s hearing impairment, I never had an inkling of what his asymmetric sensory experience might feel like from inside his head. Now I have a chance to find out. A few days ago I had a sudden failure of hearing in my left ear. At the time I had no idea what was happening, so I can’t reconstruct an exact chronology, but I think the ear went from normal function to zilch in a matter of seconds or minutes. It was like somebody pulled the plug.

I have since learned that this is a rare phenomenon (5 to 20 cases per 100,000 population) but well-known to the medical community. It has a name: Sudden Sensorineural Hearing Loss. It is a malfunction of the cochlea, the inner-ear transducer between mechanical vibration and neural activity. An audiological exam confirmed that my eardrum and the delicate linkage of tiny bones in the middle ear are functioning normally, but the signal is not getting through to the brain. In most cases of SSNHL, the cause is never identified. I’m under treatment, and there’s a decent chance that at least some level of hearing will be restored.

I don’t often write about matters this personal, and I’m not doing so now to whine about my fate or to elicit sympathy. I want to record what I’m going through because I find it fascinating as well as distressing. A great deal of what we know about the human brain comes from accidents and malfunctions, and now I’m learning some interesting lessons at first hand.

The obvious first-order effect of losing an ear is cutting in half the amplitude of the received acoustic signal. This is perhaps the least disruptive aspect of the impairment, and the easiest to mitigate.

The second major effect is more disturbing: trouble locating the source of a sound. Binaural hearing is key to localization. For low-pitched sounds, with wavelengths greater than the diameter of the head, the brain detects the phase difference between waves reaching the two ears. The phase measurement can yield an angular resolution of just a few degrees. At higher frequencies and shorter wavelengths, the head effectively blocks sound, and so there is a large intensity difference between the two ears, which provides another localizing cue. This mechanism is somewhat less accurate, but you can home in on a source by turning your head to null the intensity difference.
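To put rough numbers on the two regimes, here is a back-of-the-envelope Python sketch. Everything in it is an assumption made for illustration: a speed of sound of 343 meters per second, a head 0.18 meter across, and the crudest possible geometry, in which the extra path to the far ear is the head diameter times the sine of the source angle. The function names are hypothetical, and real heads and real auditory systems are more complicated.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in room-temperature air (assumed value)
HEAD_DIAMETER = 0.18     # m, a rough adult ear-to-ear distance (assumed value)

def wavelength(freq_hz):
    """Wavelength in meters of a sound of the given frequency."""
    return SPEED_OF_SOUND / freq_hz

def interaural_phase_difference(freq_hz, angle_deg):
    """Phase lag in radians between the two ears.

    angle_deg is the source direction measured from straight ahead. The path
    difference is modeled as HEAD_DIAMETER * sin(angle), ignoring the way
    sound actually bends around the head.
    """
    path_difference = HEAD_DIAMETER * math.sin(math.radians(angle_deg))
    return 2 * math.pi * path_difference / wavelength(freq_hz)

# Frequency at which the wavelength equals the assumed head diameter.
crossover_hz = SPEED_OF_SOUND / HEAD_DIAMETER

if __name__ == "__main__":
    print(f"crossover frequency: about {crossover_hz:.0f} Hz")
    for freq in (200, 500, 1000):
        phase = interaural_phase_difference(freq, 5)
        print(f"{freq} Hz, source 5 degrees off center: {phase:.3f} rad")
```

With these assumed values the crossover falls at about 1,900 Hz. Below that frequency the phase difference stays under one full cycle and is unambiguous; above it the phase cue wraps around, which is where the intensity difference has to take over.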

With just one ear, both kinds of directional guidance are lacking. This did not come as a surprise to me, but I had never thought about what it would be like to perceive nonlocalized sounds. You might imagine it would be like switching the audio system from stereophonic to monaural. In that case, you lose the illusion that the strings are on the left side of the stage and the brasses on the right; the whole orchestra is all mixed up in front of you. Nevertheless, in your head you are still localizing the sounds; they are all coming from the speakers across the room. Having one ear is not like that; it’s not just life in mono.

In my present state I can’t identify the sources of many sounds, but they don’t come from nowhere. Some of them come from everywhere. The drone of the refrigerator surrounds me; I hear it radiating from all four walls and the floor and ceiling; it’s as if I’m somehow inside the sound. And one night there was a repetitive thrub-a-dub that puzzled me so much I had to get out of bed and go searching for the cause. The search was essentially a random one: I determined it was not the heating system, and nothing in the kitchen or bathroom. Finally I discovered that the noise was rain pouring off the roof into the gutters and downspouts.

The failures of localization are most disturbing when the apparent source is not vague or unknown but rather quite definite—and wrong! My phone rings, and I reach out to my right to pick it up, but in fact it’s in my shirt pocket. While driving the other day, I heard the whoosh of a car that seemed to be passing me on the right, along the shoulder of the road. I almost veered left to make room. If I had done so, I would have run into the overtaking vehicle, which was of course actually on my left. (Urgent priority: Learn to ignore deceptive directional cues.)

In the first hour or so after this whole episode began, I did not recognize it as a loss of hearing; what I noticed instead was a distracting barrage of echoes. I was chatting with three other people in a room that has always seemed acoustically normal, but words were coming at me from all directions like high-velocity ping-pong balls. The echoes have faded a little in the days since, but I still hear double in some situations. And, interestingly, the echo often seems to be coming from the nonfunctioning ear. I have a hypothesis about what’s going on. Echoes are real, after all; sounds really do bounce off walls, so that the ears receive multiple instances of a sound separated by millisecond delays. Normally, we don’t perceive those echoes. The ears must be sensing them, but some circuitry in the brain is suppressing the perception. (Telephone systems have such circuitry too.) Based on my experience, I suspect that the suppression mechanism depends on the presence of signals from both ears.

Similar to echo suppression is noise suppression. I find I have lost the benefit of the “cocktail party effect,” whereby we select a single voice to attend to and filter out the background chatter. The truth is, I was never very good at that trick, but I’m notably worse now. A possibly related development is that I have the illusion of *enhanced* hearing acuity for some kinds of noise. The sound of water running from a faucet carries all through the house now. And the sound of my own chewing can be thunderous. In the past, perhaps the binaural screening process was turning down the gain on such commonplace distractions.

Even though no sounds of the outside world are reaching me from the left side of my head, that doesn’t mean the ear is silent. It seems to emit a steady hiss, which I’m told is common in this condition. Occasionally, in a very quiet room, I also hear faint chimes of pure sine tones. Do any of these signals actually originate in the affected cochlea, or are they phantoms that the brain merely attributes to that source?

The most curious interior noise is one that I’ve taken to calling the motor. In the still of the night, if I turn my head a certain way, I hear a putt-putt-putt with the rhythm of a sputtering lawn-mower engine, though very faint and voiceless. The intriguing thing is, the sound is altered by my breathing. If I hold my breath for a few seconds, the putt-putting slows and sometimes stops entirely. Then when I take a breath, the motor revs up again. Could this response indicate sensitivity to oxygen levels in the blood reaching my head? I like to imagine that the source of the noise is a single lonely neuron in the cochlea, bravely tapping out its spike train—the last little drummer boy in my left ear. But I wouldn’t be surprised to learn it comes from somewhere higher up in the auditory pathway.

One of the first manuscripts I edited at *Scientific American* (published in October 1973) was an article by the polymath Gerald Oster.

Ordinary beat tones are elementary physics: Whenever two waves combine and interfere, they create a new wave whose frequency is equal to the difference between the two original frequencies. In the case of sound waves at frequencies a few hertz apart, we perceive the beat tone as a throbbing modulation of the sound intensity. Oster asked what happens when the waves are not allowed to combine and interfere but instead are presented separately to the two ears. In certain frequency ranges it turns out that most people still hear the beats; evidently they are generated by some interference process within the auditory networks of the brain. Oster suggested that a likely site is the superior olivary nucleus. There are two of these bodies arrayed symmetrically just to the left and right of the midline in the back of the brain. They both receive signals from both ears.
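The ordinary beat effect mentioned at the start of this paragraph can be made explicit with a standard trigonometric identity. As an illustration (the symbols \(f_1\), \(f_2\), and \(t\) are labels chosen for this sketch), take two tones of equal amplitude at frequencies \(f_1\) and \(f_2\):

\[\sin(2\pi f_1 t) + \sin(2\pi f_2 t) = 2\cos\bigl(\pi (f_1 - f_2)\, t\bigr)\,\sin\bigl(\pi (f_1 + f_2)\, t\bigr).\]

The sine factor is a tone at the average frequency; the cosine factor is a slowly varying envelope. Intensity goes as the square of the envelope,

\[\cos^2\bigl(\pi (f_1 - f_2)\, t\bigr) = \tfrac{1}{2}\Bigl[1 + \cos\bigl(2\pi (f_1 - f_2)\, t\bigr)\Bigr],\]

which rises and falls \(|f_1 - f_2|\) times per second. Tones at 440 and 444 hertz, for example, would throb four times per second.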

Whatever the mechanism generating the binaural beats, it has to be happening somewhere inside the head. It’s a dramatic reminder that perception is not a passive process. We don’t really see and hear the world; we fabricate a model of it based on the sensations we receive—or fail to receive.

I’m hopeful that this little experiment of nature going on inside my cranium will soon end, but if it turns out to be a permanent condition, I’ll cope. As it happens, my listening skills will be put to the test over the next several months, as I’m going to be spending a lot of time in lecture halls. There’s the annual Joint Mathematics Meeting coming up in early January, then I’m spending the rest of the spring semester at the Simons Institute for the Theory of Computing in Berkeley. Lots of talks to attend. You’ll find me in the front of the room, to the left of the speaker.

My years with Dennis Flanagan offer much comfort when I consider the prospect of being half-deaf. His deficit was more severe than mine, and he put up with it from childhood. It never held him back—not from creating one of the world’s great magazines, not from leading several organizations, not from traveling the world, not from spearing a 40-pound bass while free diving in Great South Bay.

One worry I face is music—will I ever be able to enjoy it again?—but Dennis’s example again offers encouragement. We shared a great fondness for Schubert. I can’t know exactly what Dennis was hearing when we listened to a performance of the Trout Quintet together, but he got as much pleasure out of it as I did. And in his sixties he went beyond appreciation to performance. He had wanted to learn the cello, but a musician friend advised him to take up the brass instrument of the same register. He did so, and promptly learned to play a Bach suite for unaccompanied cello on the slide trombone.

Given the present state of life in America, what we really need is an Approximation to Rationality Day, but that may have to wait for 20/1/21. In the meantime, let us merrily fiddle with numbers, searching for ratios of integers that brazenly invade the personal space of famous irrationals.

When I was a teenager, somebody told me about the number 355/113, which is an exceptionally good approximation to *π*. Its value, as represented in double-precision floating point, is

3.141592920353982520964564173482358455657958984375,

correct through the first six digits after the decimal point. In other words, it differs from the true value by less than one-millionth. I was intrigued, and so I set out to find an even better approximation. My search was necessarily a pencil-and-paper affair, since I had no access to any electronic or even mechanical aids to computation. The spiral-bound notebook in which I made my calculations has not survived, and I remember nothing about the outcome of the effort.

A dozen years later I acquired some computing machinery: a Hewlett-Packard programmable calculator, called the HP-41C. Here is the main loop of an HP-41C program that searches for good rational approximations. Note the date at the top of the printout (written in middle-endian format). Apparently I was finishing up this program just before Approximation Day in 1981.

What’s that you say? You’re not fluent in the 30-year-old Hewlett-Packard dialect of reverse Polish notation? All right, here’s a program that does roughly the same thing, written in an oh-so-modern language, Julia.

```
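# Sequential search for fractions n/d that approximate T (assumes T > 1).
function approximate(T, dmax)
    d = 1
    leastError = T                  # any smaller error sets a new record
    while d <= dmax && leastError > 0
        n = Int(round(d * T))       # the only numerator worth trying for this d
        err = abs(T - n/d) / T      # relative error
        merit = 1 / ((n + d)^2 * err)
        if err < leastError         # new record: report it
            println("$n/$d = $(n/d) error = $err merit = $merit")
            leastError = err
        end
        d += 1
    end
end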
```

The algorithm is a naive, sequential search for fractions \(n/d\) that approximate the target number \(T\). For each value of \(d\), you need to consider only one value of \(n\), namely the integer nearest to \(d \times T\). (What happens if \(d \times T\) falls halfway between two integers? That can’t happen if \(T\) is irrational.) Thus you can begin with \(d = 1\) and continue up to a specified largest denominator \(d = dmax\). The accuracy of the approximation is measured by the error term \(|T - n/d| / T\). Whenever a value of \(n/d\) yields a new minimum error, the program prints a line of results. (This version of the algorithm works correctly only for \(T \gt 1\), but it can readily be adapted to \(T \lt 1\).)
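The parenthetical remark about \(T \lt 1\) can be made concrete. What follows is only a sketch of one possible adaptation, under the assumption that the difficulty lies in the starting value `leastError = T`, which for a small target can be lower than the error of the first few candidates. The name `record_approximations` and the choice to return a list instead of printing are inventions for this illustration.

```julia
# Sketch: the same sequential search, but the record-setting fractions are
# collected and returned rather than printed. The running minimum starts at
# Inf so the first candidate always counts as a record (an assumed fix for
# targets smaller than 1). T is assumed to be positive.
function record_approximations(T, dmax)
    records = Tuple{Int,Int,Float64}[]
    leastError = Inf
    for d in 1:dmax
        n = round(Int, d * T)            # nearest integer numerator
        err = abs(T - n / d) / T         # relative error
        if err < leastError
            push!(records, (n, d, err))
            leastError = err
        end
        leastError == 0 && break         # exact hit, within floating point
    end
    return records
end

# Example with a target below 1: the reciprocal of pi.
for (n, d, err) in record_approximations(1 / pi, 400)
    println("$n/$d  error = $err")
end
```

Called with `pi` as the target, the function should retrace the same record-setting fractions as the printing version above.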

The HP-41C has a numerical precision of 10 decimal digits, and so the closest possible approximation to *π* is 3.141592654. Back in 1981 I ran the program until it found a fraction equal to this value—a *perfect* approximation, from the program’s point of view. According to a note on the printout, that took 13 hours. The Julia program above, running on a laptop, completes the same computation in about three milliseconds. You’re welcome to take a scroll through the results, below. (The numbers are not digit-for-digit identical to those generated by the HP-41C because Julia calculates with higher precision, about 16 decimal digits.)

| n/d = value | error | merit |
| --- | --- | --- |
| 3/1 = 3.0 | 0.045070341573315915 | 1.3867212410256813 |
| 13/4 = 3.25 | 0.03450712996224109 | 0.10027514940370374 |
| 16/5 = 3.2 | 0.018591635655129744 | 0.12196741256165179 |
| 19/6 = 3.1666666666666665 | 0.007981306117055373 | 0.20046844169789904 |
| 22/7 = 3.142857142857143 | 0.0004024993041452083 | 2.9541930379680195 |
| 179/57 = 3.1403508771929824 | 0.00039526983405584675 | 0.04542368072920613 |
| 201/64 = 3.140625 | 0.0003080138345651019 | 0.04623150469956595 |
| 223/71 = 3.140845070422535 | 0.00023796324342470652 | 0.04861781754719378 |
| 245/78 = 3.141025641025641 | 0.0001804858353094197 | 0.053107007660473673 |
| 267/85 = 3.1411764705882352 | 0.00013247529441315622 | 0.060922789404334425 |
| 289/92 = 3.141304347826087 | 9.177070539240495e-5 | 0.07506646742266793 |
| 311/99 = 3.1414141414141414 | 5.6822320879624425e-5 | 0.10469195703580983 |
| 333/106 = 3.141509433962264 | 2.6489760736525772e-5 | 0.19588127575835135 |
| 355/113 = 3.1415929203539825 | 8.478310581938076e-8 | 53.85164473263654 |
| 52518/16717 = 3.1415923909792425 | 8.37221074104896e-8 | 0.00249177288308447 |
| 52873/16830 = 3.141592394533571 | 8.259072954625822e-8 | 0.0024921016732136797 |
| 53228/16943 = 3.1415923980404887 | 8.147444291923546e-8 | 0.0024926612882136163 |
| 53583/17056 = 3.141592401500938 | 8.03729477091334e-8 | 0.0024934520351304946 |
| 53938/17169 = 3.1415924049158366 | 7.928595172899531e-8 | 0.0024944743578840687 |
| 54293/17282 = 3.141592408286078 | 7.821317056655376e-8 | 0.0024957288257085445 |
| 54648/17395 = 3.141592411612532 | 7.715432730151448e-8 | 0.002497216134767719 |
| 55003/17508 = 3.1415924148960475 | 7.610915194012454e-8 | 0.0024989371196291283 |
| 55358/17621 = 3.1415924181374497 | 7.507738155653036e-8 | 0.0025008927426067996 |
| 55713/17734 = 3.1415924213375437 | 7.405876001006156e-8 | 0.0025030840968725283 |
| 56068/17847 = 3.1415924244971145 | 7.305303737979925e-8 | 0.002505512419906649 |
| 56423/17960 = 3.1415924276169265 | 7.20599703886498e-8 | 0.002508179074048983 |
| 56778/18073 = 3.141592430697726 | 7.107932141383905e-8 | 0.0025110855755419263 |
| 57133/18186 = 3.14159243374024 | 7.01108591937022e-8 | 0.002514233565685482 |
| 57488/18299 = 3.1415924367451775 | 6.915435783817789e-8 | 0.0025176248413626597 |
| 57843/18412 = 3.1415924397132304 | 6.820959725288218e-8 | 0.0025212613363967255 |
| 58198/18525 = 3.141592442645074 | 6.727636243231866e-8 | 0.002525145143834103 |
| 58553/18638 = 3.141592445541367 | 6.635444374259433e-8 | 0.0025292785028112976 |
| 58908/18751 = 3.141592448402752 | 6.544363663870371e-8 | 0.0025336638062423296 |
| 59263/18864 = 3.141592451229856 | 6.454374152317083e-8 | 0.002538303603848205 |
| 59618/18977 = 3.1415924540232916 | 6.365456332197522e-8 | 0.002543200616913158 |
| 59973/19090 = 3.1415924567836564 | 6.277591190862598e-8 | 0.002548357720152209 |
| 60328/19203 = 3.1415924595115348 | 6.190760125601375e-8 | 0.0025537779743748956 |
| 60683/19316 = 3.1415924622074964 | 6.10494500018427e-8 | 0.0025594646031786867 |
| 61038/19429 = 3.1415924648720983 | 6.020128088319864e-8 | 0.002565421015548036 |
| 61393/19542 = 3.141592467505885 | 5.936292059519092e-8 | 0.0025716508123781218 |
| 61748/19655 = 3.141592470109387 | 5.853420007366852e-8 | 0.0025781577749599853 |
| 62103/19768 = 3.1415924726831244 | 5.771495407114599e-8 | 0.002584945883912429 |
| 62458/19881 = 3.141592475227604 | 5.690502101544554e-8 | 0.002592019327133724 |
| 62813/19994 = 3.141592477743323 | 5.6104242868339024e-8 | 0.0025993825084809985 |
| 63168/20107 = 3.1415924802307655 | 5.531246526690591e-8 | 0.0026070400439016164 |
| 63523/20220 = 3.1415924826904056 | 5.4529537523533324e-8 | 0.0026149967637792084 |
| 63878/20333 = 3.141592485122707 | 5.375531191912607e-8 | 0.002623257749852838 |
| 64233/20446 = 3.141592487528123 | 5.2989644268538606e-8 | 0.0026318283126966317 |
| 64588/20559 = 3.141592489907097 | 5.22323933551431e-8 | 0.0026407140236596287 |
| 64943/20672 = 3.1415924922600618 | 5.148342135490336e-8 | 0.002649920699086574 |
| 65298/20785 = 3.1415924945874427 | 5.0742592988226976e-8 | 0.002659454449139831 |
| 65653/20898 = 3.1415924968896545 | 5.0009776226755164e-8 | 0.0026693216486930156 |
| 66008/21011 = 3.141592499167103 | 4.928484186928889e-8 | 0.002679528965991537 |
| 66363/21124 = 3.1415925014201855 | 4.8567663400430846e-8 | 0.0026900833784673454 |
| 66718/21237 = 3.1415925036492913 | 4.7858116990585446e-8 | 0.0027009921818650063 |
| 67073/21350 = 3.141592505854801 | 4.715608149595883e-8 | 0.0027122629998437182 |
| 67428/21463 = 3.1415925080370872 | 4.6461438175842924e-8 | 0.002723903810648984 |
| 67783/21576 = 3.1415925101965145 | 4.577407111668933e-8 | 0.002735922933992634 |
| 68138/21689 = 3.1415925123334407 | 4.5093866383961494e-8 | 0.0027483290931549346 |
| 68493/21802 = 3.1415925144482157 | 4.442071258756658e-8 | 0.002761131395876878 |
| 68848/21915 = 3.141592516541182 | 4.375450074049751e-8 | 0.002774339356802981 |
| 69203/22028 = 3.1415925186126747 | 4.309512411747499e-8 | 0.0027879629217230834 |
| 69558/22141 = 3.1415925206630235 | 4.244247783087354e-8 | 0.002802012512429091 |
| 69913/22254 = 3.14159252269255 | 4.179645953751142e-8 | 0.0028164989998024 |
| 70268/22367 = 3.1415925247015695 | 4.115696873186072e-8 | 0.0028314337694556623 |
| 70623/22480 = 3.1415925266903915 | 4.0523907028763286e-8 | 0.002846828724926181 |
| 70978/22593 = 3.141592528659319 | 3.989717788071482e-8 | 0.00286269633032941 |
| 71333/22706 = 3.1415925306086496 | 3.9276686719222797e-8 | 0.0028790496258831624 |
| 71688/22819 = 3.1415925325386738 | 3.86623409548065e-8 | 0.0028959022542887716 |
| 72043/22932 = 3.141592534449677 | 3.805404969428105e-8 | 0.0029132685103826087 |
| 72398/23045 = 3.1415925363419395 | 3.7451723882115376e-8 | 0.0029311633622333107 |
| 72753/23158 = 3.1415925382157353 | 3.685527615907423e-8 | 0.002949602495467867 |
| 73108/23271 = 3.1415925400713336 | 3.626462086221821e-8 | 0.002968602349703417 |
| 73463/23384 = 3.1415925419089974 | 3.567967430761971e-8 | 0.002988180133716996 |
| 73818/23497 = 3.141592543728987 | 3.510035365949903e-8 | 0.003008353961046636 |
| 74173/23610 = 3.1415925455315543 | 3.452657862652023e-8 | 0.003029142753805288 |
| 74528/23723 = 3.1415925473169497 | 3.395826962413729e-8 | 0.0030505664465106676 |
| 74883/23836 = 3.141592549085417 | 3.339534904681598e-8 | 0.0030726459300795604 |
| 75238/23949 = 3.1415925508371956 | 3.283774056124397e-8 | 0.003095403169820992 |
| 75593/24062 = 3.141592552572521 | 3.228536938904675e-8 | 0.0031188612412389144 |
| 75948/24175 = 3.1415925542916234 | 3.173816202407169e-8 | 0.0031430444223940115 |
| 76303/24288 = 3.14159255599473 | 3.1196046373746034e-8 | 0.0031679782521683033 |
| 76658/24401 = 3.141592557682062 | 3.065895190043484e-8 | 0.0031936895918127546 |
| 77013/24514 = 3.1415925593538385 | 3.01268089146511e-8 | 0.0032202067806171002 |
| 77368/24627 = 3.141592561010273 | 2.9599549423203633e-8 | 0.003247559639023363 |
| 77723/24740 = 3.1415925626515766 | 2.9077106281049175e-8 | 0.0032757796556622983 |
| 78078/24853 = 3.1415925642779543 | 2.8559414180798277e-8 | 0.0033048999843237645 |
| 78433/24966 = 3.14159256588961 | 2.804640823913544e-8 | 0.003334955716987436 |
| 78788/25079 = 3.1415925674867418 | 2.753802541039899e-8 | 0.0033659838476231357 |
| 79143/25192 = 3.141592569069546 | 2.703420321435919e-8 | 0.003398023556100075 |
| 79498/25305 = 3.1415925706382137 | 2.6534880725724155e-8 | 0.0034311162371627422 |
| 79853/25418 = 3.141592572192934 | 2.6039997867349902e-8 | 0.0034653057466538235 |
| 80208/25531 = 3.141592573733892 | 2.554949569295635e-8 | 0.0035006385417218717 |
| 80563/25644 = 3.14159257526127 | 2.5063316245769302e-8 | 0.0035371638899188347 |
| 80918/25757 = 3.1415925767752455 | 2.4581402841236452e-8 | 0.0035749340371894456 |
| 81273/25870 = 3.1415925782759953 | 2.410369936023742e-8 | 0.003614004535709633 |
| 81628/25983 = 3.1415925797636914 | 2.3630150955873712e-8 | 0.00365443439340209 |
| 81983/26096 = 3.141592581238504 | 2.3160703488036753e-8 | 0.00369628643041249 |
| 82338/26209 = 3.141592582700599 | 2.2695304230197833e-8 | 0.003739627468693587 |
| 82693/26322 = 3.1415925841501404 | 2.2233900879902193e-8 | 0.0037845288174018898 |
| 83048/26435 = 3.1415925855872895 | 2.1776442124200985e-8 | 0.0038310665494126084 |
| 83403/26548 = 3.1415925870122043 | 2.1322877639651253e-8 | 0.00387932189896066 |
| 83758/26661 = 3.1415925884250404 | 2.087315795095796e-8 | 0.003929381726572982 |
| 84113/26774 = 3.1415925898259505 | 2.0427234430973973e-8 | 0.003981339007706688 |
| 84468/26887 = 3.1415925912150855 | 1.9985059017984126e-8 | 0.004035293430477111 |
| 84823/27000 = 3.1415925925925925 | 1.9546584922495102e-8 | 0.004091351857390988 |
| 85178/27113 = 3.1415925939586176 | 1.9111765637729565e-8 | 0.004149629190123568 |
| 85533/27226 = 3.1415925953133033 | 1.868055578777407e-8 | 0.004210248941258058 |
| 85888/27339 = 3.141592596656791 | 1.825291042078912e-8 | 0.004273344214343279 |
| 86243/27452 = 3.1415925979892174 | 1.78287858571571e-8 | 0.004339058439193095 |
| 86598/27565 = 3.1415925993107203 | 1.7408138417260385e-8 | 0.004407546707464268 |
| 86953/27678 = 3.1415926006214323 | 1.6990925835061217e-8 | 0.004478976601684539 |
| 87308/27791 = 3.1415926019214853 | 1.6577106127237806e-8 | 0.004553529781140699 |
| 87663/27904 = 3.1415926032110093 | 1.6166637875900305e-8 | 0.004631403402447433 |
| 88018/28017 = 3.141592604490131 | 1.5759480794022753e-8 | 0.0047128116308472546 |
| 88373/28130 = 3.141592605758976 | 1.5355594877295166e-8 | 0.004797987771392931 |
| 88728/28243 = 3.1415926070176683 | 1.4954940686839493e-8 | 0.0048871863549194705 |
| 89083/28356 = 3.141592608266328 | 1.4557479914641577e-8 | 0.004980685405908598 |
| 89438/28469 = 3.141592609505076 | 1.4163174252687263e-8 | 0.005078789613658918 |
| 89793/28582 = 3.1415926107340284 | 1.3771986523826276e-8 | 0.005181833172630217 |
| 90148/28695 = 3.141592611953302 | 1.338387969226633e-8 | 0.005290183824183623 |
| 90503/28808 = 3.1415926131630103 | 1.2998817570363058e-8 | 0.005404246870669908 |
| 90858/28921 = 3.1415926143632653 | 1.2616764535904027e-8 | 0.005524470210563737 |
| 91213/29034 = 3.141592615554178 | 1.2237685249392783e-8 | 0.005651350205744754 |
| 91568/29147 = 3.1415926167358563 | 1.1861545360838771e-8 | 0.005785438063205309 |
| 91923/29260 = 3.1415926179084073 | 1.1488310802967408e-8 | 0.005927347979056494 |
| 92278/29373 = 3.141592619071937 | 1.111794779122008e-8 | 0.006077766389438445 |
| 92633/29486 = 3.141592620226548 | 1.0750423671902066e-8 | 0.006237462409303776 |
| 92988/29599 = 3.1415926213723435 | 1.0385705649960649e-8 | 0.006407301439430316 |
| 93343/29712 = 3.1415926225094237 | 1.0023761778491034e-8 | 0.00658826005755035 |
| 93698/29825 = 3.1415926236378877 | 9.664560534662385e-9 | 0.006781444748602359 |
| 94053/29938 = 3.1415926247578327 | 9.308070961075804e-9 | 0.006988114128701429 |
| 94408/30051 = 3.1415926258693556 | 8.954262241690382e-9 | 0.007209706348604964 |
| 94763/30164 = 3.14159262697255 | 8.603104549971112e-9 | 0.007447871540046976 |
| 95118/30277 = 3.14159262806751 | 8.254567918024995e-9 | 0.007704513406469473 |
| 95473/30390 = 3.141592629154327 | 7.90862336746494e-9 | 0.007981838717667477 |
| 95828/30503 = 3.1415926302330917 | 7.565241919903853e-9 | 0.008282421184374838 |
| 96183/30616 = 3.1415926313038933 | 7.224395162386583e-9 | 0.008609280341750632 |
| 96538/30729 = 3.1415926323668195 | 6.8860552473899216e-9 | 0.008965982432171553 |
| 96893/30842 = 3.1415926334219573 | 6.550194468748648e-9 | 0.009356770586561815 |
| 97248/30955 = 3.141592634469391 | 6.216785968445456e-9 | 0.009786732283709331 |
| 97603/31068 = 3.1415926355092054 | 5.885802747105052e-9 | 0.010262022067809991 |
| 97958/31181 = 3.1415926365414837 | 5.557218370784088e-9 | 0.010790155391967196 |
| 98313/31294 = 3.1415926375663066 | 5.231007112329143e-9 | 0.011380406450991833 |
| 98668/31407 = 3.1415926385837554 | 4.907143103228812e-9 | 0.012044356667029002 |
| 99023/31520 = 3.1415926395939087 | 4.585601323119603e-9 | 0.01279665696194468 |
| 99378/31633 = 3.141592640596845 | 4.266356751638026e-9 | 0.013656119502875172 |
| 99733/31746 = 3.1415926415926414 | 3.9493849338525334e-9 | 0.014647305857352692 |
| 100088/31859 = 3.141592642581374 | 3.6346615561895634e-9 | 0.015802906908552822 |
| 100443/31972 = 3.1415926435631176 | 3.322162870507497e-9 | 0.017167407267272748 |
| 100798/32085 = 3.1415926445379463 | 3.0118652700227016e-9 | 0.018802933529623964 |
| 101153/32198 = 3.141592645505932 | 2.703745854741474e-9 | 0.020798958087527405 |
| 101508/32311 = 3.1415926464671475 | 2.397781441954139e-9 | 0.02328921472604781 |
| 101863/32424 = 3.141592647421663 | 2.0939496970989362e-9 | 0.026482916558483883 |
| 102218/32537 = 3.1415926483695484 | 1.7922284269720909e-9 | 0.03072676661447583 |
| 102573/32650 = 3.141592649310873 | 1.492595579727815e-9 | 0.03664010548445531 |
| 102928/32763 = 3.141592650245704 | 1.195029527594277e-9 | 0.04544847105306477 |
| 103283/32876 = 3.1415926511741086 | 8.995092082315892e-10 | 0.05996553050516452 |
| 103638/32989 = 3.1415926520961532 | 6.060132765838922e-10 | 0.0883984797913258 |
| 103993/33102 = 3.1415926530119025 | 3.1452123574324146e-10 | 0.16916355170353897 |
| 104348/33215 = 3.141592653921421 | 2.5012447443706518e-11 | 2.1127131430431656 |

The error values in the middle column of the table above shrink steadily as you read from the top of the list to the bottom. Each successive approximation is more accurate than all those above it. Does that also mean each successive approximation is *better* than those above it? I would say no. Any reasonable notion of “better” in this context has to take into account the size of the numerator and the denominator.

If you want an approximation of \(\pi\) accurate to seven digits, I can give you one off the top of my head: \(3141593/1000000\). But the numbers making up that ratio are themselves seven digits long. What makes \(355/113\) impressive is that it achieves seven-digit accuracy with only three digits in the numerator and the denominator. Accordingly, I would argue that a “better” approximation is one that minimizes both error and size. The rightmost column of the table, filled with numbers labeled “merit,” is meant to quantify this intuition.

When I wrote that program in 1981, I chose a strange formula for merit, one that now baffles me:

\[\frac{1}{(n + d)^2 * err}.\]

Adding the numerator and denominator and then squaring the sum is an operation that makes no sense, although the formula as a whole does have the correct qualitative behavior, favoring both smaller errors and smaller values of \(n\) and \(d\). In trying to reconstruct what I had in mind 36 years ago, my best guess is that I was trying to capture a geometric insight, and I flubbed it when translating math into code. On this assumption, the correct figure of merit would be:

\[\frac{1}{\sqrt{n^2 + d^2} * err}.\]

To see where this formula comes from, consider a two-dimensional lattice of integers, with a ray of slope \(\pi\) drawn from the origin and going on to infinite distance.

Because the line’s slope is irrational, it will never pass through any point of the integer lattice, but it will have many near misses. The near-miss points, with coordinates interpreted as numerator and denominator, are the accurate approximations to \(\pi\). The diagram suggests a measure of the merit based on distances. An approximation gets better when we minimize the distance of the lattice point from the origin as well as the vertical distance from the point to the \(\pi\) line. That’s the meaning of the formula with \(\sqrt{n^2 + d^2}\) in the denominator.

Another approach to defining merit simply counts digits. The merit is the ratio of the number of correctly predicted digits in the irrational target \(T\) to the number of digits in the denominator. A problem with this scheme is that it’s rather coarse. For example, \(13/4\) and \(16/5\) both have single-digit denominators and they each get one digit of \(\pi\) correct, but \(16/5\) actually has a smaller error.

To smooth out the digit-counting criterion, and distinguish between values that differ in magnitude but have the same number of digits, we can take logarithms of the numbers. Let merit equal: \(-\log(err) / \log(d)\). (The \(\log(err)\) term is negated because the error is always less than \(1\) and so its logarithm is negative.)
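For readers who want to experiment, the three criteria can be written as one-line Julia functions. This is a sketch, not the code behind any table in this post; the names `relerr`, `merit_1981`, `merit_distance`, and `merit_log` are made up here, and `err` is taken to be the same relative error used in the search program above.

```julia
# Relative error of n/d as an approximation to the target T.
relerr(n, d, T) = abs(T - n / d) / T

# The 1981 formula: 1 / ((n + d)^2 * err).
merit_1981(n, d, T) = 1 / ((n + d)^2 * relerr(n, d, T))

# The distance-based formula: 1 / (sqrt(n^2 + d^2) * err).
merit_distance(n, d, T) = 1 / (sqrt(n^2 + d^2) * relerr(n, d, T))

# The logarithmic formula: -log(err) / log(d).
merit_log(n, d, T) = -log(relerr(n, d, T)) / log(d)

for (n, d) in [(22, 7), (355, 113)]
    println("$n/$d  ", merit_1981(n, d, pi), "  ",
            merit_distance(n, d, pi), "  ", merit_log(n, d, pi))
end
```

The base of the logarithm does not matter, since it cancels in the ratio. For \(d = 1\) the logarithmic merit divides by zero, and Julia returns `Inf`.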

Here’s a comparison of the three merit criteria for some selected approximations to \(\pi\):

| n/d | 1981 merit | distance merit | log merit |
| --- | --- | --- | --- |
| 3/1 | 1.3867212448620723 | 7.016316181613145 | -- |
| 13/4 | 0.10027514901117529 | 2.1306165422053285 | 2.4284808488226544 |
| 16/5 | 0.12196741168912356 | 3.208700907602539 | 2.4760467349663537 |
| 19/6 | 0.20046843839209055 | 6.288264070960828 | 2.6960388788612515 |
| 22/7 | 2.954192079226498 | 107.61458138965322 | 4.017563128080901 |
| 179/57 | 0.04542369572848121 | 13.467303354323912 | 1.9381258641568968 |
| 201/64 | 0.04623152429195394 | 15.390920494844842 | 1.9441196398907357 |
| 223/71 | 0.04861784421796857 | 17.956388291625093 | 1.9573120958787444 |
| 245/78 | 0.05310704607396699 | 21.548988850935377 | 1.9785253787278367 |
| 267/85 | 0.06092284944437125 | 26.93965209372642 | 2.0098618723780515 |
| 289/92 | 0.07506657421887829 | 35.92841360228601 | 2.055872071177696 |
| 311/99 | 0.10469219759604646 | 53.921550739835986 | 2.1273838230139175 |
| 333/106 | 0.1958822412726219 | 108.02438852795403 | 2.259868093766371 |
| 355/113 | 53.76883630752973 | 31610.90993685001 | 3.444107245852723 |
| 52163/16604 | 0.002495514149618044 | 215.57611105028013 | 1.6757260012234105 |
| • • • | • • • | • • • | • • • |
| 103993/33102 | 0.2892417579456485 | 49813.04849576935 | 2.1538978293241056 |
| 104348/33215 | 0.5006051667655171 | 86508.24042805366 | 2.2065386096084607 |
| 208341/66317 | 0.3403602724772912 | 117433.39822796892 | 2.1589243556399245 |
| 312689/99532 | 0.6343809166515098 | 328504.0552596196 | 2.207421489352196 |

All three measures agree that \(22/7\) and \(355/113\) are quite special. In other respects they give quite different views of the data. My weird 1981 formula compares \((n + d)^{-2}\) with \(err^{-1}\); the asymmetry in the exponents suggests the merit will tend to zero as \(n\) and \(d\) increase, at least in the average case. The maximum of the distance-based measure, on the other hand, appears to grow without bound. And the logarithmic merit function seems to be settling on a value near 2.0. This implies that we shouldn’t expect to see many \(n/d \) approximations where the number of correct digits is greater than twice the number of digits in \(d\). The late Tom Apostol and Mamikon A. Mnatsakanian proved a closely related proposition (“Surprisingly accurate rational approximations,” *Mathematics Magazine*, Vol. 75, No. 4 (Oct. 2002), pp. 307-310).

The final joke on my 1981 self is that all this searching for better approximants can be neatly sidestepped by a bit of algorithmic sophistication. The magic phrase is “continued fractions.” The continued fraction for \(\pi\) begins:

\[ \pi = 3+\cfrac{1}{7+\cfrac{1}{15+\cfrac{1}{1+\cfrac{1}{292+\cfrac{1}{1 + \cdots}}}}}\]

Evaluating the successive levels of this expression yields a sequence of “convergents” that should look familiar:

\[3/1, 22/7, 333/106, 355/113, 103993/33102, 104348/33215.\]

It is a series of “best” approximations to \(\pi\), generated without bothering with all the intervening non-“best” values. I produced this list in CoCalc (a.k.a. SageMathCloud), following the excellent tutorial in William Stein’s *Elementary Number Theory*. Even much larger approximants gush forth from the algorithm in milliseconds. Here’s the 100th element of the series:

\[\frac{4170888101980193551139105407396069754167439670144501}{1327634917026642108692848192776111345311909093498260}\]
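The recurrence behind these convergents is short enough to sketch in Julia. This is not the Sage code from Stein’s tutorial; the function name `convergents` is an assumption for illustration, and the partial quotients 3, 7, 15, 1, 292, 1 are simply copied from the continued fraction displayed above rather than computed from \(\pi\).

```julia
# Convergents h/k of a continued fraction [a0; a1, a2, ...], using the
# standard recurrence
#     h_i = a_i * h_{i-1} + h_{i-2},    k_i = a_i * k_{i-1} + k_{i-2}.
# BigInt arithmetic keeps long inputs from overflowing.
function convergents(quotients::AbstractVector{<:Integer})
    h_prev, h = big(0), big(1)     # numerators   h_{-2}, h_{-1}
    k_prev, k = big(1), big(0)     # denominators k_{-2}, k_{-1}
    result = Rational{BigInt}[]
    for a in quotients
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        push!(result, h // k)
    end
    return result
end

println(convergents([3, 7, 15, 1, 292, 1]))
```

Each pass through the loop costs two multiplications and two additions, which is why even very large approximants arrive so quickly once the partial quotients are known.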

A question remains: In what sense are these approximations “best”? It’s guaranteed that every element of the series is more *accurate* than all those that came before, but it’s not clear to me that they also satisfy any sort of compactness criterion. But that’s a question to be taken up another day. Perhaps on Continued Fraction Day.

In a barn, 100 chicks sit peacefully in a circle. Suddenly, each chick randomly pecks the chick immediately to its left or right. What is the expected number of unpecked chicks?

Robitaille took less than a second to buzz in with the correct answer, according to the *Times*.

The next day, Jordan Ellenberg tweeted a followup problem:

Since I don’t have to squeeze this story into 140 characters, I’ll fill in some details of Ellenberg’s question, as I understand it. Where the original problem called for a single round of synchronized random pecking, we now have multiple rounds. During a round, each chick randomly turns either left or right and pecks one of its neighbors. However, once a chick has been pecked, it will never peck again, even if it continues to receive pecks. When two adjacent chicks peck each other in the same round, they both drop out of the pecking game for all future rounds. If an unpecked chick winds up sitting between two pecked neighbors, it can never be pecked and will therefore keep on pecking forever. The question is, what proportion of the flock will survive to become invulnerable peckers?
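The rules as stated here translate into a short simulation. The sketch below is one reading of them, with assumptions worth flagging: all pecks in a round land simultaneously, the flock has at least three chicks, and the process stops once no two neighbors are both unpecked. The function names are hypothetical.

```julia
using Random

# True if some pair of adjacent chicks (around the circle) are both unpecked.
function has_adjacent_unpecked(pecked)
    n = length(pecked)
    return any(i -> !pecked[i] && !pecked[mod1(i + 1, n)], 1:n)
end

# Run rounds of pecking until every unpecked chick sits between two pecked
# neighbors; return the fraction of the flock that is still unpecked.
function surviving_fraction(nchicks::Integer, rng::AbstractRNG)
    pecked = falses(nchicks)
    while has_adjacent_unpecked(pecked)
        after = copy(pecked)                  # pecks land simultaneously
        for i in 1:nchicks
            pecked[i] && continue             # a pecked chick never pecks again
            step = rand(rng, Bool) ? 1 : -1   # +1 = right, -1 = left
            after[mod1(i + step, nchicks)] = true
        end
        copyto!(pecked, after)
    end
    return count(!, pecked) / nchicks
end

println(surviving_fraction(100, MersenneTwister(2017)))
```

Averaging the result over many runs gives an estimate to set beside whatever answer you reach by reasoning.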

Spoilers below, so now’s the time to work out the answers for yourself. While you’re busy with that, I’m going to say a few words about chickens, and about the rhetoric and semiotics of mathematical “word problems.”

My only direct knowledge of poultry comes from boyhood visits to my Aunt Noretta’s farm in southern New Jersey. That’s not much of a claim to expertise, but for what it’s worth I never saw her chickens sit in a circle, and they didn’t peck randomly. (They had a *pecking order*!) Furthermore, nothing I observed in their social interactions resembled the turn-the-other-cheek behavior of the chickens described in this problem. Why does a pecked chick never peck again? This is a bigger riddle than the quantitative question we are asked to address. Has the chick suddenly discovered the wisdom and power of nonviolence? I can think of another explanation, but it’s not for the squeamish: Maybe pecked chicks don’t peck back because pecks are lethal.

I know it’s silly to demand narrative realism in a story like this one. Mathematical word problems belong to a genre where no one expects verisimilitude. They are set in a world where knaves *always* lie and knights *always* speak the truth, where shipwrecked sailors obsess about the divisibility properties of a pile of coconuts, where people don’t know the color of the hat on their own head. Even the laws of physics yield to mathematical necessity: A fly shuttling between oncoming locomotives instantaneously reverses direction. Those chicks sitting in a circle are not fluffy bundles of yellow plumage; they are mathematical abstractions. They have coordinates and state variables rather than feathers.

I’m okay with abstraction; by all means, let us strip away extraneous detail. Nevertheless, isn’t the point of word problems to connect the mathematics to some aspect of familiar experience? Consider the ancient and famous river-crossing problem, where the fox must not be left alone with the chicken, which must not be left alone with the bag of corn. These constraints are easy to understand when you know something about the dietary preferences of foxes and chickens. That kind of intuitive boost is not to be found in the pecking problem. On the contrary, a little knowledge of avian behavior actually makes the problem more perplexing.

But no matter. Onward! Have you come up with your answers?

The single-round problem from the Mathcounts Competition yields to the oldest trick in the probability book. A chick remains unpecked only if both of its neighbors turn away and peck in the other direction. On both the left and the right, the probability of escaping a peck is \(\frac{1}{2}\), and the two events are independent, so the probability of staying unpecked on both sides is \(\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}\). This argument applies identically to all the birds in the circle, so you can expect 25 percent of the chicks to come through unscathed.

Do you agree with this analysis? I came up with it pretty quickly when I read the *Times* article (though not nearly fast enough to beat Luke Robitaille to the buzzer). But then I began to have doubts. Is it strictly true that a chick’s left and right neighbors are totally independent? After all, they are connected by a chain of other chicks. Perhaps some influence can propagate around the circle, creating a correlation between left and right and altering the probability of survival.

Time for an experiment: Write the program, run the simulation. Set up a ring of 100 unpecked chickens and allow a single round of random simultaneous pecking. Repeat many times and calculate the mean number of unpecked birds remaining. (Some quick notation: Let \(N\) be the number of chicks in the ring and \(S\) be the number that survive unpecked. I’ll use \(\bar{S}\) for the mean value of \(S\) averaged over \(R\) repetitions of the experiment.) My results:

| \(R\) | \(\bar{S}\) |
|---|---|
| 100 | 24.79 |
| 10,000 | 24.9881 |
| 1,000,000 | 25.000274 |
| 100,000,000 | 24.99991518 |

As expected, the mean is quite close to 25 survivors. Furthermore, each time the sample size increases by a factor of 100, the accuracy of the approximation improves about tenfold. This pattern conforms to a statistical rule of thumb: the error in a sample mean shrinks in inverse proportion to the square root of the number of samples. Thus the slight departures from \(\bar{S} = 25\) appear to be innocent random noise, not some systematic bias.
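For readers who want to repeat the experiment, here is a minimal Python sketch of a single round of simultaneous pecking. It is not the program that produced the table above; the function names and the seed are illustrative assumptions, and only the standard library is used.

```python
import random

def unpecked_after_one_round(n, rng):
    """Count chicks left unpecked after one round of simultaneous pecking."""
    pecked = [False] * n
    for i in range(n):
        # Chick i pecks its left (-1) or right (+1) neighbor with equal odds.
        step = rng.choice((-1, 1))
        pecked[(i + step) % n] = True
    return pecked.count(False)

def mean_survivors(n, reps, seed=0):
    """Average the survivor count over `reps` independent rings of n chicks."""
    rng = random.Random(seed)
    return sum(unpecked_after_one_round(n, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    for reps in (100, 2_000):
        print(reps, mean_survivors(100, reps))
```

Because every chick's choice is recorded before any survivor is counted, the pecks are effectively simultaneous. Larger values of `reps` should pull the mean closer to 25.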

So that settles it, right?

Well, the simulation looks pretty convincing for the specific case of \(N = 100\) chicks, but the result might differ for other values of *N*. In particular, perhaps there’s some finite-size effect that becomes apparent only when *N* is small. Consider a “circle” of just two chicks. In this situation the left neighbor and the right neighbor are one and the same chicken! No matter what random choices are made, the two chicks immediately peck each other, and the proportion of survivors is not 25 percent but zero.

The next-larger “circle” consists of three chicks arranged in a triangle. The two neighbors of a chick are distinct, but they are also neighbors of each other. What happens when the three chickens are set loose on one another? The system has \( 2^3 = 8\) possible pecking patterns, and we can easily examine all of them. In the diagram, the arrows indicate where the chicks choose to direct their pecking.

In two cases, where all the chicks peck left or all peck right, there are no survivors. In every other instance exactly one chick remains unpecked. Aggregating the eight patterns, we find six unpecked chicks out of 24 total chicks, for a proportion of \(\bar{S} = \frac{1}{4}\). Thus it appears the finite-size anomaly afflicts only the two-chick version of the problem.

But wait! There’s another possible confounding factor. Can we be sure of seeing the same outcome for both even and odd numbers of chicks? For any odd value of *N* there is just one way to annihilate all the chicks in a single round: They must all peck in the same direction. For even *N*, however, another pattern also leads to immediate extinction: Adjacent chicks can pair up, knocking each other out. Won’t this extra pathway slightly alter the overall probability of survival?

Let’s see what happens with *N* = 4. Now there are \(2^4 = 16\) possible outcomes:

As expected, four patterns leave no survivors at all. On the other hand, there are also four patterns that leave two chicks unpecked rather than just one. Miraculously, the extra losses and the extra gains balance exactly. In all we have 16 survivors out of 64 chicks, so the ratio is again \(\bar{S} = \frac{1}{4}\).
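The hand counts for three and four chicks can be checked, and pushed to somewhat larger rings, by brute force. The sketch below enumerates all \(2^N\) pecking patterns and reports the exact surviving fraction as a rational number. The function name is an illustrative assumption, and the enumeration is practical only for small \(N\).

```python
from fractions import Fraction
from itertools import product

def exact_survival_fraction(n):
    """Exact expected fraction unpecked after one simultaneous round."""
    survivors = 0
    # Each pattern assigns every chick a step of -1 (left) or +1 (right).
    for steps in product((-1, 1), repeat=n):
        pecked = [False] * n
        for i, step in enumerate(steps):
            pecked[(i + step) % n] = True
        survivors += pecked.count(False)
    # All 2**n patterns are equally likely, and each holds n chicks.
    return Fraction(survivors, n * 2**n)

if __name__ == "__main__":
    for n in range(2, 11):
        print(n, exact_survival_fraction(n))
```

For \(N = 2\) the result is 0, and for the other ring sizes in this range it is \(\frac{1}{4}\), in agreement with the diagrams.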

After that long and twisty detour through the combinatorics of chicken pecking, we are right back where we started. The probability of surviving unpecked after a single round of pecking is \(\frac{1}{4}\) for any \(N \gt 2\). All of my fretting about finite-size effects and odd-even disparities was a waste of time. So why have I inflicted it on you? Well, although those worries turn out to be unfounded, they are not farfetched. Making just a small change to the pecking protocol leads to a different outcome. Let the pecking be sequential rather than simultaneous. Some designated chick initiates the sequence of pecks, and then the birds take turns, proceeding clockwise around the circle. When a chick’s turn comes, if it has already been pecked, it does nothing. If it is unpecked, it pecks either its left or its right neighbor, choosing randomly. The round ends when every chick has had a turn.

For \(N = 2\) it’s easy to see that the first chick to peck always survives and the other chick always dies, for a survival rate of \(\frac{1}{2}\). With a little more pencil-and-paper chicken scratching, you can establish that the 50 percent survival rate also holds for \(N = 3\). Looking at very large values of \(N\), computer experiments indicate that the survival fraction again approaches \(\frac{1}{2}\) as *N* goes to infinity. Between these extremes, however, there’s some funny business:

At \(N = 4\) the survivor rate dips below 0.47. (The exact probability is \(\frac{30}{64} = 0.46875\).) This is a minimum. But as the rate recovers back toward 0.5, there is some telltale wiggling in the curve that reveals an odd-even bias: The survival probability is depressed further for even *N* than for odd *N*. This is just the kind of behavior I was looking for (but not finding) in the original Mathcounts version of the problem.
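The sequential protocol is just as easy to enumerate exactly for small rings. In this sketch, which is an illustration rather than the code behind the graph, the chicks take their turns in index order, which stands in for "clockwise". A chick that has already been pecked simply leaves its coin flip unused.

```python
from fractions import Fraction
from itertools import product

def sequential_survival_fraction(n):
    """Exact expected fraction unpecked after one sequential round."""
    survivors = 0
    for steps in product((-1, 1), repeat=n):
        pecked = [False] * n
        for i, step in enumerate(steps):
            # A chick pecks only if it is still unpecked when its turn comes.
            if not pecked[i]:
                pecked[(i + step) % n] = True
        survivors += pecked.count(False)
    return Fraction(survivors, n * 2**n)

if __name__ == "__main__":
    for n in range(2, 13):
        frac = sequential_survival_fraction(n)
        print(n, frac, float(frac))
```

For \(N = 4\) the function returns \(\frac{15}{32}\), which is the \(\frac{30}{64}\) quoted above in lowest terms.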

Let us now take up Ellenberg’s problem of iterated pecking (using the simultaneous rather than the sequential protocol). We already know that after the first round we can expect to find about one-fourth of the chicks still unpecked. Clearly, the unpecked fraction cannot increase after multiple rounds. Thus in the final state the expected surviving fraction \(\bar{S}\) must lie somewhere between zero and \(\frac{1}{4}\).

It’s helpful to look at a typical configuration of pecked (●) and unpecked (○) chicks after a single round of synchronized pecking:

`●○●●●○○●●●●○●●○●●○○●●●○○●●●●●○●●●●●●●●●○○●●●●●●○○●●●○●●●●●●○●●●○○●●●●●`

(You’ll have to use your imagination to connect the left and right ends of this array and thereby form a ring.) Notice that there are long strings of pecked chicks, but the unpecked chicks appear in only two configurations. They are either singletons (●○●) or pairs (●○○●). The cause of this pattern is not hard to understand. After a round of pecking, a group of three consecutive unpecked chicks (●○○○●) is impossible. The middle chick must have pecked either left or right, and so it cannot have two unpecked neighbors.

These constraints simplify the analysis of subsequent rounds. The singletons are essentially immortal and unchangeable: The unpecked chick in the middle can never be pecked, and the pecked neighbors can never be unpecked. For the pairs, there are four possible fates, corresponding to the four ways the two active chicks could choose to peck:

In any one round, all four of these events have the same probability, namely \(\frac{1}{4}\). The first three result states are *terminal*, in the sense that further rounds of pecking will leave them unchanged. In the fourth case we are left with an adjacent pair again, which will therefore face the same set of choices in the next round. Eventually, as the number of rounds goes to infinity, the fourth case must yield one of the other outcomes, and thus in the long run we can consider the fourth case to have probability zero and each of the other three cases to have probability \(\frac{1}{3}\).
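One way to see where the \(\frac{1}{3}\) comes from is to sum over the number of rounds in which the pair stays intact. This is a sketch of the arithmetic, not a step quoted from the analysis above. A given terminal outcome arrives in round \(k + 1\) only if the fourth case recurred in each of the \(k\) rounds before it, so

\[P(\text{a given terminal outcome}) = \sum_{k=0}^{\infty} \left(\frac{1}{4}\right)^{k} \cdot \frac{1}{4} = \frac{1/4}{1 - 1/4} = \frac{1}{3}.\]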

And now it’s time to bring all these contingent events together and work out a chicken’s long-term probability of survival. The diagram below presents the scheme. In the first round of pecking, three-fourths of the chicks are eliminated immediately. Of the remaining one-fourth, half are singletons, which survive indefinitely. The other surviving chicks are members of pairs, with another pecking chick as either a right neighbor or a left neighbor.

The lower part of the diagram summarizes the effect of all subsequent rounds, which are assumed to continue until all pairs have been either annihilated or reduced to singletons. (I call this *pecking to completion*.) For each pathway that leads to a surviving singleton, the probability is the product of the individual probabilities encountered along that pathway. There are three such pathways, with probabilities \(\frac{1}{8}, \frac{1}{48}\), and \(\frac{1}{48}\), for a sum of \(\frac{1}{6}\).
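Spelled out, the arithmetic behind those three pathways looks like this. The split of the pair members into two pathways of equal weight is an assumption about how the diagram is laid out. The factor \(\frac{1}{3}\) is the long-run chance that a given member of a pair ends up as the lone survivor.

\[\underbrace{\frac{1}{4} \cdot \frac{1}{2}}_{\text{singleton}} + \underbrace{\frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{3}}_{\text{pair, partner on the left}} + \underbrace{\frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{3}}_{\text{pair, partner on the right}} = \frac{1}{8} + \frac{1}{48} + \frac{1}{48} = \frac{1}{6}.\]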

I have to confess that I did not come up with this analysis—or with the correct answer—on my first try. I was able to work it out only after I had run a simulation and thus knew what I was looking for. Even then I had trouble with double counting.

Here are the simulation results:

| \(R\) | \(\bar{S}\) |
|---|---|
| 100 | 16.53 |
| 10,000 | 16.6835 |
| 1,000,000 | 16.664404 |
| 100,000,000 | 16.66701664 |

Again note that accuracy seems to improve as the square root of the sample size, although the variance here is larger than in the single-round experiment.
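A simulation of pecking to completion needs only a loop around the single-round step. The following Python sketch is one possible version, with hypothetical function names; it is not the program that generated the table. It stops as soon as every unpecked chick sits between two pecked neighbors, since nothing can change after that.

```python
import random

def survivors_at_completion(n, rng):
    """Run simultaneous pecking rounds until no chick can still be pecked."""
    pecked = [False] * n
    while True:
        # Only unpecked chicks peck. All targets are chosen from the state at
        # the start of the round, so the pecks land simultaneously.
        targets = [(i + rng.choice((-1, 1))) % n
                   for i in range(n) if not pecked[i]]
        for t in targets:
            pecked[t] = True
        # Finished when every unpecked chick has two pecked neighbors.
        if all(pecked[i] or (pecked[i - 1] and pecked[(i + 1) % n])
               for i in range(n)):
            return pecked.count(False)

def mean_survivors_at_completion(n, reps, seed=0):
    """Average the final survivor count over `reps` independent rings."""
    rng = random.Random(seed)
    return sum(survivors_at_completion(n, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    print(mean_survivors_at_completion(100, 2_000))
```

With 100 chicks the mean should settle near \(\frac{100}{6} \approx 16.67\) as the number of repetitions grows.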

What about finite-size effects? In circles with only two or three members, the fate of the chicks is fully decided after a single round of pecking: \(\bar{S}\) is 0 and \(\frac{1}{4}\) respectively. Thus these smallest rings escape the \(\frac{1}{6}\) rule, but it appears that circles of all larger sizes converge to \(\frac{1}{6}\). There’s no evidence of even-odd discrepancies.

Another approach to understanding the iterated chicken-pecking problem is through the theory of Markov chains. For a ring of \(N\) chicks we list all \(2^N\) states of the flock and assign a probability to each transition between states. Consider a ring of four chicks, which has 16 states. Symmetries allow us to consolidate some sets of states, and other states can be ignored because they are unreachable from the starting state of four unpecked chicks (○○○○).

Only the four states in the red box need to be retained in the model. The transitions between them are recorded in a directed graph, where each arrow is labeled with the corresponding probability. Note that the starting state has only outgoing arrows; there is no way to re-enter the state once you leave. The states with one survivor (●●●○) and with no survivors (●●●●) are *absorbing*: The only outgoing arrow leads directly back to the same state; thus, once you reach one of those states, you never escape it.

The essential information from the directed graph can be captured in a \(4 \times 4\) matrix, where the rows and columns are labeled with the four states, and the matrix entries represent the probability of a transition from the row state to the column state. The entries in each row sum to 1, as they must if they are to represent probabilities.

The pattern of zero entries in the transition matrix implies that certain states can’t be reached from other states, even by an indirect route. For this reason the Markov model is said to be *irregular*. That’s a bit awkward, because regular Markov models are easier to analyze and understand. In a regular model, when you take successive powers of the transition matrix, it converges to a steady state, where all the rows are identical and every column consists of a single, repeated value. This fixed point reveals the system’s long-term probability distribution. An irregular Markov model may not even have a stable limiting distribution, but this one does, and it seems to offer some insight. Every ring of four chickens must wind up in one of the two absorbing states. With probability two-thirds that terminal state will be the one with a single survivor (●●●○), and with probability one-third the one with no survivors (●●●●). This result is consistent with the finding that one-sixth of the chickens survive unpecked, since an average of \(\frac{2}{3}\) of a survivor among four chicks is a fraction of \(\frac{1}{6}\).
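The limiting behavior can be checked by raising a transition matrix to a high power. The matrix in the sketch below is a reconstruction, not a copy of the one in the figure. It assumes the four retained states are ○○○○, ●●○○, ●●●○ and ●●●●. The first row comes from the \(N = 4\) tally above (four patterns with no survivors, four with an adjacent pair, eight with a single survivor), and the second row from the four equally likely fates of a pair.

```python
from fractions import Fraction as F

# Assumed state order: start, adjacent pair, one survivor, no survivors.
STATES = ("○○○○", "●●○○", "●●●○", "●●●●")
P = [
    [F(0), F(1, 4), F(1, 2), F(1, 4)],  # from the start state
    [F(0), F(1, 4), F(1, 2), F(1, 4)],  # from an adjacent unpecked pair
    [F(0), F(0), F(1), F(0)],           # one survivor: absorbing
    [F(0), F(0), F(0), F(1)],           # no survivors: absorbing
]

def mat_mul(a, b):
    """Multiply two square matrices of Fractions."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

def distribution_after(rounds):
    """State probabilities after `rounds` rounds, starting from ○○○○."""
    return mat_pow(P, rounds)[0]

if __name__ == "__main__":
    for state, prob in zip(STATES, distribution_after(20)):
        print(state, float(prob))
```

After 20 rounds the probability of still holding an unresolved pair is \(\left(\frac{1}{4}\right)^{20}\), so the first row is already very close to the limiting split of two-thirds and one-third between the absorbing states.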

So, finally, that wraps it up, right? Both the contest problem and Ellenberg’s iterative extension asked for the expected number of surviving chickens, and we have supplied the answers: for a circle of \(N\) chickens, the expected number of survivors \(\bar{S}\) is \(\frac{N}{4}\) after a single round of pecking and \(\frac{N}{6}\) upon pecking to completion. Ironically, though, the expected value of a probabilistic process doesn’t necessarily tell you what to expect. Consider a simpler problem: When you flip a fair coin 100 times, how many heads do you expect to see? The obvious answer is 50, and it’s correct in the sense that no other number has a higher likelihood of correctly predicting the outcome of the experiment. However, the probability of seeing *exactly* 50 heads is only about 0.08, and thus some other number will turn up more than 90 percent of the time.
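The coin-flip figure is easy to confirm with the binomial coefficient from Python's standard library. This is a small check; the function name is an illustrative choice.

```python
from math import comb

def prob_exact_heads(flips, heads):
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    return comb(flips, heads) / 2**flips

if __name__ == "__main__":
    print(prob_exact_heads(100, 50))  # about 0.0796
```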

Instead of looking only at the expected value, let’s examine the range of possible \(S\) values in the pecking game. We’ve already established that zero survivors is a possible outcome, so that forms a lower bound. What is the upper bound—the maximum number of survivors? In the single-round process, every chick pecks, and so after that round every chick must have at least one pecked neighbor. On the basis of this fact I claim that the surviving population can never be greater than \(\frac{N}{2}\). (Do you agree? It took me a while to persuade myself it’s true.)

If \(S\) can never be greater than \(\frac{N}{2}\), the next question is whether it can ever attain that bound. And if we *can* have equal numbers of pecked (●) and unpecked (○) chicks, how are they arranged in the ring? It’s tempting to propose the following configuration:

`●○●○●○●○●○●○`

This is a stable state: The unpecked chicks can never be pecked, so no further changes are possible. And the fraction of survivors is \(\frac{1}{2}\). But there’s a problem with this pattern: It cannot be reached from the starting state. Look at any of the black pecked chicks and ask yourself: Which of its neighbors did it peck? Neither of them, evidently, since they are both unpecked. But that’s not possible, given that every chicken must peck in the first round.

Although the alternating black and white arrangement is ruled out, we’re on the right track. There’s another configuration that also leaves one-half of the chicks unpecked after a single round, and that pattern *is* achievable from the starting state:

`●●○○●●○○●●○○`

When you join the ends to form a ring, every chick, whether pecked or not, has one pecked neighbor. It turns out this is the only way, after allowing for some obvious symmetries, to reach 50 percent survivorship. (Strictly speaking, 50 percent is attainable only when \(N\) is divisible by 4, but the largest attainable \(S\) is never less than \(\frac{N-2}{2}\).)

When the pecking continues to completion, the upper bound of \(S = \frac{N}{2}\) is no longer reachable. Suppose we tried to maintain \(\frac{N}{2}\) over multiple rounds of pecking. Clearly we would have to start in the first round with the maximal-survivor state `●●○○●●○○●●○○`. However, at least half of the unpecked chicks in this configuration must succumb in subsequent rounds, leaving no more than \(\frac{N}{4}\) survivors.

Does this argument mean that \(S = \frac{N}{4}\) is the greatest possible after pecking to completion? No, it doesn’t. There’s another pattern where one of every three chicks survives:

`●●○●●○●●○●●○`

This configuration is reachable in a single round and stable indefinitely, since none of the pecking chicks has any pecking neighbors. No other arrangement has a higher density of survivors once the pecking process goes to completion.
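Both upper bounds can be checked by exhaustive search on small rings. The sketch below enumerates every first-round pattern. For the completed process it relies on the observation made earlier that unpecked chicks occur only as singletons or pairs after the first round. Each such run can keep at most one survivor, and each of them might keep one, so the best possible final count is the largest number of unpecked runs. The function names are illustrative.

```python
from itertools import product

def first_round_states(n):
    """Yield the pecked/unpecked pattern produced by each of the 2**n rounds."""
    for steps in product((-1, 1), repeat=n):
        pecked = [False] * n
        for i, step in enumerate(steps):
            pecked[(i + step) % n] = True
        yield pecked

def max_after_one_round(n):
    """Largest number of unpecked chicks any single round can leave."""
    return max(state.count(False) for state in first_round_states(n))

def max_after_completion(n):
    """Largest possible final survivor count: one per run of unpecked chicks."""
    def runs(state):
        # A run begins at an unpecked chick whose left neighbor is pecked.
        return sum(1 for i in range(n) if not state[i] and state[i - 1])
    return max(runs(state) for state in first_round_states(n))

if __name__ == "__main__":
    for n in (6, 8, 9, 12):
        print(n, max_after_one_round(n), max_after_completion(n))
```

For \(N = 12\) the two maxima are 6 and 4, matching \(\frac{N}{2}\) and \(\frac{N}{3}\).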

To summarize: After one round of pecking the number of surviving chicks must lie somewhere between zero and \(\frac{N}{2}\), and the expected number \(\bar{S}\) is right in the middle at \(\frac{N}{4}\). After all further rounds of pecking are completed, the count of unpecked chicks is between zero and \(\frac{N}{3}\), with the expected value again in the middle, at \(\bar{S} = \frac{N}{6}\).

“How many chickens survive?” is a question that seems to call for a numeric answer, but in truth the most informative response is not a number at all; it is a *distribution*:

Each curve records the results of a million experiments with a ring of 100 chicks, giving the frequency of each possible value of \(S\). As expected, the one-round distribution has a peak at 25 survivors, and the iterated curve peaks at 17 (the closest integer to \(\frac{100}{6}\)). Note that the red curve is not only shifted to the left but is also slightly taller and narrower.

To get a better view of the details, let’s zoom in. For the sake of smoother curves, I’m going to switch to experiments with \(N = 10{,}000\) chickens. First the green single-round curve, then the red one for the iterated pecking experiment:

With the larger value of \(N\), the curves now peak at 2500 and at 1666.67—exactly the positions expected for \(\frac{N}{4}\) and \(\frac{N}{6}\). Finding the peaks at these positions is no surprise, but what governs the width and the overall shape of the curve? In other words, what is the mathematical nature of the distributions?

One guess that’s always worth a try is the normal (or Gaussian) distribution. For the pecking problem, a normal distribution defines \(P(S)\), the probability of observing \(S\) survivors, as follows:

\[P(S) = \frac{1}{\sigma\sqrt{2 \pi}} \exp\left[-\frac{1}{2}\left(\frac{S - \mu}{\sigma}\right)^2\right].\]

That’s a pretty messy equation for such a familiar concept, but it’s possible to tease out the basic meaning. The equation defines a symmetric curve with a peak where \(S\) is equal to \(\mu\), the mean of the distribution. The width of the peak depends on \(\sigma\), the standard deviation. Because the area under the curve is a constant, \(\sigma\) also effectively determines the height: A narrower peak has to be taller.

We can fit a normal distribution to the pecking data using a procedure that finds the optimal values of \(\mu\) and \(\sigma\)—those that minimize the discrepancy between the data points and the mathematical model. In the two graphs below the fitted models are superimposed on the two data plots, first for one round of pecking and then for pecking to completion:

The fits appear to be quite close indeed, with the theoretical curves splitting the experimental ones from end to end. In some sense this result has to be counted a success, and yet I don’t find this approach to the problem fully satisfying. The normal curve provides a very good *descriptive* model of the pecking process, but not a *predictive* or *explanatory* one. Remember, the curve is fitted to the data, not the other way around. I see no obvious way to construct a specific normal distribution from what I know about the underlying interactions of pecking chickens. In particular, where do the values of \(\sigma\) in the two models come from? Why is \(\sigma \approx 25\) in the one-round model and \(\sigma \approx 23.6\) in the iterated model? These values look like free parameters, which we have to tune to suit the data. Moreover, they will differ for every value of \(N\). Another issue: the normal curve is a *continuous* distribution, defined over the entire real number line. The pecking function is discrete; it makes sense only for integer numbers of chickens.

Let’s set aside the normal curve and consider another plausible model: the binomial distribution, which is discrete, and which turns up in many probabilistic contexts. Suppose you roll 10,000 dice and count how many of them come to rest with a 1 showing on the upper face. When you repeat this experiment many times, the expected number of 1s is one-sixth of 10,000, the same as the expected number of survivors in the iterated chicken-pecking experiment. With dice, there’s a well-known mathematical expression that defines not just the expected value but also the form of the entire distribution. Assume that every die has probability \(p\) of showing a \(1\). We are going to roll \(N\) dice and we want to know the probability of seeing \(k\) \(1\)s for any \(k\) between \(0\) and \(N\). The formula that supplies this information is:

\[P(k) = {N \choose k} p^k (1 - p)^{N - k}.\]

Here \(p^k (1 - p)^{N - k}\) gives the probability of any specific arrangement of \(k\) \(1\)s among \(N\) dice. The binomial coefficient \(N \choose k\), equal to \(\frac{N!}{k!\,(N-k)!}\), counts the number of such arrangements.
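Evaluating this formula directly for thousands of dice runs into trouble, because \(p^k\) underflows ordinary floating-point arithmetic long before \(k\) reaches the interesting range. A common workaround, sketched here as one possible approach, is to add logarithms and exponentiate only at the end. The function name is an assumption, and the code requires \(0 < p < 1\).

```python
from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    """P(k successes in n trials), computed in log space; needs 0 < p < 1."""
    # log of the binomial coefficient n! / (k! (n - k)!)
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

if __name__ == "__main__":
    # Height of the dice curve near its peak, for 10,000 dice.
    print(binomial_pmf(1667, 10_000, 1 / 6))
```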

With \(N = 10000\) and \(p = \frac{1}{6}\) we get a curve showing the outcome of the dice-rolling experiment mentioned above. Perhaps the same curve also describes what happens to the iterated pecking model, which has the same expected value? Alas no.

The binomial curve is wider and flatter than the distribution of iterated pecking survivors. What has gone wrong? When I first saw the graph, I had an inkling. As noted above, the binomial coefficient \(N \choose k\) counts all the ways of choosing \(k\) items from a set of size \(N\). This is appropriate for an experiment with dice, since all possible arrangements of \(k\) successes among \(N\) trials are equally likely. In particular, when you roll \(10{,}000\) dice, you could conceivably see no \(1\)s at all, or all \(10{,}000\) dice could land with a \(1\) showing face up; the entire range of outcomes has probability greater than zero.

The pecking problem is different. It’s not possible for 100 percent of the chickens to remain unpecked. Thus only a subset of the \(N \choose k\) arrangements are attainable. If the binomial distribution is going to work in this context, we need to adjust it somehow to include only the feasible outcomes.

With the thought that it’s easier to solve a problem if you already know the answer, I tried fiddling with the parameters of the distribution to see how the graph responded. My goal was to squeeze the curve into a narrower and taller profile while keeping it centered at the same mean. The mean is equal to \(Np\), so if we decrease \(N\) we have to increase \(p\) by the same factor. Here are the results of some experiments:

The dark green curve is the one we’ve already seen, for a binomial distribution with \(N = 10000\) and \(p = \frac{1}{6}\). Going to \(N = 5000\) and \(p = \frac{1}{3}\) appears to be a step in the right direction, and \(N = 3333\) and \(p = \frac{1}{2}\) is even better. Then, with \(N = 2500\) and \(p = \frac{2}{3} \ldots\) Bingo! The yellow curve is an excellent match to the pecking data. Thus it appears we can predict the survivorship of an \(N\)-member pecking ring by constructing a binomial distribution with parameters \(N^\prime = \frac{N}{4}\) and \(p^\prime = 4p\).

I can pull the same trick to find a binomial distribution that matches the single-round pecking data. This time the magic numbers that bend the curve to the correct trajectory are \(N^\prime = \frac{N}{3} = 3333\) and \(p^\prime = 3p = \frac{3}{4}\).

Unlike the normal distribution, the binomial model is constructive, or predictive. From the two parameters \(N^\prime\) and \(p^\prime\) we can calculate both the mean of the distribution and the standard deviation. The mean is simply \(N^\prime p^\prime\); the standard deviation is \(\sqrt{N^\prime p^\prime (1 - p^\prime)}\). For the example of the \(10{,}000\) chickens pecking to completion, the mean \(\mu\) works out to \(1{,}666.666 \dots\) (as expected), and the standard deviation \(\sigma\) is \(23.570226\). (The fitted normal distribution had \(\sigma = 23.567337\).) For the single-round case, \(\mu\) is exactly \(2500\) and \(\sigma\) is \(25\). (To avoid roundoff errors, I am taking \(N\) to be \(9999\) instead of \(10{,}000\).)
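These two formulas take only a few lines of Python to evaluate. The sketch below, with an illustrative function name, also computes the spread of the plain dice model for comparison, which puts a number on how much wider that curve is.

```python
from math import sqrt

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial distribution."""
    return n * p, sqrt(n * p * (1 - p))

if __name__ == "__main__":
    # Adjusted model for pecking to completion: sd is about 23.57.
    print(binomial_mean_sd(2500, 2 / 3))
    # Adjusted model for a single round: sd is roughly 25.
    print(binomial_mean_sd(3333, 3 / 4))
    # Unadjusted dice model with the same mean: sd is about 37.27.
    print(binomial_mean_sd(10_000, 1 / 6))
```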

Hooray, eh? At last we have a formula for calculating the shape and location of the chicken-pecking distribution, based on a few simple parameters, \(N^\prime\) and \(p^\prime\). But I’m still grumpy, indeed more perplexed and frustrated than ever. Maybe the model explains the data, but what explains the model? With \(10{,}000\) chickens and a first-round survivor probability of \(\frac{1}{4}\), why does the formula call for \(N^\prime = 3333\) and \(p^\prime = \frac{3}{4}\)? Where do those numbers come from? And why \(N^\prime = 2500\) and \(p^\prime = \frac{2}{3}\) for the iterated case?

I am embarrassed to admit how long I have spent helplessly flailing and thrashing in the bogs of probability theory, trying to solve these mysteries. (I even turned to a recent book called *The Probability Lifesaver*, which I highly recommend—but it didn’t save my life.) In the search for answers I have investigated the multinomial extensions of binomials. I have looked into convolutions of distributions and computed contingent probabilities. I have filled whole pads of scratch-paper with soldierly rows of ●s and ○s, searching for patterns that would explain those enigmatic fractions \(\frac{N}{3}\) paired with \(\frac{3}{4}\), and \(\frac{N}{4}\) with \(\frac{2}{3}\). Night after night I’ve gone to bed with a promising idea, only to awaken and recognize a fatal flaw.

Now I believe I *do* have a correct explanation. It has passed the overnight test several nights in a row. I’m going to reveal it, but not until the end of this essay. Perhaps you’ll figure it out on your own before then. In the meantime, I’m going to widen the horizons of the chicken problem.

Our cozy circle of chickens is a one-dimensional structure. You can go clockwise or counterclockwise around the ring; there are no other meaningful directions in this little universe. Now suppose that instead of getting all our chickens in a row, we arrange them in a grid, an array of columns and rows, covering a region of a two-dimensional surface. To avoid leaving a subset of chickens on the exposed edges of a rectangular array, we can mate the left edge with the right edge and the top edge with the bottom edge. (Topologically, this turns the rectangle into a torus.) Getting real chickens to cooperate in this experiment would be even harder than in the one-dimensional version, but no matter; we’ve long since lost all touch with barnyard reality.

The most important fact about the two-dimensional flock is that each chicken has four neighbors instead of two. With twice as many hostile neighbors, one might well guess that a chicken would be more vulnerable to a pecking attack. On the other hand, each of those neighbors spreads its pecking over twice as many potential targets. How do these competing effects balance out?

For a single round of pecking, we can calculate the survival probability in the same way we did for the one-dimensional system. A chick remains unpecked only if *all* of its neighbors turn elsewhere to peck. Each neighbor does so with probability \(\frac{3}{4}\), and so the probability that all of them turn away is \(\left(\frac{3}{4}\right)^4\). Numerically, this works out to about 0.3164, compared with 0.25 in the circle. Thus the fraction surviving is greater in two dimensions than in one; the distraction of having more targets outweighs the danger of having more attackers. The distribution observed in computer experiments confirms this finding.

Here’s what a \(40 \times 40\) lattice of chicks looks like after a single round of pecking.

There are 1,600 chicks in the two-dimensional array. If you count the unpecked ○s, you’ll find there are 501, for a survival fraction of 0.3131, close to the theoretical value of 0.3164. Simulations confirm the expected survival rate of \(\left(\frac{3}{4}\right)^4\) for \(N \times N\) lattices with any value of \(N\) greater than \(2\). (For the \(2 \times 2\) grid, the survival rate is \(\frac{1}{4}\), as in the one-dimensional system. There’s a reason why!)
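A single round on the torus can be simulated with the same few lines as the ring, using modular arithmetic in both directions to join the edges. This is a minimal sketch with assumed names, not the code that drew the lattice above.

```python
import random

# Row and column offsets of the four lattice neighbors.
NEIGHBOR_STEPS = ((-1, 0), (1, 0), (0, -1), (0, 1))

def torus_unpecked_fraction(side, rng):
    """Fraction unpecked after one simultaneous round on a side x side torus."""
    pecked = [[False] * side for _ in range(side)]
    for r in range(side):
        for c in range(side):
            # Each chick pecks one of its four neighbors, chosen uniformly.
            dr, dc = rng.choice(NEIGHBOR_STEPS)
            pecked[(r + dr) % side][(c + dc) % side] = True
    unpecked = sum(row.count(False) for row in pecked)
    return unpecked / side**2

def mean_torus_fraction(side, reps, seed=0):
    """Average the unpecked fraction over `reps` independent lattices."""
    rng = random.Random(seed)
    return sum(torus_unpecked_fraction(side, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    print(mean_torus_fraction(40, 50), (3 / 4) ** 4)
```

On a \(2 \times 2\) grid the wraparound makes the "up" and "down" neighbors the same chick, and likewise "left" and "right", so each chick has only two distinct neighbors, each of which pecks it with probability \(\frac{1}{2}\).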

When I stare at the pattern above, I notice a certain stringy or loopy texture, with chains of ○s separating blobs of ●s. This might be a trick of the eye and mind, but I think not. In two dimensions the no-three-in-a-row restriction is lifted; the array includes rows and columns with as many as six consecutive unpecked chicks, as well as diagonal lines. But you will not see a solid \(3 \times 3\) block of unpecked chicks, or even a \(3 \times 3\) cross. Such patterns cannot exist because the chick in the middle of the block must have pecked one of its four neighbors. More generally, the system is still bound by the rule that every chick, whether pecked or unpecked, must have at least one pecked neighbor.

Since more chicks survive the first round of pecking in a two-dimensional world, it seems plausible there might also be a greater proportion of survivors when the pecking continues to completion. Let’s try the experiment:

In this \(40 \times 40\) array there are 238 survivors out of 1,600 chicks, which is *less* than the one-sixth survival rate seen in one dimension. In a sample of a million such pecking grids, I found that the mean survival rate \(\bar{S}\) is about 0.1533. Compare the distributions for one- and two-dimensional systems:

In going from 1D to 2D the peak shifts to the left, with the mean moving from 0.1667 to 0.1533. The 2D hump is also a little taller and skinnier, thus showing reduced variance.

Why stop at two dimensions? Let us ask our ever-accommodating chickens to roost in a three-dimensional lattice, again with opposite boundaries joined to create the 3D equivalent of a toroidal surface. It’s not hard to guess how this experiment is going to turn out. Back in one dimension, where every chick had two neighbors, the fraction of survivors after a single round of pecking was \(\left(\frac{1}{2}\right)^2 = \frac{1}{4}\). In two dimensions, with four neighbors, the corresponding number was \(\left(\frac{3}{4}\right)^4 = \frac{81}{256}\). In the three-dimensional pecking party each chick has six neighbors, so the obvious extrapolation is \(\left(\frac{5}{6}\right)^6 = \frac{15625}{46656}\), with a value of \(\approx 0.3349\). Running the simulation supports this surmise, and shows a clear trend when we construct chicken lattices with still higher numbers of dimensions.

From this series of results we can boldly generalize: When every chick has \(n\) neighbors, the fraction expected to survive a single round of pecking is:

\[\left(\frac{n - 1}{n}\right)^n.\]

As \(n\) increases, this expression converges on a value of approximately \(0.36787944\). Does that number look familiar? It is \(\frac{1}{e}\). (Changing the minus sign to a plus generates \(e\) itself, \(2.71828\).) When I stumbled upon this formula, the sudden appearance of \(\frac{1}{e}\) took me by surprise, but it shouldn’t have. The constant turns up in the same way in a model of rumor spreading that I wrote about some years ago.
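The approach to \(\frac{1}{e}\) is slow but steady, as a few evaluations show. The helper below is an illustrative sketch that simply evaluates the formula for a given number of neighbors.

```python
from math import e

def one_round_survival(neighbors):
    """Expected fraction unpecked when every chick has `neighbors` neighbors."""
    return ((neighbors - 1) / neighbors) ** neighbors

if __name__ == "__main__":
    for n in (2, 4, 6, 8, 14, 100, 10_000, 1_000_000):
        print(n, one_round_survival(n))
    print("1/e", 1 / e)
```

The values for 2, 4 and 6 neighbors reproduce the one-, two- and three-dimensional results quoted earlier.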

What about the iterated pecking process in higher dimensions? The fraction of survivors shows a steady decline as the number of dimensions increases:

The proportion of chicks that never get pecked falls from 16.7 percent in one dimension to about half that when we embed our intrepid chickens in seven-dimensional space. In other words, a higher-dimensional space raises the initial survival rate (after one round of pecking), but depresses long-term survival (after pecking to completion). Here’s another way of showing the effect of dimension—tracking the mean number of survivors remaining after each round of pecking in one dimension through seven dimensions.

I can offer a rough, hand-wavy rationale for this trend. If you are a chick in a one-dimensional ring, your chance of surviving the first round of pecking is only \(\frac{1}{4}\), but if you make it through that round, your chance of avoiding a peck in the second round is at least \(\frac{1}{2}\). Why the improvement? It’s because of your own actions: Your pecking in the first round eliminated the threat from one of your two neighbors. Your odds continue improving in subsequent rounds: The longer you last, the greater the chance that you will hang on until all your neighbors are pacified.

The same trend holds in higher dimensions, but the magnitude of the effect tapers off. In four dimensions, for example, you have eight neighbors, and your chance of surviving the first round is \(\left(\frac{7}{8}\right)^8\), or about 0.34. Because you peck one of those neighbors, your probability of making it through the second round is better, but only slightly so: \(\left(\frac{7}{8}\right)^7\), or 0.39.
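For the one-dimensional ring, the process behind this argument can be simulated in a few lines. The sketch below rests on my reading of the rules: in each round every unpecked chick pecks its left or right neighbor at random, all pecks land at once, and a pecked chick never pecks again. The function name is hypothetical.

```julia
using Random

# Ring of n chicks (n >= 3); true = still unpecked, and therefore still pecking.
# Assumed rules: simultaneous pecks, random left/right choice each round,
# pecked chicks stop pecking.
function peck_to_completion(n::Integer)
    unpecked = trues(n)
    # Continue while some pair of adjacent chicks is still unpecked.
    while any(unpecked[i] && unpecked[mod1(i + 1, n)] for i in 1:n)
        hit = falses(n)
        for i in 1:n
            unpecked[i] || continue
            target = rand(Bool) ? mod1(i + 1, n) : mod1(i - 1, n)
            hit[target] = true
        end
        unpecked .&= .!hit      # everyone pecked this round drops out
    end
    return unpecked
end

flock = peck_to_completion(10_000)
println("survivors: ", count(flock), " of ", length(flock))
```

Under these assumptions a chick that survives round one has already silenced one neighbor, and the other neighbor is silenced with probability \(\frac{1}{2}\) by its own far neighbor, which is where the improved odds come from.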

Looking at the graphs above, one might surmise that as the dimension \(D\) goes to infinity, the number of survivors (after pecking to completion) will drop to zero. To explore this idea, we don’t actually need infinite-dimensional space. What matters most is not the geometric arrangement of the chickens but the number of neighbors, and we can approximate an infinite-dimensional lattice just by declaring that all chickens are nearest neighbors. In other words, the who-pecks-whom graph becomes *complete*, with an arc from every chick to every other chick. This does seem to be a recipe for annihilation; you can’t be safe as long as even one other chicken continues to peck. But the details of the end game allow a little room for variation. Will there be one survivor or none?

Peter Winkler discusses a similar problem, “Group Russian Roulette,” in *Mathematical Puzzles: A Connoisseur’s Collection* (p. 33). The actors in his version are not chickens but “armed and angry people,” who engage in rounds of simultaneously shooting random neighbors. Winkler observes that the probability of a survivor does not approach a limit as \(N\) increases. I don’t see this effect in the chicken problem: There is almost always a last chicken standing. What makes the difference, I believe, is that Winkler’s roulette players don’t waste their ammunition on players who have already been shot, whereas the chickens continue to peck at neighbors who don’t peck back.

Finally, I return to the narrow confines of one dimension and to the mysterious binomial distributions that seem to predict the statistics of chicken pecking in this system. To review: If you roll 10,000 dice and count those that show a \(1\), you can expect to find about 1667. If you put 10,000 chicks in a circle and wait until all the pecking is done, you can expect about 1667 unpecked survivors. The dice experiment is described by a binomial distribution with parameters \(N = 10000\) and \(p = \frac{1}{6}\). The same model doesn’t work for the chickens: The predicted distribution is much broader than the observed one. But that’s not the weird part. The real puzzler is why a different binomial model, with parameters \(N' = 2500\) and \(p' = \frac{2}{3}\), does seem to match the experimental results.

The dice model’s failure to work for chicken pecking is not really a surprise. A key assumption underlying the binomial distribution is that the events or objects being counted are independent. That’s true for dice; one die doesn’t care what the others do. But the circle of pecking chickens is all about interactions between neighbors. If you have already been pecked, that alters the odds that your neighbors will eventually be pecked. Independence enters the binomial distribution through the coefficient \(N \choose k\). Given \(N\) dice with \(k\) of them showing \(1\)s, all possible interleavings of the \(1\)s among the other dice are equally likely; the binomial coefficient counts those arrangements. But given \(N\) chicks with \(k\) of them unpecked, it’s not true that all arrangements are equally likely. Indeed, many patterns, such as ○○○, are impossible.

If neighbor interactions spoil the binomial model with \(N = 10{,}000\) and \(p = \frac{1}{6}\), how are those interactions overcome in the model with \(N' = 2500\) and \(p' = \frac{2}{3}\)? For the longest time I was beguiled by the observation that 2500 is the expected number of survivors after a single round of pecking, and two-thirds of those individuals can be expected to survive all subsequent rounds. Surely, having those two numbers turn up in the binomial distribution cannot be a meaningless coincidence. Maybe not, but I was able to make sense of the situation only when I gave up on that line of inquiry.

What’s needed is a model in which we count the arrangements of 2500 objects, where two-thirds of the objects can be considered successes or survivors. I have found such a model. The objects are not individual chickens. They are groups of four chickens. Consider this set of 4-tuples:

*a* = ○●●●    *b* = ●○●●    *c* = ●●●●

If you select elements from this set at random and string them together, any sequence you create could be an output of the iterated pecking process. A typical result looks like this:

`●●●●○●●●●○●●○●●●●●●●○●●●●○●●●●●●●○●●○●●●●○●●●●●●○●●●●○●●●○●●○●●●●●●●○●●`

Note that this sequence satisfies all the rules for a flock of chickens that has pecked to completion. All unpecked ○s are singletons, surrounded by pecked neighbors. At least two ●s separate every pair of ○s, and this ensures that every element of the sequence has at least one ● neighbor. There is no way of concatenating any selection of *a, b,* and *c* elements that violates these rules. Furthermore, if *a, b,* and *c* are chosen with equal probability, the expected proportion of ○s in the sequence is \(\frac{1}{6}\).
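The construction is easy to try out. In this Julia sketch the names `BLOCKS`, `random_sequence`, and `well_separated` are illustrative, and the sequence is treated as a line rather than a ring.

```julia
using Random

# true = unpecked survivor (○), false = pecked chick (●)
const BLOCKS = [
    [true,  false, false, false],   # a = ○●●●
    [false, true,  false, false],   # b = ●○●●
    [false, false, false, false],   # c = ●●●●
]

# Concatenate `nblocks` blocks chosen uniformly at random.
random_sequence(nblocks) = reduce(vcat, [rand(BLOCKS) for _ in 1:nblocks])

# Every pair of survivors must be separated by at least two pecked chicks,
# so consecutive survivor positions differ by 3 or more.
well_separated(seq) = all(diff(findall(seq)) .>= 3)

seq = random_sequence(2500)
println(join(s ? '○' : '●' for s in seq[1:60]))
println("valid: ", well_separated(seq),
        ", fraction of ○: ", count(seq) / length(seq))
```

Because each block holds at most one ○, a sequence built from 2500 blocks can never contain more than 2500 survivors, however the blocks are chosen.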

I am deeply ambivalent about this discovery. On the one hand, it’s always a relief to get to the bottom of a problem that has stumped you. On the other hand, what we have here is a recipe for creating a sequence with the same structure and statistics as the product of the pecking process, but it offers no insight into the nature of that process. There’s no connection with the behavior of the chickens. Worse, it’s not even a true or exact model. Although the curve appears to coincide with the data, it’s only an approximation. The proof of this fact is simple. The binomial distribution with \(N' = 2500\) and \(p' = \frac{2}{3}\) has an absolute cutoff at \(2500\). For any number of survivors greater than \(2500\), the model assigns a probability of zero. Yet the flock of \(10{,}000\) pecking chickens can in fact leave up to \(3333\) survivors.

The defect becomes visible in a smaller model, such as this one with \(N = 24\):

The predicted and observed curves exhibit slight mismatches everywhere, but pay particular attention to the right tail of the distribution, where the binomial curve *(purple)* dives to zero for all survivor numbers greater than six, whereas the experimental data *(red)* include 6718 instances with seven survivors and 49 instances with eight survivors.

A similar model for the one-round pecking process uses a set of four 3-tuples:

*a* = ○●●    *b* = ●○●    *c* = ●●○    *d* = ●●●

Again it generates a sequence that looks very much like the outcome of a pecking experiment, but fails to reproduce the tail of the distribution. In the model the highest possible density of survivors is \(\frac{1}{3}\) whereas it should be \(\frac{1}{2}\).

Perhaps you’re thinking that a cute high school problem about chicks pecking their neighbors doesn’t really merit an 8,000-word screed on Markov chains and probability distributions, with tables and equations and 25 graphs and diagrams. That thought has crossed my mind, too. However, I want to add just a few more words to argue that the exercise is not totally frivolous.

Mathematics does not owe us a tidy, closed-form, one-line solution to every problem, but we’d be foolish to give up the quest too easily. In this case, computer simulations are easy and productive. By running a program for five minutes I can get answers to a multitude of detailed questions, and I don’t have serious doubts about the correctness of those answers. But they don’t help me make the connection between the microscopic mechanisms (a chicken pecks left or right at random) and macroscopic observations (the distribution has \(\mu = \frac{1}{6}\) and \(\sigma = 23.56\)). Richard Hamming’s old chestnut says the purpose of computing is insight, not numbers, but insight is just what I’m missing.

Second, this is not really a problem about chickens, whether real or abstract. It is a gateway to a collection of other many-body problems in statistical physics and dynamical systems and cellular automata.

Finally, I’ve had fun, and what’s the harm in that? Maybe the fun’s not over. What about zombie chickens, whose pecks bring other chickens back to life?

**Update 2017-07-11:** Carl Witty has worked out the correct probability distribution for the single-round case. See his comment below.

With pencil and paper it’s easy to show that \(6!\) *doesn’t* work. The factorial of \(6\) is \(1 \times 2 \times 3 \times 4 \times 5 \times 6 = 720\); adding \(1\) brings us to \(721\), which is not a square. (It factors as \(7 \times 103\).) On the other hand, \(7!\) is \(5040\), and adding \(1\) yields \(5041\), which is equal to \(71^2\). This makes for a very cute equation:

\[7! + 1 = 71^2.\]

Continuing on, you can establish that \(8! + 1\), \(9! +1\) and \(10! + 1\) are not square numbers. But to extend the search much further, we need mechanized assistance. Here’s a Julia function that does the obvious thing, generating successive factorials and checking each one to see if it is \(1\) less than a perfect square:

```
function search_fac_sqr(maxn)
    fac = big(1)                    # bigints needed for n > 20
    for n in 1:maxn
        fac *= n                    # incremental factorial
        r = isqrt(fac + 1)          # floor of sqrt
        if r * r == fac + 1
            println(n, "! + 1 = ", r, "^2 = ", r^2)
        end
    end
    println("That's all folks!")
end
```

With this tool in hand, let’s check out \(n! + 1\) for all \(n\) between \(1\) and \(100\). Here’s what the program reports:

```
search_fac_sqr(100)
4! + 1 = 5^2 = 25
5! + 1 = 11^2 = 121
7! + 1 = 71^2 = 5041
That's all folks!
```

Those are the three cases we’ve already discovered with pencil and paper—and no more are listed. In other words, among all values of \(n! + 1\) up to \(n = 100\), only \(n = 4\), \(n = 5\), and \(n = 7\) yield squares. When I continued the search up to \(n = 1{,}000\), I got exactly the same result: no more squares. Likewise \(n = 10{,}000\) and \(n = 100{,}000\). Allow me to mention that the factorial of \(100{,}000\) is a rather large number, with \(456{,}574\) decimal digits. At this point in the search, I began to grow weary; furthermore, I began to lose hope. When \(99{,}993\) successive values of \(n\) fail to produce a single square, it’s hard to sustain faith that success might be just around the corner. Nevertheless, I persisted. I got as far as \(n = 500{,}000\), whose factorial has \(2{,}632{,}341\) decimal digits. Not one more perfect square in the whole lot.

What can we learn from this evidence—or lack of evidence? Are 4, 5, and 7 the only values of \(n\) for which \(n!\) lies \(1\) short of a perfect square? Or are there more such cases somewhere out there along the number line, maybe just beyond my reach, waiting to be found? Could there be infinitely many? If so, where are they? If not, why not?

To my taste, the most satisfying way to resolve these questions would be to find some number-theoretical principle ensuring that \(n! + 1 \ne m^2\) for \(n \gt 7\). I have not discovered any such principle, but in a dreamy sort of way I can imagine what a proof might look like. Suppose we eliminate the “\(+1\)” part of the formula, and search for integers such that \(n! = m^2\). It turns out there is just one solution to this equation, with \(n = m = 1\). You needn’t bother lathering up your laptop in the quest for larger examples; there’s a simple proof they don’t exist. In any square number, all the prime factors must be present an even number of times, as in \(36 = 2 \times 2 \times 3\times 3\). In a factorial, at least one prime factor—the largest one—always appears just once. (If you’re not sure why, check out Bertrand’s postulate/Chebyshev’s theorem.)

Of course when we put the “\(+1\)” back into the formula, this whole line of reasoning falls to pieces. In general, the factorization of \(n!\) and of \(n! + 1\) are totally different. But maybe there’s some other property of \(n! + 1\) that conflicts with squareness. It might have something to do with congruence classes, or quadratic residues. From the definition of a factorial, we know that \(n!\) is divisible by all positive integers less than or equal to \(n\), which means that \(n! + 1\) *cannot* be divisible by any of those numbers (except \(1\)). This observation rules out certain kinds of squares, namely those that have small primes in their factorization. But for all \(n \gt 4\) the square root of \(n!\) greatly exceeds \(n\), so there’s plenty of room for larger factors, as in the case of \(7! + 1 = 71^2\).

Here’s another avenue that might be worth exploring. The decimal representation of any large factorial ends with a string of \(0\)s, formed as the products of \(5\)s and \(2\)s among the factors of the number. Thus \(n! + 1\) must look like

\[XXXXX \ldots XXXXX00000 \ldots 00001,\]

where \(X\) represents any decimal digit, and the trailing sequence of \(0\)s now ends with a single terminal \(1\). Can we figure out a way to prove that a number of this form is never a square? Well, if the final digit were one that no square can end in (\(2, 3, 7,\) or \(8\)), the proof would be easy, but lots of squares end in \(\ldots 01\), such as \(10{,}201 = 101^2\) and \(62{,}001 = 249^2\). If there’s some algebraic argument along these lines showing that \(n! + 1\) can’t be a square, it will have to be something subtler.

All of the above is make-believe mathematics. I have stirred up some ingredients that look like they might make a tasty confection, but I have no idea how to bake the cake. Perhaps someone else will supply the recipe. In the meantime, I want to entertain an alternative hypothesis: that nothing prevents \(n! + 1\) from being a square except improbability.

The pattern observed in the \(n! + 1 = m^2\) problem—a few matches among the smallest elements of the sequences, and then nothing more for many thousands of terms—is not unique to factorials and squares. Other pairs of sequences exhibit similar behavior. For example, I have tried matching factorials with triangular numbers. The triangulars, beginning \(1, 3, 6, 10, 15, 21, \ldots\), are defined by the formula \(T(m) = m(m + 1)/2\). If we look for factorials that are also triangular, we get \(1! = T(1) = 1\), then \(3! = T(3) = 6\), and finally \(5! = T(15) = 120\). No more examples appear through \(n = 100{,}000\).

What about factorials that are \(1\) less than a triangular, satisfying the equation \(n! + 1 = T(m)\)? I know of only one case: \(2! + 1 = 3\). Broadening the search a little, I found that \(n! + 4\) is triangular for \(n \in \{2, 3, 4\}\), again with no more hits up to \(100{,}000\).
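The triangular searches need only a small change to the square-hunting function. A number \(x\) is triangular exactly when \(8x + 1\) is a perfect square, since \(8\,T(m) + 1 = (2m + 1)^2\). The following sketch uses that test; the names `is_triangular` and `search_fac_tri` are illustrative, and the small search limit is chosen only to keep the run short.

```julia
# x is triangular exactly when 8x + 1 is a perfect square,
# because T(m) = m(m + 1)/2 implies 8T(m) + 1 = (2m + 1)^2.
function is_triangular(x::Integer)
    y = 8x + 1
    r = isqrt(y)
    return r * r == y
end

# All n in 1:maxn for which n! + k is triangular.
function search_fac_tri(maxn, k)
    hits = Int[]
    fac = big(1)
    for n in 1:maxn
        fac *= n                                  # incremental factorial
        is_triangular(fac + k) && push!(hits, n)
    end
    return hits
end

for k in (0, 1, 4)
    println("k = ", k, ": n = ", search_fac_tri(200, k))
end
```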

For another experiment we can bring back the square numbers and swap out the factorials, replacing them with the ever-popular Fibonacci sequence, \(1, 1, 2, 3, 5, 8, 13, \ldots\), defined by the recurrence \(F(n) = F(n - 1) + F(n - 2)\), with \(F(1) = F(2) = 1\). It’s been known since the 1960s that \(1\) and \(144\) are the only positive integers that are both Fibonacci numbers and perfect squares. Looking for Fibonacci numbers that are \(1\) less than a square, I found that \(F(4) + 1 = 4\) and \(F(6) + 1 = 9\), with no other instances up to \(F(500{,}000)\).
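Here is a sketch of the Fibonacci version of the search, again in Julia with BigInt arithmetic. The function names are mine, and the limit of 1,000 terms is just for a quick run.

```julia
is_square(x::Integer) = (r = isqrt(x); r * r == x)

# Indices n in 1:maxn for which F(n) + k is a perfect square,
# with F(1) = F(2) = 1.
function search_fib_sqr(maxn, k)
    hits = Int[]
    a, b = big(1), big(1)          # a = F(n), b = F(n + 1)
    for n in 1:maxn
        is_square(a + k) && push!(hits, n)
        a, b = b, a + b
    end
    return hits
end

println("F(n) square:     n = ", search_fib_sqr(1000, 0))
println("F(n) + 1 square: n = ", search_fib_sqr(1000, 1))
```

The first line reports three indices rather than two values because \(F(1)\) and \(F(2)\) are both equal to \(1\); the third is \(F(12) = 144\).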

We can do the same sort of thing with the Catalan numbers, \(1, 1, 2, 5, 14, 42, 132 \ldots\), another sequence with a huge fan club. I find no squares other than \(1\) among the Catalan numbers up to \(n = 100{,}000\); I don’t know if anyone has proved that none exist. A search for cases where \(C(n) + 1 = m^2\) also comes up empty, but there are a few low-lying matches for \(C(n) + k = m^2\) for \(k \in \{2, 3, 4\}\).

Finding similar behavior in all of these diverse sequences changes the complexion of the problem, in my view. If we discover some obscure, special property of \(n! + 1\) that explains why it never lands on a square (for large values of \(n\)), do we then have to invent another mechanism for Fibonacci numbers and still another for Catalan numbers? Isn’t it more plausible that some single, generic cause lies behind all the observations?

But the cause can’t be *too* generic. It’s not the case that you can take any two numeric sequences and expect to see the same kind of pattern in their intersections. Consider the factorials and the prime numbers. By the very nature of a factorial, none of them except 2! = 2 can possibly be prime, but there’s no obvious reason that \(n! + 1\) can’t be a prime. And, indeed, for \(n \le 100\) nine values of \(n! + 1\) are prime. Extending the search to \(n \le 1000\) turns up another seven. Here is the full set of known numbers for which \(n! + 1\) is prime:

\[1, 2, 3, 11, 27, 37, 41, 73, 77, 116, 154, 320, 340, 399, 427, 872, 1477, \\ 6380, 26951, 110059, 150209\]

They get rare as \(n\) increases, but there’s no hint of a sharp cutoff, as there is in the other cases explored above. Does the sequence continue indefinitely? That seems a reasonable conjecture. (For more on this sequence, including references, see Chris K. Caldwell’s factorial prime page.)

My question is this: Can we understand these curious patterns in terms of mere chance coincidence? The values of \(n! + 1\) form an infinite sequence of integers spread over the number line, dense near the origin but becoming extremely sparse as \(n\) increases. The values of \(m^2\) form another infinite sequence, again with diminishing density, although the dropoff is not as steep. Maybe factorials bump into squares among the smallest integers because there just aren’t enough of those integers to go around, and some of them have to do double duty. But in the vast open spaces out in the farther reaches of the number line, a factorial can wander around for years—maybe forever—and not meet a square.

Let me try to state this idea more precisely. Since \(n!\) cannot be a square, we know that it must lie somewhere between two square numbers; the arrangement on the number line is \((m - 1)^2 \lt n! \lt m^2\). The distance between the end points of this interval is \(m^2 - (m - 1)^2 = 2m - 1\). Now choose a number \(k\) at random from the interval, and ask whether \(n! + k = m^2\). Exactly one value of \(k\) must satisfy this condition, and so the probability of success is \(1/(2m - 1)\), or roughly \(1 / (2 \sqrt{n!})\). Because \(\sqrt{n!}\) increases very rapidly, this probability takes a nosedive toward zero as \(n\) increases. It is represented by the red curve in the graph below. Note that by \(n = 100\) the red curve has already reached \(10^{-80}\).
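To put numbers on this estimate, the gap can be computed exactly with integer square roots. This is a minimal sketch; `collision_probability` is a name I made up, and \(m\) is taken to be the smallest integer whose square exceeds \(n!\).

```julia
# Estimated chance that n! + k equals m^2 when k is drawn at random from the
# 2m - 1 integers in the gap between consecutive squares that contains n!.
function collision_probability(n::Integer)
    m = isqrt(factorial(big(n))) + 1      # smallest m with m^2 > n!
    return 1 / BigFloat(2m - 1)           # gap length is 2m - 1
end

for n in (4, 5, 7, 20, 100)
    println("n = ", n, ": ", Float64(collision_probability(n)))
end
```

For \(n = 4\) the gap runs from \(16\) to \(25\), so the estimate is \(\frac{1}{9}\); for \(n = 7\) it is \(\frac{1}{141}\).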

The green curve gives the probability of a collision between Fibonacci numbers and squares; the shape is similar, though it dives off the precipice a little later. The Fibonacci-square curve approximates a negative exponential: Since \(F(n)\) grows like \(\phi^n\), the probability is roughly proportional to \(\phi^{-n/2}\), where \(\phi = (\sqrt{5} + 1) / 2 \approx 1.618\). The factorial-square curve is even steeper because the factorial function is *superexponential*: \(n!\) grows faster than \(c^n\) for any fixed \(c\).

The blue curve, recording the probability of coincidences between factorials and primes, has a very different shape. In the neighborhood of \(n!\) the average distance between consecutive primes is approximately \(\log n!\), which grows just a little faster than \(n\) itself and very much slower than \(n!\). The probability of collision between factorials and primes is roughly \(1 / \log n!\). The continuous blue curve corresponds to this smooth approximation. The blue dots sprinkled near that line give the probability based on actual distances between consecutive primes.

What to make of those curves? Is it legitimate to apply probability theory to these totally deterministic sequences of numbers? I’m not quite sure. Before confronting the question directly, I’d like to retreat a few steps and look at a simpler model where probability is clearly entitled to a seat at the table.

Let us borrow one of Jacob Bernoulli’s famous urns, which have room to hold an infinite number of ping pong balls. Start with one black ball and one white ball in the urn, then reach in and take a ball at random. Clearly, the probability of choosing black is \(1/2\). Put the chosen ball back in the urn, and also add another white ball. Now there are three balls and only one is black, so the probability of drawing black is \(1/3\). Add a fourth ball, and the probability of black falls to \(1/4\). Continuing in this way, the probability of black on the \(n\)th draw must be \(\frac{1}{n + 1}\).

If we go on with this protocol forever—always choosing a ball at random, putting it back, and adding an extra white ball—what is the probability of eventually seeing the black ball at least once? It’s easier to answer the complement of this question, calculating the probability of *never* seeing the black ball. This is the infinite product \(\frac{1}{2} \times \frac{2}{3} \times\frac{3}{4} \times\frac{4}{5} \ldots\), or:

\[P(\textrm{never black}) = \prod_{n = 1}^{\infty} \left(1 - \frac{1}{n+1}\right)\]

The product goes to zero as \(n\) goes to infinity. In other words, in an endless series of trials, the probability of never drawing black is \(0\), which means the probability of seeing black at least once must be \(1\). (“Probability \(1\)” is not exactly the same thing as “certain,” but it’s mighty close.)
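The limit can be read off directly, because the partial products telescope. Writing \(N\) for the number of draws made so far (a symbol introduced here only for this step):

\[\prod_{n = 1}^{N} \left(1 - \frac{1}{n+1}\right) = \prod_{n = 1}^{N} \frac{n}{n+1} = \frac{1}{2} \times \frac{2}{3} \times \cdots \times \frac{N}{N+1} = \frac{1}{N+1} \longrightarrow 0 \quad \textrm{as } N \to \infty.\]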

Now let’s try a different experiment. Again start with one black ball and one white ball, but after the first draw-and-replace cycle add two white balls, then four white balls, and so on, so that the total number of balls in the urn at stage \(n\) is \(2^n\); throughout the process all of the balls but one are white. Now the probability of never seeing the black ball is \(\frac{1}{2} \times \frac{3}{4} \times\frac{7}{8} \times\frac{15}{16} \ldots\), or:

\[P(\textrm{never black}) = \prod_{n = 1}^{\infty} \left(1 - \frac{1}{2^n}\right)\]

This product does *not* go to zero, no matter how large \(n\) becomes. Neither does it go to \(1\). The product converges to a constant with the approximate value \(0.288788095\). Strange, isn’t it? Even in an infinite series of draws from the urn, you can’t be sure whether the black ball will turn up or not.
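The convergence is fast enough to watch in ordinary floating-point arithmetic. A small Julia sketch, with an illustrative function name:

```julia
# Partial products of (1 - 1/2^n). The factors approach 1 so quickly that
# 60 terms already exhaust double precision.
never_black(nterms) = prod(1 - 2.0^(-n) for n in 1:nterms)

for nterms in (1, 2, 5, 10, 20, 60)
    println(rpad(nterms, 4), never_black(nterms))
end
```

The first two rows are \(\frac{1}{2}\) and \(\frac{3}{8}\); after that each new factor removes less and less, and the product levels off instead of draining away.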

These two urn experiments do not correspond directly to any of the sequence coincidence problems described above; they simply illustrate a range of possible outcomes. But we can rig up an urn process that mimics the probabilistic treatment of the factorials-and-squares problem. At the \(n\)th stage, the urn holds \(1 + 2 \sqrt{n!}\) balls, only one of which is black. The probability of never seeing the black ball, even in an infinite series of trials, is

\[\prod_{n = 1}^{\infty} \left(1 - \frac{1}{1 + 2 \sqrt{n!}}\right).\]

This expression converges to a value of approximately \(0.2921426977\). It follows that the probability of seeing black at least once is \(1 - 0.2921426977\), or \(0.7078573023\). (No, that number is not \(1/\sqrt{2}\), although it’s close.)
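The text does not say exactly how this value was computed, so the sketch below makes an assumption: at stage \(n\) the urn holds \(2m - 1\) balls, where \(m\) is the smallest integer with \(m^2 > n!\), which is the exact gap that \(1 + 2\sqrt{n!}\) approximates. The early terms dominate the product, so this choice matters. Under that assumption the product settles near \(0.2921\). The function name is hypothetical.

```julia
# Probability of never drawing black in the factorial urn, using the exact
# number of integers between consecutive squares: 2m - 1 with m = isqrt(n!) + 1.
# (Assumption: the smooth formula 1 + 2*sqrt(n!) stands in for this exact gap.)
function never_black_factorial(nterms)
    p = 1.0
    fac = big(1)
    for n in 1:nterms
        fac *= n                               # incremental factorial
        m = isqrt(fac) + 1                     # smallest m with m^2 > n!
        p *= 1 - Float64(1 / BigFloat(2m - 1))
    end
    return p
end

println(never_black_factorial(40))
```

Forty terms are far more than needed; by \(n = 20\) the individual factors already differ from \(1\) by less than one part in a billion.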

An urn process resembling the factorials-and-primes problem gives a somewhat different result. Here the number of balls in the urn at stage \(n\) is \(\log n!\), again with just one black ball. The infinite product governing the cumulative probability is

\[\prod_{n = 2}^{\infty} \left(1 - \frac{1}{\log n!}\right).\]

On numerical evidence this expression seems to dwindle away to zero as \(n\) goes to infinity (although I’m not \(100\) percent sure of that). If it *does* go to \(0\), then the complementary probability that the black ball will eventually appear must be \(1\).

Some of these results leave me feeling befuddled, and even a little grumpy. Call me old-fashioned, but I always thought that rolling the dice infinitely many times ought to be enough to settle beyond doubt whether a pattern appears or not. In the harsh light of eternity, I would have said, everything is either forbidden or mandatory; as \(n\) goes to infinity, probability goes to \(0\) or it goes to \(1\). But apparently that’s not so. In the factorial urn model the probability of never seeing a black ball is neither \(0\) nor \(1\) but lies somewhere in the neighborhood of \(0.2921426977\). What does that mean, exactly? How am I supposed to verify the number, or even check its first few digits? Running an infinite series of trials is not enough; you need to collect a statistically significant sample of infinite experiments. For an exact result, try an infinite series of infinite experiments. Sigh.

The urn model corresponds in a natural way to the randomized version of the factorial-square problem, where we look at \(n! + k = m^2\) and choose \(k\) at random from an appropriate range of values. But what about the original problem of \(n! + 1 = m^2\)? In this case there’s no random variable, and hence there’s no point in running multiple trials for each value of \(n\). The system is deterministic. For each \(n\) the factorial of \(n\) has a definite value, and either it is or it isn’t adjacent to a perfect square. There’s no maybe.

Nevertheless, there might be a way to sneak probabilities in through the back door. To do so we have to assume that factorials and squares form a kind of ergodic system, where observing one chain of events for a long period is equivalent to watching many shorter chains. Suppose that factorials and squares are uncorrelated in their positions on the number line—that when a factorial lands between two squares, its distance from the larger square can be treated as a random variable, with every possible distance being equally likely. If this assumption holds, then instead of looking at one value of \(n!\) and trying many random values of \(k\), we can adopt a single value of \(k\) (namely \(k = 1\)) and look at \(n!\) for many values of \(n\).

Is the ergodic assumption defensible? Not entirely. Some distances between \(n!\) and \(m^2\) are known to be more likely than others, and indeed some distances are impossible. However, the empirical evidence suggests that the deviations must be slight. The histogram below shows the distribution of distances between a factorial and the next larger square for the first \(100{,}000\) values of \(n!\). The distances have all been normalized to the range \((0, 1)\) and classified in \(100\) bins. There is no obvious sign of bias. Calculating the mean and standard deviation of the same \(100{,}000\) relative distances yields values within \(1\) percent of those expected for a uniform random distribution. (The expected values are \(\mu = 1/2\) and \(\sigma = 1/\sqrt{12} \approx 0.2887\).)
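A scaled-down version of this measurement is easy to set up. The sketch below reflects my reading of the normalization: the distance from \(n!\) up to the next square \(m^2\), divided by the gap \(2m - 1\). The function name and the sample size of 2,000 are illustrative choices.

```julia
using Statistics

# Normalized distance from n! up to the next square m^2, expressed as a
# fraction of the gap 2m - 1 between (m - 1)^2 and m^2.
function relative_distances(maxn)
    d = Float64[]
    fac = big(1)
    for n in 1:maxn
        fac *= n
        m = isqrt(fac) + 1                          # smallest m with m^2 > n!
        push!(d, Float64((m^2 - fac) / (2m - 1)))   # BigInt / BigInt yields a BigFloat
    end
    return d
end

d = relative_distances(2000)
println("mean = ", mean(d), ", std = ", std(d))
println("uniform on (0, 1): mean = 0.5, std = ", 1 / sqrt(12))
```

With only 2,000 samples the agreement will be looser than the 1 percent quoted for the full run.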

If this probabilistic approach can be taken seriously, I can make some quantitative statements about the prospects for ever finding a large factorial adjacent to a perfect square. As mentioned above, the overall probability that one or more values of \(n! + 1\) are equal to squares is about \(0.7078573023\). Thus we should not be too surprised that three such cases are already known, namely the examples with \(n = 4, 5,\) and \(7\). Now we can apply the same method to calculate the probability of finding at least one more case with \(n \gt 7\). Let’s make the question more general: “Whether or not I have seen any squares among the first \(C\) values of \(n! + 1\), what are the chances I’ll see any thereafter?” To answer this question, we can just remove the first \(C\) elements from the infinite product:

\[\prod_{n = C+1}^{\infty} \left(1 - \frac{1}{1 + 2\sqrt{n!}}\right).\]

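This truncated product is the probability of seeing no further squares; its complement is the probability of seeing at least one more. For \(C = 7\), the complement is about \(0.0037\). For \(C = 100\), it’s about \(5.7 \times 10^{-80}\). We are sliding down the steep slope of the red curve.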
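Here is one way the \(C = 7\) figure might be reproduced. The sketch sums logarithms and uses `log1p` and `expm1` on BigFloat values, so that the tiny difference between the product and \(1\) is not lost to rounding. The function name and the cutoff of 60 terms are my own choices.

```julia
# Probability of at least one square among n! + 1 for n > C, estimated as
# 1 - prod(1 - 1/(1 + 2*sqrt(n!))) over n = C+1, C+2, ...
# The product is truncated after `nterms` factors (an arbitrary cutoff).
function prob_square_after(C; nterms = 60)
    total = big(0.0)                       # running sum of log(1 - p_n)
    fac = factorial(big(C))
    for n in (C + 1):(C + nterms)
        fac *= n
        p = 1 / (1 + 2 * sqrt(BigFloat(fac)))
        total += log1p(-p)
    end
    return -expm1(total)                   # 1 - exp(total), computed stably
end

println(Float64(prob_square_after(7)))
```

The first omitted factor differs from \(1\) by far less than the last one included, so the cutoff has no visible effect on the result.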

As a practical matter, further searching for another factorial-square couple does not look like a promising way to spend time and CPU cycles. The probability of success soon falls into the realm of ridiculously small numbers like \(10^{-1{,}000{,}000}\). And yet, from the mathematical point of view, the probability never vanishes. Removing a finite number of terms from the front of an infinite product cannot change its convergence properties. If the original product converged to a nonzero value, then so will the truncated version. Thus we have wandered into the canyon of maximal frustration, where there’s no realistic hope of finding the prize, but the probabilities tell us it still might exist.

I am going to close this shambling essay by considering one more example—a cautionary one. Suppose we apply probabilistic reasoning to the search for a cube that is \(1\) less than a square. If we were looking for exact matches between cubes and squares, we’d find plenty of them: They are the sixth powers: \(1, 64, 729, \ldots\). But integer solutions to the equation \(n^3 + 1 = m^2\) are not so abundant. One low-lying example is easy to find: \(2^3 + 1 = 3^2\), but after 8 and 9 where can we expect to see the next consecutive cube and square?

The probabilistic approach suggests there might be reason for optimism. Compared with factorials and Fibonaccis, cubes grow quite slowly; the rate is polynomial rather than exponential or superexponential. As a result, the probability of finding a cube at a given distance from a square falls off much less steeply than it does for \(n!\) or \(F(n)\). In the graph below, \(P(n^3 + k = m^2)\) is the orange curve.

Note that the orange curve lies just below the blue one, which represents the probability that \(n!\) lies near a prime. The proximity of the two curves suggests that the two problems—factorials adjacent to primes, cubes adjacent to squares—might belong to the same class. We already know that factorial primes do seem to go on and on, perhaps endlessly. The analogy leads to a surmise: Maybe cube-square coincidences are also unbounded. If we keep looking, we’ll find lots more besides \(8\) and \(9\).

The surmise is utterly wrong. The problem has a long history. In 1844 Eugène Catalan conjectured that \(8\) and \(9\) are the only consecutive perfect powers among the integers; the conjecture was finally proved in 2004 by Preda Mihăilescu. For the special case of squares and cubes, Euler had already settled the matter in the 18th century. Thus, probabilities are beside the point.
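A brute-force search is consistent with the theorem, though of course it proves nothing by itself. This is a minimal sketch with an illustrative function name; it uses 64-bit integers, which are exact for the range searched.

```julia
# Brute-force search for n^3 + 1 = m^2 with 1 <= n <= maxn.
# Int64 arithmetic is exact as long as maxn^3 + 1 fits in 64 bits,
# which holds for maxn up to about 2 million.
function search_cube_sqr(maxn)
    hits = Tuple{Int,Int}[]
    for n in 1:maxn
        c = n^3 + 1
        r = isqrt(c)                       # floor of sqrt
        r * r == c && push!(hits, (n, r))
    end
    return hits
end

println(search_cube_sqr(1_000_000))
```

The only pair reported is \((2, 3)\), corresponding to \(2^3 + 1 = 3^2\).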

All of the questions considered here belong to the category of Diophantine analysis—the study of equations whose solutions are required to be integers. It is a field notorious for problems that are easy to state but hard to solve. Catalan’s conjecture is one of the most famous examples, along with Fermat’s Last Theorem. When Diophantine problems are ultimately resolved, the proofs tend to be non-elementary, drawing on sophisticated tools from distant realms of mathematics—algebraic geometry in the proof of Fermat’s Last Theorem by Andrew Wiles and Richard Taylor, cyclotomic fields in Mihăilescu’s proof of the Catalan conjecture. As far as I know, probability theory has not played a central role in any such proof.

When I started wrestling with these questions a few weeks ago, I did not expect to discover a definitive solution. I’ve certainly fulfilled my expectations! As a matter of fact, in my own head the situation is more muddled now than it was at the outset. The realization that even an infinite series of experiments would not necessarily resolve some of the questions is deeply unsettling, and makes me wonder how much I really understand about probability theory. But that’s hardly unprecedented in mathematics. I suppose I’ll just have to get used to it.

**Update:** Thanks to a further tip from Tanton, I have learned that the problem has an extensive history, and also a name: Brocard’s problem, after Henri Brocard, who published on it in 1876 and 1885. Ramanujan mentioned it in 1913. Erdős conjectured there are no more solutions. Marius Overholt connected it with the abc conjecture. Bruce C. Berndt and William F. Galway established that there are no more solutions up to \(10^9\). All this comes from the Wikipedia entry on Brocard’s problem. That article also mentions (but does not explain) that the solutions are called Brown numbers.
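Brocard’s problem asks for values of \(n\) where \(n! + 1\) is a perfect square. The search is easy to sketch with exact integer arithmetic. The function name and the cutoff of 200 are my own assumptions, and this is nowhere near the range covered by Berndt and Galway.

```python
from math import isqrt

def brocard_solutions(limit):
    """Return all pairs (n, m) with n! + 1 == m^2 for 1 <= n <= limit."""
    solutions = []
    factorial = 1
    for n in range(1, limit + 1):
        factorial *= n                 # running value of n!
        m = isqrt(factorial + 1)       # exact integer square root
        if m * m == factorial + 1:
            solutions.append((n, m))
    return solutions

print(brocard_solutions(200))   # [(4, 5), (5, 11), (7, 71)]
```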

I have some more reading to do.

]]>Place numbers in the grid so that each outlined region contains the numbers 1 to *n*, where *n* is the number of squares in the region. The same number can never touch itself, not even diagonally.

Here is a partially completed example:

The black, pre-printed numbers are the “givens,” supplied by the puzzle creator. I filled in the pencil-written numbers in a sequence of “forced” moves dictated by two simple rules:

- A number can be placed in a square if no other number is allowed there. For example, the three singleton squares in the bottom row must each hold a 1, and these squares are the obvious place to start solving the puzzle. After the 1s are written in, the square outlined in yellow in the diagram below can also be filled in; its neighbors forbid any number other than 3.
- A number can be placed in a square if the number has no other possible home within a region. The blue-outlined 1 in the diagram below was determined by this rule. There must be a 1 somewhere in the region, but none of the other squares can accommodate it.

At this point in the solution process, with the grid in the state shown above, I was unable to find any other blank squares whose contents could be decided by following these two rules and no others. But I *did* spot a move based on a different kind of reasoning. Consider the two pairs of open squares marked in color:

The salmon-pink squares must hold the numbers 2 and 5, but it’s not immediately clear which number goes in which square. Likewise the lime-green squares must hold 2 and 4, in one order or the other. I submit that the numbers must have the following arrangement:

How do I justify that choice? Suppose the green 2 and 4 were transposed:

Then the pink 2 and 5 could be placed in either permutation, and no later moves elsewhere in the puzzle would ever resolve the ambiguity. This outcome is not acceptable if we assume the puzzle must have a unique solution. The uniqueness constraint might be expressed as a third rule:

- A number can be placed in a square if it is needed to prevent other squares from having multiple legal configurations.

I have vague qualms about this mode of puzzle-solving. It’s surely not cheating, but the third rule has a different character from the others. It exploits an assumed global property of the solution, rather than relying on local interactions. We are not making a choice because it is forced on us; we are choosing a configuration that will force a choice elsewhere.

In this particular puzzle it’s not actually necessary to apply the uniqueness constraint. There is at least one other pathway to a solution—which I’ll leave to you to find. Can we devise a puzzle that *requires* rule 3? I’m not quite sure the question is even well-formed. All constraint-satisfaction problems can be solved by a mindless brute-force algorithm: Just write in some numbers at random until you reach a contradiction, then backtrack. So if we want to force the solver to use a specific tool, we somehow have to outlaw that universal jackhammer.
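The universal jackhammer is also the obvious way to test whether a puzzle has a unique solution: count the solutions instead of stopping at the first one. Below is a minimal sketch. It fills squares in a fixed order rather than at random, and the data layout (regions as lists of row-column pairs, givens as a dictionary) is my own invention, as is the tiny 2×3 example grid, which has nothing to do with the puzzle pictured above.

```python
def count_solutions(rows, cols, regions, givens):
    """Count the fillings of a Capsules-style grid by backtracking.

    regions -- list of regions, each a list of (row, col) cells
    givens  -- dict mapping (row, col) to a pre-printed number
    """
    region_of = {cell: region for region in regions for cell in region}
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    grid = dict(givens)

    def allowed(cell, value):
        r, c = cell
        # The same number may not touch itself, even diagonally.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr or dc) and grid.get((r + dr, c + dc)) == value:
                    return False
        # Each number appears at most once within a region.
        return all(grid.get(other) != value
                   for other in region_of[cell] if other != cell)

    def solve(i):
        if i == len(cells):
            return 1                       # every square filled: one solution
        cell = cells[i]
        if cell in givens:
            return solve(i + 1)
        total = 0
        for value in range(1, len(region_of[cell]) + 1):
            if allowed(cell, value):
                grid[cell] = value
                total += solve(i + 1)
                del grid[cell]             # backtrack
        return total

    return solve(0)

# Hypothetical grid: a 2x2 region on the left, a 2x1 region on the right.
regions = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(0, 2), (1, 2)]]
print(count_solutions(2, 3, regions, {}))                                  # 8
print(count_solutions(2, 3, regions, {(0, 0): 1, (0, 1): 3, (0, 2): 1}))   # 1
```

A puzzle whose count is greater than 1 is exactly the kind that rule 3 assumes away.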

The uniqueness constraint is not unique to the Capsules puzzle. I’ve encountered it often in kenkens, and occasionally in sudokus. I even have a sense of *deja lu* as I write this. I feel sure I’ve read a discussion of this very issue, somewhere in recent years, but I haven’t been able to lay hands on it. Pointers to precedents are welcome.

**Addendum 2017-03-19**: Jim Propp reminds me of his marvelous Self Referential Aptitude Test. The instructions begin:

The solution to the following puzzle is unique; in some cases the knowledge that the solution is unique may actually give you a short-cut to finding the answer to a particular question.

I completed the 20-question puzzle when SRAT first went public some years ago. This morning I found I was able to do it again with no diminution in enjoyment—or effort. I remembered none of the answers or the sequence of deductions needed to find them.

Highly recommended. And while you’re at it, check out Propp’s Mathematical Enchantments blog and his Twitter feed: @JimPropp.

]]>The answer to Tanton’s question is surely No: The series will never again land on an integer. I leaped to that conclusion immediately after reading the definition of the series and glancing at the first few terms. But what makes me so sure? Can I prove it?

I wrote a quick program to generate more terms:

1 2 5/2 17/6 37/12 197/60 69/20 503/140 1041/280 9649/2520 9901/2520 111431/27720 113741/27720 1506353/360360 1532093/360360 1556117/360360 3157279/720720 54394463/12252240 18358381/4084080 352893319/77597520

Overall, the trend visible in these results seemed to confirm my initial intuition. When the fractions are expressed in lowest terms, the denominator generally grows larger with each successive term. Looking at the terms more closely, it turns out that the denominators tend to be products of many small primes, whereas the numerators are either primes or products of a few comparatively large primes. For example:

\[\frac{9649}{2520} = \frac{9649}{2^3 \cdot 3^2 \cdot 5 \cdot 7} \qquad \textrm{and} \qquad \frac{18358381}{4084080} = \frac{59 \cdot 379 \cdot 821}{2^4 \cdot 3 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17}.\]

To produce an integer, we need to cancel all the primes in the factorization of the denominator by matching primes in the numerator; given the pattern of these numbers, that looks like an unlikely coincidence.

But there is reason for caution. Note the seventh term in the sequence, where the denominator has decreased from \(60\) to \(20\). To understand how that happens, we can run through the calculation of the term, which starts by summing the six previous terms.

\[\frac{60}{60} + \frac{120}{60} + \frac{150}{60} + \frac{170}{60} + \frac{185}{60} + \frac{197}{60} = \frac{882}{60}.\]

Then we calculate the mean, and add 1 to get the seventh term:

\[\require{cancel}\frac{882}{60} \cdot \frac{1}{6} = \frac{882}{360} = \frac{\cancel{2} \cdot \cancel{3} \cdot \cancel{3} \cdot 7 \cdot 7}{\cancel{2} \cdot 2 \cdot 2 \cdot \cancel{3} \cdot \cancel{3} \cdot 5} = \frac{49}{20}, \qquad \frac{49}{20} + 1 = \frac{69}{20}.\]

Cancelations reduce the numerator and denominator of the mean by a factor of 18. It seems possible that somewhere farther out in the sequence there might be a term where *all* the factors in the denominator cancel, leaving an integer.

Another point to keep in mind: For large \(n\), the value of the Tanton function grows very slowly. Thus if integer values are not absent but merely rare, we might have to compute a huge number of terms to get to the next one. Reaching the neighborhood of 100 would take more than \(10^{40}\) terms.

So what do you think? Can we prove that no further integers appear in Tanton’s sequence? Or, on the contrary, might my instant conviction that no such integers exist turn out to be an alternative fact?

I’ve had my fun with this problem. I know the answer now, but I’m not going to reveal it yet. Others also deserve a chance to be distracted, or anaesthetized. I’ll be back in a few days to follow up—unless commenters explain what’s going on so thoroughly there’s nothing left for me to say.

**Update 2017-01-30**: Okay, pencils down. Not that anyone needs more time. As usual, my readers are way ahead of me. (See comments below, if you haven’t read them already.)

My own slow and roundabout voyage of discovery went like this. I had written a little piece of code for printing out *n* terms of the series, directly implementing the definition given in James Tanton’s tweet:

```
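from fractions import Fraction as F
from statistics import mean

def tanton(n):
    # Print the first n terms of Tanton's sequence as exact fractions.
    seq = [F(1)]                   # the sequence starts at 1
    for i in range(n):
        print(seq[i])
        seq.append(mean(seq) + 1)  # next term: mean of all terms so far, plus 1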
```

But this is criminally inefficient. On every pass through the loop we calculate the mean of the entire sequence, then throw that work away and do it all again the next time. Once you have the mean of \(n-1\) terms, isn’t there some way of updating it to incorporate the *n*th term? Well, yes, of course there is. You just have to appropriately weight the new term, dividing by *n*, before adding it to the mean. Here’s the improved code:

```
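from fractions import Fraction as F

def faster_tanton(n):
    # Print the first n terms without recomputing the mean on every pass.
    m = F(1)
    for i in range(1, n + 1):  # n passes, so n terms are printed, as in tanton(n)
        print(m)
        m += F(1, i)           # each term exceeds its predecessor by 1/i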
```
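For anyone who wants to see why the increment is exactly \(1/i\), here is the algebra as I reconstruct it. The notation is mine, not Tanton’s: write \(a_n\) for the *n*th term, so that \(a_1 = 1\) and each later term is 1 plus the mean of all its predecessors.

\[a_{n+1} = 1 + \frac{1}{n}\sum_{i=1}^{n} a_i \quad\Longrightarrow\quad n\,(a_{n+1} - 1) = \sum_{i=1}^{n} a_i \qquad \textrm{and} \qquad (n-1)\,(a_{n} - 1) = \sum_{i=1}^{n-1} a_i.\]

Subtracting the second sum from the first leaves only \(a_n\) on the right:

\[n\,a_{n+1} - (n-1)\,a_n - 1 = a_n \quad\Longrightarrow\quad a_{n+1} = a_n + \frac{1}{n}.\]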

Tracing the execution of this function, we start out with 1, then add 1, then add 1/2, then 1/3, then 1/4, and so on. This is 1 plus the harmonic series. That series is defined as:

\[H_{n} = \sum_{i=1}^{n} \frac{1}{i} = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\]

The first 10 partial sums are:

1 3/2 11/6 25/12 137/60 49/20 363/140 761/280 7129/2520 7381/2520

One fact about the harmonic series is very widely known: It diverges. Although \(H_{n}\) grows very slowly, that growth continues without bound as \(n\) goes to infinity. Another fact, not quite as well known but of prime importance here, is that no term of the series after the first is an integer. The simplest proof shows that when you factor the numerator and the denominator, the denominator always has more \(2\)s than the numerator; thus when the fraction is expressed in lowest terms, the numerator is odd and the denominator even. This proof can be found in various places on the internet, such as StackExchange. There’s also a good explanation in Julian Havil’s book *Gamma: Exploring Euler’s Constant*.
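The parity claim is easy to check by machine for as many terms as you have patience for. The sketch below (function names are mine) also tests a slightly sharper statement that I take to be the heart of the proof: the power of 2 in the reduced denominator of \(H_n\) is the largest power of 2 that does not exceed \(n\).

```python
from fractions import Fraction

def two_adic_valuation(m):
    """Exponent of the largest power of 2 dividing the positive integer m."""
    v = 0
    while m % 2 == 0:
        m //= 2
        v += 1
    return v

def check_harmonic_parity(limit):
    """Check H_2 .. H_limit: odd numerator, even denominator in lowest terms."""
    h = Fraction(1)                        # H_1
    for n in range(2, limit + 1):
        h += Fraction(1, n)                # Fraction keeps h in lowest terms
        if h.numerator % 2 == 0 or h.denominator % 2 == 1:
            return False
        # 2**(n.bit_length() - 1) is the largest power of 2 not exceeding n
        if two_adic_valuation(h.denominator) != n.bit_length() - 1:
            return False
    return True

print(check_harmonic_parity(1000))   # True
```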

Neither of those sources mentions anything about the origin or author of the proof. When I scouted around for more information, I found more than a dozen sources that attribute the proof to “Taeisinger 1915,” but with no reference to an original publication. For example, a recent paper by Carlo Sanna (*Journal of Number Theory*, Vol. 166, September 2016, pp. 41–46) mentions Taeisinger and cites Eric Weisstein’s *Concise Encyclopedia of Mathematics*; in the online version of that work, Taeisinger is indeed credited with the theorem, but the only reference is to another secondary source, Paul Hoffman’s biography of Erdős, *The Man Who Loved Only Numbers*; there, on page 157, Hoffman writes, “In 1915, a man named Taeisinger proved. . .” and gives no reference or further identification. So who was this mysterious and oddly named Taeisinger? I have never heard of him, and neither has MathSciNet or the Zentralblatt or the MacTutor math biography pages. In *Number Theory: A Historical Approach*, John J. Watkins gives a slender further clue: The first initial “L.”

After some further rummaging through bookshelves and online material, I finally stumbled on a reference to a 1915 publication I could actually track down. In the *Comptes Rendus Mathematique* (Vol. 349, February 2011, pp. 115–117) Rachid Aït Amrane and Hacène Belbachir include this item in their list of references:

L. Taeisinger, Bemerkung über die harmonische Reihe, Monatsch. Math. Phys. 26 (1915) 132–134.

When I got ahold of that paper, here’s what I found:

Not Taeisinger but Theisinger!

I still don’t know much of anything about Theisinger. His first name was Leopold; he came from Stockerau, a small town in Austria that doesn’t seem to have a university; he wrote on geometry as well as number theory.

What I *do* know is that a lot of authors have been copying each other’s references, going back more than 20 years, without ever bothering to look at the original publication.

Spoiler alert: Everybody dies.

The setting is Melbourne, Australia, the southernmost major city on the planet. The entire population of the Northern Hemisphere was wiped out in the war, and airborne radioactivity is slowly creeping across the Equator. Darwin and Cairns, on Australia’s north coast, are already ghost towns, and the people of Melbourne are told they have less than a year to go.

A U.S. submarine takes refuge in Melbourne’s harbor. Over a period of weeks the captain of that vessel, Dwight, forms an attachment to a young woman named Moira. There’s affection on both sides, and maybe passion, but Dwight is determined to remain faithful to his wife and children back in Connecticut. He buys them presents: a diamond bracelet, a fishing rod, a pogo stick. He speaks of them in the present tense. Is Dwight delusional? Not exactly. He knows perfectly well that his family are all dead, and that he’ll never rejoin them except in the sense that he too will soon be dead. But those deaths are abstractions.

He had seen nothing of the destruction of the war . . . ; in thinking of his wife and of his home it was impossible for him to visualize them in any other circumstances than those in which he had left them. He had little imagination, and that formed a solid core for his contentment in Australia.

It’s not just Dwight who lacks imagination—or chooses to ignore the truths it reveals. Moira studies shorthand and typing for a future job that will never exist. Her father harrows fields for crops that will never grow. Another couple plant hundreds of daffodils whose blooms they will never see, and they invest in a lawn mower for grass they’ll never cut.

The author himself seems to share this selective connection to reality. Everyone in his doomed society is unfailingly polite, and usually cheerful. Civilization may be ending, but not civility. There’s not a single act of violence or malice or even selfishness in the entire story. Shute mentions no hoarding or profiteering, much less rape and pillage. No marauding bandits or desperate refugees from the contaminated north descend on this last haven. *On the Beach* is the antithesis of that other Australian vision of apocalypse: the Mad Max movies. (The first of the series, in 1979, was filmed near Melbourne.)

It’s also worth noting that no one in Shute’s world takes any steps to prolong life. The government is not hollowing out mountains to keep the human germ line going until the atmosphere clears. Families are not digging fallout shelters in the back yard. These last few representatives of *Homo sapiens* may indulge in a variety of follies, but hope isn’t one of them.

Why am I writing about this sad book just now? Well, obviously, it’s inauguration day. Which feels more like termination day.

The threat of nuclear disaster has continued to shadow us through all the years since Shute wrote his novel. The danger of a planet-scouring war seemed particularly urgent when I was 13 and reading *On the Beach* for the first time. I stood up in front of my eighth-grade English class to give an oral report on the book. My performance was not interrupted by a duck-and-cover drill, but it could have been.

Now we have handed control of 4,700 nuclear warheads to a petulant brat, and the danger seems greater than ever.

Revisiting that sense of menace is why I picked up the book, but it’s not what has made the strongest impression on me this second time around. I am both drawn to and appalled by the stoic acceptance of Shute’s fictional Melbournites. Given their circumstances, their reaction is not inappropriate. The worst has already happened, there’s nothing they can do to change it, they may as well make the best of it. In the face of certain extinction, what can you do but shrug your shoulders? Maybe the best way of muddling through is just to plant some daffodils.

Given the current mood of the nation and the world, I suddenly find it easier to understand Dwight’s behavior. The urge to pretend is powerful. I too want to believe that life can go on as normal, that I can continue to enjoy the private pleasures of family and friends, that I can retreat to a cozy office or library and lose myself in the world of ideas, in the “less fretful cosmos” of mathematics and science, or art and literature for that matter.

But we are not yet huddled on the beach, the last of the doomed. It’s late, but not yet too late. This is not the moment for resignation and acquiescence. Tomorrow we march!

]]>**Carey’s Equality**. Has everyone but me known all about this for ages and ages?

In a stationary population—where births equal deaths—the number of individuals who have lived *a* years is the same as the number who still have *a* years left to live. Here’s a more precise statement from James W. Vaupel of the Max-Planck-Institute for Demographic Research:

If an individual is chosen at random from a stationary population with a positive force of mortality at all ages, then the probability the individual is one who has lived *a* years equals the probability the individual is one who has that number of years left to live. For example, it is as likely the individual is age 80 as it is the individual has 80 years to live—not 80 years of remaining life expectancy but a remaining lifetime of precisely 80 years.

Is this fact obvious, a trivial consequence of symmetry? Or is it deep and mysterious? Apparently it was not clearly recognized until about 10 years ago, by James R. Carey, a biological demographer at UC Davis and UC Berkeley who was studying the age structure of fruitfly populations. The equality was proved in 2009 by Vaupel. A more general statement of the theorem and a more mathematically oriented proof were published in 2014 by Carey and Arni S. R. Srinivasa Rao of Augusta University.
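A toy calculation helped me see where the symmetry comes from. The sketch below is my own discrete simplification, not the setting of the Vaupel or Carey–Rao proofs: lifespans are whole numbers of years, equal cohorts are born every year, and a person with lifespan \(L\) is counted once at each age \(0, 1, \ldots, L-1\). The lifespan distribution is made up for the example.

```python
from fractions import Fraction

def age_and_remaining(lifespan_pmf):
    """Exact distributions of age and of remaining lifetime.

    lifespan_pmf maps a whole-number lifespan L to its probability.
    Someone with lifespan L who is observed at age a has L - 1 - a years left.
    """
    age, remaining = {}, {}
    for lifespan, prob in lifespan_pmf.items():
        for a in range(lifespan):
            left = lifespan - 1 - a
            age[a] = age.get(a, 0) + prob
            remaining[left] = remaining.get(left, 0) + prob
    total = sum(age.values())              # equals the mean lifespan

    def normalize(weights):
        return {k: w / total for k, w in weights.items()}

    return normalize(age), normalize(remaining)

# A made-up lifespan distribution, in exact arithmetic.
pmf = {1: Fraction(1, 4), 3: Fraction(1, 2), 10: Fraction(1, 4)}
age, remaining = age_and_remaining(pmf)
print(age == remaining)   # True
```

In this version the equality is visible in the inner loop: as the age runs up from \(0\) to \(L-1\), the years remaining run down over the same values, so each lifespan contributes identical weights to both tallies.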

I learned all this from a wide-ranging talk by Rao: “From Fibonacci to Alfred Lotka and beyond: Modeling the dynamics of population and age-structures.”

**Go with the Green**. Every weekday you walk from your home at the corner of 1st Avenue and 1st Street to your office at 9th Avenue and 9th Street. Since your city is laid out with a perfectly rectilinear grid, you have to go eight blocks east and eight blocks north. Assuming you never waste steps by turning south or west, or by straying outside the bounding rectangle, how many routes can you choose from?

It would be quite a chore to count the paths one by one, but combinatorics comes to the rescue. The answer is \(\binom{16}{8}\), the number of ways of choosing eight items (such as eastbound or northbound blocks) from a set with 16 members:

\[\binom{16}{8} = \frac{16!}{(8!)(8!)} = 12{,}870{.}\]

You could walk to work for 50 years without ever taking the same route twice. Which of those 12,870 paths is the shortest? That’s the beauty of the Manhattan metric: They all are. Every such path is exactly 16 blocks long.
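If you don’t trust the binomial coefficient, the count can be confirmed by a different route: the number of ways to reach an intersection is the sum of the ways to reach its western and southern neighbors. A minimal sketch, with a function name of my own choosing:

```python
from math import comb

def count_routes(east, north):
    """Count routes that go `east` blocks east and `north` blocks north."""
    # paths[i][j] = number of routes from home to the corner i blocks east
    # and j blocks north; corners on the south or west edge have one route.
    paths = [[1] * (north + 1) for _ in range(east + 1)]
    for i in range(1, east + 1):
        for j in range(1, north + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1]
    return paths[east][north]

print(count_routes(8, 8), comb(16, 8))   # 12870 12870
```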

But just because the routes are equally long doesn’t mean they are equally fast. Suppose there’s a traffic light at every intersection. Depending on the state of the signal, you can proceed either north or east without interruption, but you’ll have to wait for the light to change if you want to cross the other way. A sensible strategy, it seems, is always to go with the green if you can. Following this rule, you will never have to wait for a light unless you are on the north or the east boundary edge of the square.

The street grid with traffic lights came up in a talk by Ivan Corwin of Columbia University, titled “A Drunk Walk in a Drunk World.” The more conventional term for this subject is “random walks in random environments.” In an ordinary random walk (with a *non*random environment), the walker chooses a direction at each step according to a fixed probability distribution—the same at all sites and at all times. With a random environment, the probabilities vary both with position and with time. In a brief aside, Corwin offered the street grid with traffic lights as an example of a random environment. If the lights are uncorrelated on the time scale of a pedestrian’s progress through the grid, the favored direction at any intersection is an independent random variable. Then the following question arises: If the walker always takes the green-light direction when that’s possible, which paths are the most heavily traveled?

Corwin’s answer is that the walker will likely follow a stairstep path, never venturing very far from the diagonal drawn between home and office. Thus even though the distance metric says all routes are equal, the walker winds up approximating the Euclidean shortest path.

Corwin gave no proof of his assertion, although he did show the result of a computer simulation. After ruminating on the problem for a while, I think I understand what’s going on. One way of thinking about it is to break the 16-block walk into two eight-block segments, then consider the single vertex that the two segments have in common. Suppose the common point is the central intersection at 5th Avenue and 5th Street. There are 70 ways of getting from home to this point, and for each of those paths there are another 70 ways of continuing on to the office. Thus 4,900 paths pass through the center of the grid. In contrast, only one path goes through the corner of 9th Avenue and 1st Street. The same kind of analysis can be applied recursively to show that the initial eight-block segment of the walk is more likely to pass through 3rd Avenue and 3rd Street than through 5th Avenue and 1st Street.
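Counting paths can be turned into exact probabilities if we are willing to make the model explicit. The sketch below assumes (my assumption, not necessarily Corwin’s) that each light is an independent fair coin, green for east or green for north, and that on the east or north boundary the walker simply continues in the only direction left. It computes the chance that a go-with-the-green walker passes through each intersection.

```python
from fractions import Fraction

def visit_probabilities(n=8):
    """p[i][j] = probability of passing through the corner i blocks east
    and j blocks north of home, on an n-by-n block grid."""
    half = Fraction(1, 2)
    p = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    p[0][0] = Fraction(1)
    for i in range(n + 1):
        for j in range(n + 1):
            if i == n and j == n:
                continue                   # arrived at the office
            if i == n:
                east = Fraction(0)         # east boundary: must go north
            elif j == n:
                east = Fraction(1)         # north boundary: must go east
            else:
                east = half                # interior: follow the green light
            if i < n:
                p[i + 1][j] += p[i][j] * east
            if j < n:
                p[i][j + 1] += p[i][j] * (1 - east)
    return p

p = visit_probabilities()
print(p[4][4], p[8][0])   # 35/128 1/256
```

Under these assumptions the walker passes through 5th Avenue and 5th Street about 27 percent of the time, but reaches 9th Avenue and 1st Street only once in 256 trips.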

Another way to look at it is that it’s all about the binomial theorem and Pascal’s triangle. The binomial coefficient \(\binom{n}{m}\) is largest when \(m = n/2\), making the “middle-way” paths the likeliest.

This argument says that always going with the green will give you the fastest route across town (at least in terms of expectation value), and the route you follow is likely to lie near the diagonal. What the argument *doesn’t* say is that deliberately biasing your choices so that you stay near the diagonal will get you to work sooner; that’s clearly *not* true.

When I mentioned Corwin’s example to my friends Dan Silver and Susan Williams, Susan immediately pointed out that the model fails to capture some important features of walking in an urban environment. Streets have two sides, and generally two sidewalks. To get from the southwest corner of an intersection to the northeast corner, you need *two* green lights. I’m not sure whether the conclusions hold up when these complications are taken into account.

I should add that solving this citified problem was not the main point of Corwin’s talk. Instead, he was addressing the problem of a bartender who wants to build a tavern in rough and ever-changing terrain near the rim of the Grand Canyon. The bartender needs to know how close he can come to the edge without endangering inebriated customers who might wander over the cliff.

**TASEP**. I’m a sucker for simple models of complex behavior. This week I learned of a new one—new to me, anyway. Jinho Baik of the University of Michigan talked about TASEP, a “totally asymmetric simple exclusion process” (admittedly not the most vividly descriptive name). Here’s what little I understand of the model so far.

The setting is a one-dimensional lattice, which could be either an infinite line or a closed loop of finite size. Some lattice sites are vacant and some are occupied by a particle. (No site can ever host multiple particles.) At random intervals—random with an exponential distribution—a particle “wakes up” and tries to move one space to the right (on a line) or one space clockwise (on a loop). The move succeeds if the adjacent site is vacant; otherwise the particle goes back to sleep until the next time the exponential alarm clock rings. Given some initial distribution of the particles, how does that distribution evolve over time?

When I see a model like this one, my impulse is to write some code and see what it looks like in action. I haven’t yet done that, but this is my current understanding of what I should expect to see. If you start with the smoothest possible particle distribution (alternating occupied and vacant sites), the particles will tend to clump together. If you start with a maximally clumpy state (one area solidly filled, another empty), the particles will tend to spread out. Baik and his colleagues seek a more precise description of how the density fluctuations evolve over time. And they have found one! Unfortunately, I’m not yet prepared to explain it, even in my hand-waviest way. The best I can do is refer you to the most recent paper by Baik and Zhipeng Liu.
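As a first step toward that code, here is a minimal sketch of the loop version, based only on the description above and not on anything in the Baik–Liu paper. I assume every particle carries its own rate-1 exponential clock; the function name and the starting configuration are mine.

```python
import random

def simulate_tasep(sites, events, seed=None):
    """Run TASEP on a ring.

    sites  -- list of booleans, True where a particle sits
    events -- number of alarm-clock rings to simulate
    Returns the final configuration and the elapsed model time.
    """
    rng = random.Random(seed)
    sites = list(sites)
    size = len(sites)
    particles = [i for i, occupied in enumerate(sites) if occupied]
    elapsed = 0.0
    if not particles:
        return sites, elapsed
    for _ in range(events):
        # With a rate-1 clock on each particle, the wait until the next ring
        # is exponential with rate len(particles), and the ring is equally
        # likely to belong to any of the particles.
        elapsed += rng.expovariate(len(particles))
        k = rng.randrange(len(particles))
        here = particles[k]
        there = (here + 1) % size          # one step clockwise
        if not sites[there]:               # exclusion: move only if vacant
            sites[here], sites[there] = False, True
            particles[k] = there
    return sites, elapsed

# A maximally clumpy start: half the ring full, half empty.
clumpy = [True] * 20 + [False] * 20
final, elapsed = simulate_tasep(clumpy, 2000, seed=1)
print("".join("#" if occupied else "." for occupied in final))
```

Printing the ring at intervals, rather than only at the end, would show how the clump erodes from its leading edge.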

**Debunking Guy**. If you ever have an opportunity to hear Doron Zeilberger speak, don’t pass it up. At this meeting he gave a spirited and inspiring defense of experimental mathematics, under the title “Debunking Richard Guy’s Law of Small Numbers.” Sitting in the front row was 100-year-old Richard Guy. Neither one of them was in any way daunted by this confrontation. In any case, Doron’s talk was more *homage* than attack. Later, I had a chance to ask Guy what he thought of it. “His heart is in the right place,” he said.

Guy’s Strong Law of Small Numbers says:

There aren’t enough small numbers to meet the many demands made of them.

As a consequence, if you discover that \(f(n)\) yields the same value as \(g(n)\) for several small values of \(n\), it’s not always safe to assume that \(f(n) = g(n)\) for all \(n\). Euler discovered a cautionary example that’s now well known: The polynomial \(n^2 + n + 41\) evaluates to a prime for all \(n\) from \(-40\) to \(+39\), but the streak ends at both edges of that range: at \(n = 40\) and \(n = -41\) the value is \(1681 = 41^2\).
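Euler’s example takes only a few lines to verify. This is a minimal sketch using trial division, which is plenty for numbers this small; the function names are mine.

```python
def is_prime(m):
    """Primality test by trial division, adequate for small m."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def euler(n):
    return n * n + n + 41

# The streak of primes covers every n from -40 through 39.
print(all(is_prime(euler(n)) for n in range(-40, 40)))   # True
# It breaks immediately on both sides.
print(euler(40), is_prime(euler(40)))                    # 1681 False
print(euler(-41), is_prime(euler(-41)))                  # 1681 False
```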

Zeilberger doesn’t deny the risk of mistaking such accidents for mathematical truths. As a matter of fact, he discusses some of the most dramatic examples: the Pisot numbers, some of which produce coincidences that persist for thousands of terms, and yet ultimately break down. But such pathologies are not a sign that “empirical” mathematics is useless, he says; rather, they suggest the need to refine our proof techniques to distinguish true identities from false coincidences. In the case of the Pisot numbers, he offers just such a mechanism.

A paper by Zeilberger, Neil J. A. Sloane, and Shalosh B. Ekhad (Zeilberger’s computer/collaborator) outlines the main ideas of the JMM talk, though sadly it cannot capture the theatrics.

**Soundararajan on Tao on Erdős**. Take a sequence of +1s and –1s, and add them up. Can you design the sequence so that the absolute value of the sum is never greater than 1? That’s easy: Just write down the alternating sequence, +1, –1, +1, –1, +1, –1, . . . . But what if, after you’ve selected your sequence, an adversary applies a rule that selects some subset of the entries? Can you still count on keeping the absolute value of the sum below a specified bound? This is a version of the Erdős discrepancy problem, which Paul Erdős first formulated in the 1930s.
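To see how badly the alternating sequence fares against even a simple adversary, here is a small sketch. It assumes the usual formulation of the problem, as I understand it, in which the adversary picks a step size \(d\) and keeps only the entries at positions \(d, 2d, 3d, \ldots\); the function name is mine.

```python
def max_discrepancy(x, max_step):
    """Largest |x_d + x_2d + ... + x_kd| over all steps d <= max_step
    and all k with k*d <= len(x). Positions are counted from 1."""
    worst = 0
    for d in range(1, max_step + 1):
        total = 0
        for pos in range(d, len(x) + 1, d):
            total += x[pos - 1]
            worst = max(worst, abs(total))
    return worst

# +1, -1, +1, -1, ... for positions 1 through 100
alternating = [1 if pos % 2 == 1 else -1 for pos in range(1, 101)]
print(max_discrepancy(alternating, 1))   # 1
print(max_discrepancy(alternating, 2))   # 50
```

With \(d = 2\) the adversary keeps only the –1s, so the sum drifts downward without limit as the sequence gets longer.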

The question was finally given a definitive answer in 2015 by Terry Tao of UCLA. In the “Current Events” session of the JMM, Kannan Soundararajan of Stanford gave a lucid account of the proof. You can read it for yourself, along with three other Current Events talks, by downloading the *Bulletin*.

**Proust’s Powdered-Wig Party**. Finally, a personal note. In the closing pages of Marcel Proust’s immense novel *A la Recherche du Temps Perdu*, the narrator attends a party where he runs into many old friends from Parisian high and not-so-high society. He is annoyed that no one told him the party was a costume ball: All of the guests are wearing white powdered wigs, as if they were gathering at the court of Louis XIV. Then the narrator catches sight of himself in a mirror and realizes that he too is coiffed in white.

At these annual math gatherings I run into people I have known for 30 years or more. For some time I’ve been aware that the members of this cohort, including me, are no longer in the first blush of youth. This year, however, the powdered wigs have seemed particularly conspicuous. Everyone I talk to, it seems, is planning for imminent retirement.

But of course this geriatric impression owes more to selection effects than to the aging of the mathematical population overall. Indeed, the corridors here are full of youngsters attending their first or third or fifth JMM. Which brings us back to Carey’s Equality. If we can safely assume that the population of meeting attendees is stationary, then the proportion of people who have been coming to these affairs for 30 years should be equal to the proportion who will attend 30 more meetings.

]]>What is truth? said jesting Pilate, and did not stay for an answer.

Lately there’s been a lot of news about fake news (some of it, for all I know, fake). Critics are urging Facebook, Google, and Twitter to filter out the fraudulent nonsense. This seems like a fine idea, but it presupposes that the employees—or algorithms—doing the filtering can reliably distinguish fact from fiction. Even if they *can* tell the difference, can we count on the companies to stand up to the prevaricators? Sure, Facebook can block traffic from a clickbait website run by a teenager in Macedonia. But what if the lies were to come from an account registered in the .gov domain?

When misinformation is stamped with the imprimatur of the president or other high government officials, there’s not much hope of shutting it down at the source or breaking the chain of transmission. This problem was not created by the new communication technologies of the internet age, and it is not unique to the incoming Trump administration. I have probably been lied to by every president who has served during my lifetime, and I could name seven of those presidents whose fibs are well documented. But Trump is different. He is not a *devious* liar, careful not to be caught in a contradiction. He is simply indifferent to truth. When challenged to support a dubious claim, he shrugs or changes the subject. The question of veracity seems not to interest him. And his election suggests that some part of the voting public feels the same way.

What to do? The only practical remedy I can suggest is to work diligently to uncover the truth, to publish it widely, and to help the public reach sound judgments about what to believe. All three of these tasks are difficult, but the last one, in my view, is the real stumper. As the signal-to-noise ratio in public discourse dives toward zero, we would *all* do well to sharpen our powers of discrimination. But I worry most about that subpopulation for whom strict factual accuracy is not the primary criterion when they choose stories to pass on to their friends and to embrace as the basis of important decisions. I don’t know how to change this, but I feel it’s important to try.

I’d like to begin with a more personal and less political anecdote. Some years ago, when the internet was young, a friend began sending me emails with subject lines like “Save 7 y.o. Jessica Mydek from cancer” or “Fw: Fw: FW: Fw: Bill No. 602P 5-cent tax on every email.” I would reply with a link to the debunking report at Snopes. My friend would thank me and sheepishly apologize, then the next month she would forward a message warning me not to blink my high beams if I saw a car with the headlights off—I’d be attacked by gang members conducting a rite of initiation. Email exchanges like these continued for a year or so, then they tapered off. Had my friend developed a measure of skepticism? Yes, but not in the way I had hoped. She had become skeptical of snopes.com. After all, it’s a website with a funny name, run by smug, self-appointed know-it-alls who make fun of gullible people. Why should she trust them?

Instead of an *ad hoc* watchdog like Snopes, maybe we should have an official arbiter of factuality, a certified and sanctified public agency. Call it the Ministry of Truth. And let’s give it enforcement powers: No social network or news outlet is allowed to publish anything unless the ministry attests to its accuracy.

Okay, that’s not such a hot idea after all.

In any case, no amount of scrupulous fact-checking would have cured my friend’s addiction to hoax email. There was something in those messages she wanted to believe. Even if 7 y.o. Jessica Mydek doesn’t exist, a world where chain letters can cure cancer is more appealing and empowering than the Snopesian world of grim facts, where you can only watch helplessly while a child dies. When you see a car driving without headlights, it’s more exciting to imagine a murderer at the wheel than a forgetful old fool. I’m sure my friend had her own doubts about some of these breathless pleas and warnings, but she was willing to overlook dodgy evidence or flawed logic for the sake of a good story.

As far as I know, my friend’s lax attitude toward factuality never caused grievous harm to herself or anyone else. But sometimes credulity can be disastrous.

Those who can make you believe absurdities can make you commit atrocities.

—Voltaire (paraphrase)

Let’s talk about Edgar Maddison Welch, the young man who showed up at the Comet Ping Pong pizzeria with a rifle and a handgun. By his own account, he sincerely believed he was going to rescue children being held captive in a basement room and subjected to unspeakable acts by Hillary Clinton and her associates. Where did that idea come from? Apparently it began with leaked emails from the hacked account of John Podesta, Clinton’s campaign chairman. According to an outline in the *New York Times*, eager sleuths on Reddit and 4chan discovered the phrase “cheese pizza” in the email texts, and recognized it as a code word for “child pornography.” Connecting the rest of the dots was easy and obvious: Podesta had corresponded with the owner of Comet Ping Pong, and Barack Obama had been photographed playing ping pong with a small boy, and so the basement of the restaurant must be where the Democrats slaughter their child sex slaves. However, the would-be rescuer with the AR-15 found no basement kill room—in fact, no basement at all. “The intel on this wasn’t 100 percent,” he told a *Times* reporter.

In case there’s even the slightest doubt, let me say plainly that I don’t believe a word of that grotesque tale about child abuse in the pizza parlor. Indeed, I can make sense of it only as a stupid joke, a parody, a deliberately preposterous confection. If I were fabricating such a malicious fiction, and if I wanted people to believe it, I would come up with something that’s not such a total affront to plausibility. Yet at least one reader of these fantasies took them in deadly earnest. We’ll never know how many more believe there might be a “grain of truth” in the story, even if specific details are wrong. And the purveyors of the myth are not backing down. In an AP story that ran in the *Times* on December 9 they propose that the Comet Ping Pong event was a “false flag,” yet another twist in the larger plot:

James Fetzer, a longtime conspiracy theorist who also believes the Sandy Hook school shooting was a hoax, told The Associated Press that Welch’s visit to the pizzeria was staged to distract the public from the truth of the “pizzagate” allegations. . . .

Fetzer and other conspiracy theorists seized on the fact that Welch had dabbled in movie acting as a giveaway that his visit to the restaurant was staged. . . . Blogger Joachim Hagopian, a false-flag proponent, told the AP that conspirators look for “a patsy or stooge” to pose as a lone gunman with an assault rifle. Welch, he said, “fits the pattern” with his acting background.

“He’s got an IMDB (Internet Movie Database) profile,” Hagopian said.

It’s easy to heap ridicule on these ideas. Indeed, by quoting them at length that’s exactly what I’m doing. How could anyone possibly believe in such contrived and convoluted schemes, such teetering towers of improbabilities? But it’s useful to keep in mind that the incredulity goes both ways. The conspiracy theorists would snigger at my naiveté for believing what I read in the *New York Times*. Anyone who’s paying attention knows that all the big papers and TV networks are parties to the conspiracy. (Snopes is surely in on it too.)

Mathematics alone proves, and its proofs are held to be of universal and absolute validity, independent of position, temperature or pressure. You may be a Communist or a Whig or a lapsed Muggletonian, but if you are also a mathematician, you will recognize a correct proof when you see one.

—Philip J. Davis, *American Mathematical Monthly*, 79(3):254 (March 1972)

A high-stakes presidential election and accusations of child rape and murder certainly add force and immediacy to a discourse on the nature of truth, but they also distract. I would like to retreat from these incendiary themes, at least for a few paragraphs, and look at the calmer universe of mathematics, where we have well-developed mechanisms for distinguishing between truth and falsehood.

Take the case of angle trisectors—people who claim they can divide an arbitrary angle into equal thirds with the standard Euclidean toolkit of straightedge and compass. In some respects, trisectors are like peddlers of pizza parlor pedophilia, but when a trisector comes before you, you can give a stronger response than: “What you claim is contrary to common sense.” You can offer an absolute refutation: “What you claim is impossible. Pierre Laurent Wantzel proved it 180 years ago.” But I wouldn’t count on the trisector meekly accepting this answer and going away.

A few years ago, writing in *American Scientist*, I made an earnest effort to explain the Wantzel proof in some detail and in plain words, and I provided an English translation of Wantzel’s own paper from 1837. Soon after the article appeared, I began receiving letters festooned with elaborate geometric diagrams, some of them quite pretty, which the authors presented as proper straightedge-and-compass trisections. I wasn’t surprised at this development, but I was at a loss for how to respond. If a mathematical proof fails to persuade the reader of the truth of a mathematical proposition, what other kind of argument could possibly be more effective?

In the past few weeks I’ve given this incident further thought, and I’ve come to see it in a different light. The task of “persuading the reader,” even in mathematics, is not just about truth; it’s also about trust, or rapport, or social solidarity. The quip by Philip Davis that I reproduce above has long been a favorite of mine, but at this point I am tempted to turn it inside out. What I would say is not “If you’re a mathematician, you’ll recognize a proof” but “If you recognize a proof, you’re a mathematician.” The ability and willingness to engage in a certain style of reasoning, and to accept the consequences of that mental process no matter what the outcome, marks you as a member of the mathematical tribe. And, conversely, if you respond to a proof by saying “It may be impossible but I can do it anyway,” then you are not a member of this particular affinity group.

I am *not* arguing here that mathematical truth is some kind of socially determined quantity, and no more fundamental than religious or political doctrines. Quite the contrary, I am one of those stubborn prepostmodernists who believes in a reality that’s not just my private daydream. I’m convinced we all share one universe, where certain things are true and others aren’t, where certain events happened and others didn’t. The interior angles of a plane triangle will always sum to 180 degrees no matter what I say. Nevertheless, the process by which we recognize such truths and reach consensus about them is a social one, and it’s not infallible.

The same essay in which I discussed Wantzel’s proof also mentioned the infamous Monty Hall problem.

In 1990 Marilyn vos Savant, a columnist for *Parade* magazine, discussed a hypothetical situation on the television game show “Let’s Make a Deal,” hosted by Monty Hall. A prize is hidden behind one of three doors. When a contestant chooses door 1, Hall opens door 3, showing that the prize is not there, and offers the player the option of switching to door 2. Vos Savant argued . . . that switching improves the odds from 1/3 to 2/3.

In the following weeks thousands of letter writers berated vos Savant for her blatant error, insisting that the two remaining closed doors were equally likely to conceal the prize. Quite a few of those critics identified themselves as mathematicians or mathematics teachers. Even Paul Erdős took this side of the controversy (although he didn’t write a letter to *Parade*). But of course vos Savant was right all along.
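For readers who would rather test the claim than take anyone’s word for it, here is a minimal simulation sketch in Python. It rests on assumptions that the puzzle statement leaves implicit: the host knows where the prize is, always opens a door that is neither the contestant’s pick nor the prize, and chooses at random when two such doors are available. The function names `play` and `win_rate` are mine, invented for this illustration.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round; return True if the contestant wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: the host never reveals the prize or the contestant's door.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch: bool, trials: int = 100_000, seed: int = 1) -> float:
    """Estimate the probability of winning under a fixed strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Under these assumptions the estimated win rate should settle near 1/3 for staying and near 2/3 for switching. If the host’s behavior is changed, for instance by letting him open a door at random and sometimes reveal the prize, the answer can change too, which may be part of why the puzzle provokes such stubborn disagreement.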

This story was already well known when I told it in my *American Scientist* essay, but I have a reason for retelling it yet again now. Along with the mail from angle trisectors I also received irate messages from Monty Hall deniers, who insisted that the probabilities really are 1/2 and 1/2. But this time it wasn’t professional mathematicians who raised objections; they had long since resolved their differences and settled on the correct answer. Now it was outsiders, dissidents, who attacked what they perceived to be an ignorant, entrenched orthodoxy enforced by the professoriat. In other words, the same two factions continued to fight over the same question, but they had switched positions.

The point I’m making here is the unsurprising one that social factors influence judgment. We are all predisposed to go along with the views of those we know and trust, and we are skeptical, at least initially, of ideas that come from outsiders. We listen more attentively and sympathetically when the speaker is a trusted colleague. The scrawled manuscript from an unknown author claiming a simple proof of the Riemann hypothesis gets a cursory reading or none at all. There’s nothing wrong with making such distinctions. The alternative—equal treatment for the competent and the crackpot—would certainly not help advance the cause of truth. But it has to be acknowledged that these practices further alienate outsiders. By pushing them away and closing off the channel of communication—treating them as irredeemables and deplorables—we diminish the chance that they will ever find a path into the community.

Why do the nations so furiously rage together, and why do the people imagine a vain thing?

—Psalms, 2:1, via George Frideric Handel

Do these skirmishes over minor mathematical questions have anything to do with “Fakebook” news that might have turned the tide of a presidential election? I submit there is a connection. In both cases the nub of the problem is not discovering the truth but persuading people to recognize and own it. The mathematical examples show that even the most irrefutable kind of evidence—a deductive proof—is not always enough to win over skeptics or opponents.

Proof is said to “compel belief”: You embrace the result even against your will. Once you grant the premises, and you work through the chain of implications, accepting the validity of each step in turn, you have no choice but to accept the ultimate conclusion. Or so one might think. But this view of proof as an irresistible engine of reason underestimates the flexibility and creativity of the human mind. In fact we are all capable of believing impossible things before breakfast, and denying certainties after dinner, if we choose to. Mathematicians—members of the tribe—promise not to do so, but that pledge is not binding on anyone else.

When I look back over my various encounters with angle trisectors and other mathematical mavericks, I can’t recall a single instance where I successfully persuaded someone to give up an erroneous belief and accept the truth. Not one soul saved. This record of failure does not give me great confidence when I think of venturing forth to combat fake political news, where we don’t even have the secret weapon of deductive proof.

I’m left with the thought that *compelling* people to acknowledge a truth may be the wrong approach, the wrong attitude. Voltaire was a great hero of free-thinking, but his motto “Écrasez l’infâme!” is a bit too militaristic for my taste. However you choose to translate that phrase, he meant it as a call to arms. Let us crush superstition, wipe out error and ignorance, put an end to fanaticism and irrationality. I’m for all that, but I don’t want to be bludgeoning people into accepting the truth. It doesn’t really change their minds, and at some point they bludgeon you back.

Rather than *force* the people to give up their false notions and vain things, I would let the truth seduce them. Let them fall in love with it. Doesn’t that sound grand? If only I had the slightest clue about how to make it happen.

I like mathematics largely because it is *not* human and has nothing particular to do with this planet or with the whole accidental universe—because, like Spinoza’s God, it won’t love us in return.

At this point my only consolation is a cold and severe one. Trump may be indifferent to truth, but the universe, in the long run, is utterly indifferent to him and his foibles. Our new president can declare that climate change is a hoax, and purge government agencies of all those who disagree, but those acts will not lower the concentration of carbon dioxide in the atmosphere.

Mathematical truths are even more aloof from human interference. In Orwell’s *1984* the Thought Police boast of making citizens believe that two plus two equals five. But all the sophistry of the Ministry of Truth and all the torture chambers of the Ministry of Love cannot alter the equation itself. They cannot *make* two and two equal five.

These are very small islands of certainty in a vast maelstrom of confusion, but they offer refuge, and maybe a place to build from.
