Driving the dreamboat

17 August 2011

RCA electronic car of tomorrow ad

Slide behind the wheel of this dreamboat. Push the electronic control button. Then sit back and let transistors take over.

There’s something curiously tentative about this vision of the future of motoring, as seen from 1964. You’re invited to push the button and let the transistors take over. But you’ve still got your hands on the wheel; apparently you’re still responsible for driving the dreamboat.

Other early discussions of automatic automobiles are also fuzzy about exactly who or what is in charge. A notable example is the General Motors Futurama exhibit at the 1939 Worlds Fair in New York. “Safe distance between cars is maintained by automatic radio control,” intones the narrator, above creepy organ music. This certainly suggests something other than seat-of-the-pants driving. But the next sentence narrows the scope of that automatic control: “Curved sides assist the driver in keeping his car within the proper lane under all circumstances.” Thus the technology is merely assistive, not autonomous. And what’s that about “curved sides”? Norman Bel Geddes, the designer of the exhibit, explains all in Magic Motorways, published in 1940. It’s very low-tech. Freeway lanes are to be separated by high curbs of concave cross-section, which deflect a straying car back into its lane. (Later in the book Bel Geddes also discusses more elaborate guidance systems, involving buried conductors.)

The reprise of Futurama at the 1964 World’s Fair—an exhibit that I attended, along with 29 million other people—was even vaguer about the question of autonomous vehicles. We saw lots of miniature automobiles moving in close order along gleaming freeways, and personally I came away with the impression that all those vehicles were under computer control. But the transcript of the narration includes only a single sentence on the topic, and it’s open to almost any interpretation: “Vehicles electronically paced, travel routes remarkably safe, swift and efficient.”

Why so coy about the prospect of cars that would drive themselves without human intervention? Maybe the concept was just too outlandish for credibility, particularly in 1939. Or maybe GM recognized that their natural audience is made up of car enthusiasts, who want to drive their dreamboats, not just be carried along as electronically paced, radio-controlled passengers.

In any case, the coyness has now evaporated, and these days everybody is talking about truly autonomous vehicles. DARPA runs contests for them; an Italian group has driven them across Europe and Asia; Google has a “secret” fleet of them. And I too am talking about autonomous vehicles: “Leave the Driving to It” is my latest American Scientist column.

Note: The artwork above is from an RCA advertisement in the September 1964 issue of Scientific American. Stylistically, the painting owes something to the Futurama exhibits, but I’d like to make a wild guess that the (uncredited) artist who created this rendering lived in Minneapolis. That brightly lighted, colonnaded building to the right of center looks to me very much like a building at Hennepin and Washington (now owned by ING) that was completed in 1964, just as this ad appeared. The architect was Minoru Yamasaki, the designer of the World Trade Center.

Probabilities of probabilities

16 August 2011

In simple games, one can calculate the exact probability of every outcome, and so the expected winnings can also be determined exactly….

When I wrote the words above in a recent American Scientist column, I didn’t think I was saying anything controversial. So I was taken by surprise when a reader objected to the very notion of exact probability. My correspondent argued that probabilities can never be determined exactly, only measured to within some error bound. He also ruled out the use of limits in defining probabilities, allowing only finite processes. After a brief correspondence with my critic, I set the matter aside; but it keeps coming back to haunt me in idle moments, and I think I should try to clarify my thoughts.

The “simple games” I had in mind were those based on coin-flipping or dice-rolling or card-dealing. For example, I would say that the exact probability of getting heads when you flip a fair coin is 1/2. (Indeed, that’s how I would define a fair coin.) Likewise the exact probability of rolling a 12 with a pair of unbiased dice is 1/36, and the exact probability of dealing four aces from a properly shuffled deck is 1/270,725. The arithmetic behind these numbers is straightforward. You’ll find a multitude of similar examples in the exercises of any introductory textbook on probability. The trouble is, if you ask me to show you a fair coin or an unbiased die or a properly shuffled deck, I can’t do it. I certainly can’t prove that any specific coin or die or deck has the properties claimed. So my critic has a point: Those exact probabilities I was talking about come from a suppositional world of ideal randomizing devices that don’t exist—or can’t be shown to exist—in the physical universe.

Of course mathematics is full of objects that we can’t construct out of Lego bricks and duct tape—dimensionless points, the real number line, Hilbert space. Just as I can’t exhibit a fair coin, I can’t show you an equilateral triangle—or rather I can’t draw you a triangle and then prove that the three sides of that particular triangle are equal. The existence of perfect circles and right angles and other such Platonic apparatus is something we just have to accept if we want to do a certain kind of mathematics. I’m happy to view the fair coin in this light—as a hypothetical device that comes out of the same cabinet where I keep my Turing machines and my Cantor sets. The question is whether we can (or should) take the more radical step of banishing such idealized paraphernalia altogether.

Suppose we insist that the only way of determining a probability is to measure it by experiment. Take a coin out of your pocket and flip it 100 times; if it comes up heads 55 times, then p(H) = 0.55. But when you repeat the experiment, you might get 51 heads, suggesting p(H) = 0.51. Or maybe you would now conclude that p(H) = 0.53. Or you might adopt some other method of inferring a probability from the experimental evidence, something more sophisticated than just taking the mean of a sample. There’s an elaborate and highly developed technology for just this purpose. It’s called statistics, and it works wonders. Still, to the extent that the assigned probability depends on the outcome of a finite number of trials (and remember: taking limits is out of bounds), we’re never going to settle on a single, definite and immutable value for the probability.

Alan Hájek, in an article in the Stanford Encyclopedia of Philosophy, calls this approach to probability “finite frequentism.”

Where the classical interpretation [i.e., the probability theory of Laplace, Pascal, et al.] counted all the possible outcomes of a given experiment, finite frequentism counts actual outcomes. It is thus congenial to those with empiricist scruples. It was developed by Venn (1876), who in his discussion of the proportion of births of males and females, concludes: “probability is nothing but that proportion.”

Hájek calls attention to several worrisome aspects of this doctrine. Most of the problems take us right back to where this discussion began, namely to the fact that we can never learn the exact probability of anything. And this ignorance leads to awkward consequences. In the standard calculus of probabilities, we know that if an event occurs with probability p, then two independent occurrences of the same event have probability p2. How do we apply this rule in an environment where p changes every time we measure it? I suppose a stalwart empiricist might reply that if you want to know p(HH), then that’s what you should be measuring. The empiricist might also point out that the loss of the calculus is not a defect of the theory; it’s just the human condition. We really are ignorant of exact probabilities, and we’ll only get into trouble if we pretend otherwise.

We could adopt a new calculus based on probabilities of probabilities, in which p(H) is not a number but a distribution—maybe a normal distribution determined by the mean and variance of the experimental data. Then p2 becomes the product of two such distributions. But beware: The shape of that normal curve we’ve just smuggled into our reasoning is defined by a process that involves the moral equivalent of flipping a fair coin, not to mention taking limits. Maybe we should instead use a discrete, experimental approximation to the normal distribution, created with a physical device such as a Galton board.

the Hexstat in action

•     •     •

Reading on in Hájek’s long encyclopedia article, I find that the finite frequentists are not the only school of thought subject to withering criticism; Hájek finds deep flaws in every interpretation of probability theory, without exception. (That’s a philosopher’s job, I guess.)

Joseph Doob, writing 15 years ago in the American Mathematical Monthly, took the position that we have a perfectly sound mathematical theory of probability (formulated mainly by Kolmogorov in the 1930s, and founded on measure theory); the only problem is that it doesn’t connect very well with the world of everyday experience. I would like to quote at length what Doob has to say about the law of large numbers:

In a repetitive scheme of independent trials, such as coin tossing, what strikes one at once is what has been christened the law of large numbers. In the simple context of coin tossing it states that in some sense the number of heads in n tosses divided by n has limit 1/2 as the number of tosses increases. The key words here are in some sense. If the law of large numbers is a mathematical theorem, that is, if there is a mathematical model for coin tossing, in which the law of large numbers is formulated as a mathematical theorem, either the theorem is true in one of the various mathematical limit concepts or it is not. On the other hand, if the law of large numbers is to be stated in a real world nonmathematical context, it is not at all clear that the limit concept can be formulated in a reasonable way. The most obvious difficulty is that in the real world only finitely many experiments can be performed in finite time. Anyone who tries to explain to students what happens when a coin is tossed mumbles words like in the long run, tends, seems to cluster near, and so on, in a desperate attempt to give form to a cloudy concept. Yet the fact is that anyone tossing a coin observes that for a modest number of coin tosses the number of heads in n tosses divided by n seems to be getting closer to 1/2 as n increases. The simplest solution, adopted by a prominent Bayesian statistician, is the vacuous one: never discuss what happens when a coin is tossed. A more common equally satisfactory solution is to leave fuzzy the question of whether the context under discussion is or is not mathematics. Perhaps the fact that the assertion is called a law is an example of this fuzziness.

Note that the coins Doob is tossing seem to be drawn from the Platonic closet of ideal hardware.

None of my reading and pondering has made me a convert to finite frequentism, but at the same time I am grateful for this challenge to my easygoing confidence that I know what probabilities are and how to calculate with them. I surely know nothing of the sort. And on balance I think it might be a good idea if introductory accounts of probability theory put a little more emphasis on calculating from real-world data, with less reliance on fair coins and unbiased dice. In this connection I can recommend the probability section of Joseph Mazur’s diverting book Euclid in the Rainforest.

Four miscellaneous related notes, in lieu of a conclusion:

(1) One might guess that any discrepancies between real coins and ideal ones would amount to only minor biases (unless you’re wagering with Persi Diaconis). Perhaps so, but consider what happened a century ago when several eminent statisticians tried large-scale experiments in generating random numbers with dice, playing cards and numbered slips of paper drawn from a bowl or a bag. Not one of those efforts produced results that passed statistical tests of randomness (including the predictions of the law of large numbers). As late as 1955, even the big Rand Corporation table of a million random digits (generated by a custom-made electronic device) had to be fudged a little after the fact.

(2) Hájek’s indictment of finite frequentism includes this charge: The scheme “rules out irrational probabilities; yet our best physical theories say otherwise.” Elsewhere in the essay he elaborates on this point, mentioning that quantum mechanics posits “irrational probabilities such as 1/√2.” At first I thought this a very acute criticism, but now I’m not so sure. In the quantum contexts most familiar to me, 1/√2 is a commonly encountered amplitude; the corresponding probability is |1/√2|2 = 1/2. Is it true that quantum mechanics necessarily requires irrational probabilities?

(3) Mark Kac, writing on probability in Scientific American (September 1964, p. 96) asks why probability theory was such a late-blooming flower among the branches of mathematics. It was neglected through most of the 18th and 19th centuries. Kac offers this explanation:

Why this apathy toward the subject among professional mathematicians? There were various reasons. The main one was the feeling that the entire theory seemed to be built on loose and nonrigorous foundations. Laplace’s definition of probability, for instance, is based on the assumption that all the possible outcomes in question are equally likely; since this notion itself is a statement of probability, the definition appears to be a circular one.

I’m skeptical of this hypothesis. It’s surely true that probability had shaky foundations, but so did other areas of mathematics. In particular, analysis was a ramshackle mess from the time of Newton and Leibniz until 1951, when Tom Lehrer finally gave the world the epsilon-delta notation. Yet analysis was the height of fashion all through that period.

(4) If you adhere strictly to the finite-frequentist doctrine that a probability does not exist until you measure it, can you ever be surprised by anything?

A slight discrepancy

23 June 2011

patio table with a pattern of embedded raindrops

The image above shows the mesh top of a patio table, photographed after a soaking rain. Some of the openings in the mesh retain drops of water. What can we say about the distribution of those drops? Are they sprinkled randomly over the surface? The rainfall process that deposited them certainly seems random enough, but to my eye the pattern of occupied sites in the mesh looks suspiciously even and uniform.

For ease of analysis I have isolated a square piece of the tabletop image (avoiding the central umbrella hole), and extracted the coordinates of all the drops within the square. There are 394 drops, which I plot below as blue dots:

positions of 394 raindrops on a tabletop

Again: Does this pattern look like the outcome of a random process?

I come to this question in the aftermath of writing an American Scientist column that explores two varieties of simulated randomness: pseudo and quasi. Pseudorandomness needs no introduction here. A pseudorandom algorithm for selecting points in a square tries to ensure that all points have the same probability of being chosen and that all the choices are independent of one another. Here’s an array of 394 pseudorandom dots constrained to lie on a skewed and rotated lattice somewhat like that of the mesh tabletop:

394 pseudorandom dots on a skewed 60-by-60 grid

Quasirandomness is a less familiar concept. In selecting quasirandom points the aim is not equiprobability or independence but rather equidistribution: spraying the points as uniformly as possible across the area of the square. Just how to achieve this aim is not obvious. For the 394 quasirandom dots shown below, the x coordinates form a simple arithmetic progression, but the y coordinates are permuted by a digit-twiddling process. (The underlying algorithm was invented in the 1930s by the Dutch mathematician J. G. van der Corput, who worked with one-dimensional sequences, and extended to two dimensions in the 1950s by K. F. Roth. For more details, see my American Scientist column, page 286, or the splendid book by Jiri Matousek.)

394 dots scattered over a square by the Vandercorput algorithm

Which of these point sets, the pseudo or the quasi, is a better match for the raindrops? Here are the three patterns in miniature, placed side-by-side as an aid to comparison:

pseudorandom, quasirandom and raindrop patterns

Each of the panels has a distinctive texture. The pseudorandom pattern has both tight clusters and large voids. The quasirandom dots are more evenly spaced (though there are several close pairs of points), but they also form distinctive, small-scale repetitive motifs, most notably a hexagonal structure that repeats with not-quite-crystalline regularity. As for the raindrops, they appear to be spread over the area at almost uniform density (in this respect resembling the quasirandom pattern), but the texture shows hints of swirly rather than latticelike structures (more like the pseudorandom example).

Rather than just eyeballing the patterns, we could try a quantitative approach to describing or classifying them. There are lots of tools for this purpose—radial distribution functions, nearest-neighbor statistics, Fourier methods—but my main motive for bringing up this question in the first place is to play with a new toy that I learned about in the course of reading up on quasirandomness. It is a quantity called discrepancy, which attempts to measure how much a point set departs from a uniform spatial distribution.

There are lots of variations on the concept of discrepancy, but I’m going to discuss just one measurement scheme. The idea is to superimpose rectangles of various shapes and sizes on the pattern, allowing only rectangles with sides parallel to the x and y axes. Here are three example rectangles drawn on the raindrop pattern:

raindrop pattern with three axis-parallel rectangles

Now count the number of dots enclosed by each rectangle, and compare it with the number of dots that would be enclosed—based on the rectangle’s area—if the distribution of dots were perfectly uniform throughout the square. The absolute value of the difference is the discrepancy D(R) associated with rectangle R:

D(R) = | N · area(R) – #(PR) |,

where N is the total number of dots and #(PR) denotes the number of dots P in rectangle R. For example, the rectangle at the upper left covers 10 percent of the area of the square, so its fair share of dots would be 0.1 × 394 = 39.4 dots. The rectangle actually encloses only 37 dots, and so the discrepancy associated with the rectangle is |39.4 — 37| = 2.4. Note that the density of dots in the rectangle could be either higher or lower than the overall average; in either case the absolute-value operation would give a positive discrepancy.

For the pattern as a whole, we can define the global discrepancy D as the worst-case value of this measure, or in other words the maximum discrepancy taken over all possible rectangles drawn in the square. Van der Corput asked whether point sets could be constructed with arbitrarily low discrepancy, so that D would always remain below some fixed bound as N goes to infinity. The answer is No, at least in one and two dimensions; the minimal growth rate is O(log N). Pseudorandom patterns generally have even higher discrepancy, O(√N).

How can you find the rectangle that has maximum discrepancy for a given point set? When I first read the definition of discrepancy, I thought it would be impossible to compute exactly, because there are infinitely many rectangles to be considered. But after thinking about it a while, I realized there are only finitely many candidate rectangles that might possibly maximize the discrepancy. They are the rectangles in which each of the four sides passes through at least one dot. (Exception: Candidate rectangles can also have sides lying on the boundary of the enclosing square.)

Suppose we encounter the following rectangle:

reactangle with three sides anchored by points

The left, top and bottom sides each pass through a dot, but the right side lies in “empty space.” This configuration cannot be the rectangle of maximum discrepancy. We could push the right edge leftward until it just intersects a dot:

ractangle reshaped to maximize density of points

This change in shape reduces the area of the rectangle without altering the number of dots enclosed, and thus increases the density of dots. Alternatively, we could push the edge the other way:

rectangle stretched to maximize area

Now we have increased the area, again without changing the count of enclosed dots, and thereby lowered the density. At least one of these actions must increase D(R), compared with the initial configuration. Thus every rectangle with all four sides touching dots or the edges of the square is a local maximum of the discrepancy function; by enumerating all rectangles in this finite set, we can find the global maximum.

There is also the irksome question of whether the rectangle is to be considered “closed” (meaning that points on the boundary are included in the area) or “open” (boundary points are excluded). I’ve sidestepped that problem by tabulating results for both open and closed boundaries. The closed form gives the highest discrepancy for densely populated rectangles, and the open form maximizes discrepancy for sparsely populated rectangles.

By considering only extrema of the discrepancy function, we make the counting problem finite—but not easy! In a square with N dots (all with distinct x and y coordinates), how many rectangles have to be considered? This turns out to be a really sweet little problem, with a simple combinatorial solution. I’m not going to reveal the answer here, but if you don’t feel like working it out for yourself, you could look it up in the OEIS or see a short paper by Benjamin, Quinn and Wurtz. What I will say is that for N = 394, the number of rectangles is 6,055,174,225—or double that if you count open and closed rectangles separately. For each rectangle, it’s necessary to figure out how many of the 394 points are inside and how many are outside. Pretty big job.

One way to reduce the computational burden is to retreat to a simpler measure of discrepancy. Much of the literature on quasirandom patterns in two or more dimensions uses a quantity called star discrepancy, or D*. The idea is to consider only rectangles “anchored” at the lower left corner of the square (which we can conveniently identify with the origin of the xy plane). In this case the number of rectangles is just N2, or about 150,000 for N = 394.

Here is the rectangle that defines the global star discrepancy of the raindrop pattern:

maximal star-discrepancy rectangle for the raindrop pattern

The dark green rectangle at the bottom covers about 55 percent of the square and ought to encompass 216.9 dots, if the distribution were truly uniform. The actual number of dots included (assuming a “closed” rectangle) is 238, for a discrepancy of 21.1. No other rectangle anchored to the corner (0,0) has a greater discrepancy. (Note: Because of limited graphic resolution, the D* rectangle appears to extend horizontally all the way across the unit square; in fact the right edge lies at x = 0.999395.)

What does this result tell us about the nature of the raindrop pattern? Well, for starters, the discrepancy is fairly close to √N (which is 19.8 for N = 394) and not particularly close to log N (which is 6.0 for natural logs). Thus we get no support for the idea that the raindrop pattern might be more quasirandom than pseudorandom. The D* values for the other patterns shown above are in the ranges expected: 25.9 for the pseudorandom and 4.4 for the quasirandom. Contrary to the visual impression, the raindrop distribution seems to have more in common with a pseudorandom point set than a quasirandom one—at least by the D* criterion.

What about calculating the unrestricted discrepancy D—that is, looking at all rectangles rather than just the anchored rectangles of D*? A moment’s thought shows that this exhaustive enumeration of rectangles can’t change the basic conclusion in this case; D can never be less than D*, and so we can’t hope to move from √N toward log N. But I was curious about the computation anyway. Among those six billion rectangles, which one has the greatest discrepancy? Is it possible to answer this question without heroic efforts?

The obvious, straightforward algorithm for D generates all candidate rectangles in turn, measures their area, counts the dots inside, and keeps track of the extreme discrepancies seen along the way. I found that a program implementing this algorithm could determine the exact discrepancy of 100 pseudorandom or quasirandom dots in a few minutes. This outcome might seem to offer some encouragement for pushing on to higher N; however, the running time almost doubles every time N increases by 10, which suggests the computation would take a couple of centuries at N = 394.

I’ve invested a few days’ work in efforts to speed up this calculation. Most of the running time is spent in the routine that counts the dots in each rectangle. Deciding whether a given dot is inside, outside or on the boundary takes eight arithmetic comparisons; thus, at N = 394, there are more than 3,000 comparisons for each of the six billion rectangles. The most effective time-saving device I’ve discovered is to precompute all the comparisons. For each point that can become the lower left corner of a rectangle, I precompute a list of all the pattern dots above and to the right. For each potential upper right corner of a rectangle, I compile a similar list of dots below and to the left. These lists are stored in a hash table indexed by the corner coordinates. Given a rectangle, I can then count the number of interior dots just by taking the set intersection of the two lists.

With this trick, the estimated running time for N = 394 came down from two centuries to about two weeks. A big improvement—just enough encouragement to induce me to spend yet another day on further refinements. Replacing the hash table with an N × N array helped a little. And then I figured out a way to ignore all the smallest rectangles, those that cannot possibly be the max-D rectangle because they either contain too few dots or have too small an area. This improvement finally brought the running time down to the overnight range. The illustration below, which shows the rectangle of maximum discrepancy D for the raindrop pattern, took six hours to compute.

rectangle yielding the largest exact discrepancy for the raindrop pattern

The max-D rectangle is clearly a slight refinement of the D* rectangle for the same point set. The D rectangle “should” contain 204.3 dots but actually has 229, for a discrepancy of D = 24.7.

Of course knowing that the exact discrepancy is 24.7 rather than 21.1 tells us nothing more about the nature of the raindrop pattern. As a matter of fact, I feel I know less and less about it as I compute more and more.

When I started this project, I had a pet theory about what might be happening in the tabletop to smooth out the distribution of drops and thereby make the pattern look more quasi- than pseudorandom. To begin with, think of droplets lying on a smooth pane of glass rather than a metal mesh. If two small droplets come close enough to touch, they merge into one larger drop, because that configuration has lower energy associated with surface tension. Perhaps something similar could happen in the mesh. If two adjacent cells of the mesh are both filled with water, and if the metal channel between them is wet, then water could flow freely from one drop to the other. The larger drop will almost surely grow at the expense of the smaller one, and the latter will eventually be annihilated. Thus we would expect a deficit of drops in adjacent cells, compared with a purely random distribution.

This idea still sounds pretty good to me. The only trouble is: It explains a phenomenon that may not exist. I don’t know whether my discrepancy measurements actually reveal anything important about the three patterns, but at the very least the measurements fail to provide evidence that the raindrop distribution is different from the pseudorandom distribution. True, the patterns look different, but how much should we trust our perceptual apparatus in a case like this? If you ask people to draw dots at random, most do a bad job of it, typically making the distribution too smooth and even. Maybe the brain is equally challenged when trying to recognize randomness.

Nevertheless, I suspect there is some mathematical property that will more effectively distinguish between these patterns. If anyone else would like to sink some time into the search, the xy coordinates for the three point sets are in a text file here.

Update 2011-06-24: This is just a brief note to suggest that if you’ve read this far, please go on and read the comments, too. There’s much of value there. I want to thank all those who took the trouble to propose alternative explanations or algorithms, and to point out weaknesses in my analysis. Special thanks to themathsgeek, who within 40 minutes after I first posted the item had come up with a far superior program for computing discrepancies. Also Iain, who pursued my offhand remarks about the perception of random patterns with actual experiments.

Don’t try to read this proof!

7 June 2011

On the subject of the Collatz conjecture (also known as the 3x+1 problem), Paul Erdos remarked: “Mathematics is not yet ready for such problems.” Shizuo Kakutani joked that the problem was a Cold War invention of the Russians meant to slow the progress of mathematics in the West. Richard Guy listed it in an article titled “Don’t try to solve these problems!” All of these warnings to the unwary have had the expected effect: A bibliography compiled by Jeffrey Lagarias cites 200 works on the Collatz conjecture, and there are hundreds of other papers that didn’t make the list. This past week brought one more preprint: a claimed proof of the conjecture by Gerhard Opfer of the University of Hamburg.

The Collatz conjecture can be stated in terms of this little recursive program:

    procedure collatz(x)
        if x=1 then halt
        elseif even(x) then collatz(x/2)
        elseif odd(x) then collatz(3*x+1);

The conjecture makes the following claim: If x is any positive integer, then the program eventually halts. For example, starting with x=3, the successive values of x are 3, 10, 5, 16, 8, 4, 2, 1—and on reaching x=1 the program halts. You could try a few other starting values for yourself; x=27 is a popular choice; x=319,804,831 is even better. But please remember Guy’s advice. Also note that the Collatz conjecture has been verified numerically by Tomás Oliveira e Silva for all x up to 20 × 258 . Thus if you’re searching for a counterexample, you may as well start somewhere north of 5,764,607,523,034,234,880.

The conjecture is named for Lothar Collatz (1910–1990), who investigated the curious properties of the 3x+1 iteration when he was a young student, circa 1930. Opfer, the author of the new proposed proof, was a Ph.D. student of Collatz in the 1960s. This conjunction suggests a novelistic storyline: A mathematician struggles all his life with an intractable problem, then hands it on to his student, who dedicates his own career to the task, finally achieving triumphant success some 80 years after the story began. But apparently that’s not how it happened. Collatz never returned to the 3x+1 problem after the 1930s; almost all of his mature work was in numerical analysis. Opfer is also a numerical analyst and only recently turned to the 3x+1 problem. There’s no grand saga of a multigenerational obsession here.

Much of Opfer’s paper is beyond my understanding, but I can piece together a crude guide to a few of the basic ideas. The proof begins with a mathematical change of venue. A problem originally posed in terms of the simplest kind of arithmetic on the natural numbers is transported to the realm of holomorphic functions on the complex plane. It’s like being swept up from a farm in Kansas and set down in the Land of Oz. The idea for this transformation comes from the work of Lothar Berg and Günter Meinardus in the 1990s, who established a direct connection between these two realms. They set up a certain system of equations for functions of a complex variable z, then showed that the Collatz conjecture is true if and only if the solutions of the equations all lie in a certain region. In the simplest case, the region is the open unit disc—the disc centered at the origin with |z| < 1. Who’d've thunk it? What we do in Oz has consequences back in Kansas!

How can those two very different problems be yoked together? As I (tenuously) understand it, the complex functions of Berg and Meinardus can be represented by formal power series whose integer coefficients encode information about the sequence of numbers generated in the Collatz iteration. The rest of the Opfer proof is all about those coefficients. Thus we click our heels and return from the Emerald City to the land of number theory.

At this point Opfer begins reasoning in terms of algorithms that generate sets of coefficients. I’m on much friendlier terms with algorithms than I am with holomorphic functions, but strangely enough this is where I begin to lose my footing as I try to follow Opfer’s steps. His aim is to show that all coefficients vanish except for those whose indices lie in a certain congruence class. (Specifically, he is seeking to prove that all coefficients ηj = 0 except those for which j = 3k–1, k ∈ 1, 2, 3, … .) He argues that his algorithms for generating the coefficients yield exactly this property. But this is where I get lost.

Opfer is not the first to announce a proof of the Collatz conjecture. The earlier attempts did not stand up to scrutiny, and in various corners of nerdom there’s already skepticism about this try, too. As for me, I can’t say whether there are gaps in Opfer’s proof because there are such wide gaps in my understanding of it. The paper has been submitted to Mathematics of Computation, which is certainly an appropriate journal for this work, and I’m content to wait and see what comes of the referreeing process.

Like many others, I first learned of the 3x+1 problem from Martin Gardner’s Scientific American column in the early 1970s. A decade later, when I briefly had a chance to fill Martin’s space in the magazine after his retirement, 3x+1 was one of the first subjects I wrote about (nearly illegible PDF—sorry; it’s my only copy). I’m delighted to have this opportunity to poke at the problem again. Even if the Opfer proof comes to nothing, it has given me an incentive to read some of the recent literature, including that illuminating trip to Oz courtesy of Berg and Meinardus. I would also like to recommend a book by Günther J. Wirsching, The Dynamical System Generated by the 3n + 1 Function (Springer, 1998). A version of at least one chapter is available online.

Jeffrey Lagarias also has a recent book: The Ultimate Challenge: The 3x+1 Problem (AMS, 2010), but I haven’t seen it yet. The volume reprints a number of survey articles and historical documents, including a 1985 American Mathematical Monthly article by Lagarias himself that is still the best starting point for those who want to ignore all good advice and dive into the problem. I am particularly eager to see another chapter: an English translation of the only paper in which Collatz discusses his work on 3x+1; it is an account in Chinese of a talk by Collatz at Qufu Normal University in 1986.

Only correlate!

28 May 2011

I’m not actually a shill for Google Labs, although it may seem that way from all my recent (and ongoing) attention to the Google Ngram Viewer: four posts (1, 2, 3, 4) and an American Scientist column, so far. What I particularly like about Google Labs is that they share their toys. They create Big Data projects that everybody can play with. For those of us without a server farm on the back 40, that’s a rare opportunity.

The latest Labs release is Google Correlate. If you have a time series—data expressed as a function of date, for any subinterval of the period since 2003—Correlate will try to identify Google search queries that exhibit a similar temporal pattern of activity. All this is easier to understand with an example. For a specimen time series, consider the interest-rate index known as the 1-year CMT, which is published every week. I scraped seven years of CMT data from this web site, and uploaded the file to Correlate. I got back a list of 100 phrases whose popularity as Google search terms has followed a trajectory more or less similar to that of the interest rates. As it happens, none of those highly correlated terms has an obvious connection to financial affairs. Roughly half of them are related to cell phones (”cingular” and “treo” turn up over and over). But the term with the strongest correlation (r=0.9751) is the phrase “pill identification”:

graph of time-series correlation between 1-year CMT interest rate data and Google searches for 'pill identitification'

In other words, the gradual rise in interest rates during the early 2000s was paralleled by a steady growth in the number of people seeking help in identifying the contents of mysterious unlabeled vials in the medicine cabinet. Then, sometime in 2007, both trends reversed direction. Why should these particular variables be so closely correlated? If there is a reason, I have no idea what it is. And I must immediately insert the obligatory disclaimer: Correlation is not causation. Emphatically so in this case. If you are trying to predict the future course of interest rates, I do not recommend tracking popular interest in pill identification. Or vice versa.

At a more personal level, there’s a time series I have been tracking since 2007: the volume of spam arriving in my email inbox. My records are monthly, whereas Google Correlate wants weekly data, so I did some resampling and smoothing, and came up with this:

graph of correlation between Brian Hayes's spam receipts and the Google query 'ashford blackboard login'

The best match, shown in the graph, is the mildly enigmatic query “ashford blackboard login.” Many of the other correlated series suggest a seasonal theme that I can understand in retrospect but that I did not see coming before looking at the results: “honda accord 2009,” “celica 2009,” “rav4 2009,” “2009 altima coupe,” “new cars 2009,” “2009 ranger,” etc. The most distinctive features of the spam curve are a peak in the fall of 2008, a deep dip the following winter, and an even stronger surge in the summer of 2009. Evidently shoppers for cars in the 2009 model year followed a similar trend line. (But again I would caution that spam volume is unlikely to be a good predictor of automobile sales.)

These results might be taken to suggest that every conceivable time series must be correlated with some set of Google queries, however farfetched the association. I tried submitting a few random walks, covering the same time span as the spam series, and they too fetched up matching queries from the Google database:

correlation graph for a random walk and the query 'att tilt software'

At the opposite end of the spectrum from a random walk, I tried some rigidly artificial probes, such as a series with nonzero entries only in the month of May. Sure enough, there are search-engine queries that follow the same recurrent annual pattern:

correlation of a time series with nonzero entries on in the month of May and the Google query 'j labs'

A time series that has all of its energy concentrated in a single pulse elicits from the database a variety of flash-in-the-pan topics—queries that came and went and were never heard of again.

correlation of a time series with a pulse in September 2005 and the query 'wolframtones'

Without too much work we could enumerate all such one-month wonders.

It is not the case, however, that every possible time series has a close correlate somewhere in the Google collection. Here is an example of a series for which Correlate finds no query that matches closely enough to bother reporting:

weekly driving mileage, late 2003 to late 2010

This is a weekly record of miles driven in the family car. Should we be surprised that not a single series among the tens of millions of queries in the Google database comes close to matching this pattern? One approach to this question is to ask just how many series of this kind might exist. The mileage record covers 364 weeks. As a lower bound, suppose the mileage associated with each week could have just two possible values: either we drove the car or we didn’t, so the mileage is either zero or greater than zero. Then there are 2364 (or about 10110) possible time series—many orders of magnitude greater than the total number of Google searches since the company was founded. Thus the set of queries in the Google archive must be an extremely sparse subset of all possible time series. Most of the series we could construct would necessarily come up empty. (I note in passing that there’s interesting structure in that mileage log of mine, which I never knew about until I graphed it—but that’s a story for another day.)

A really interesting question is how Google Correlate does it. Even with “only” tens of millions of queries in the database, comparing a submitted series with all the candidates would be impossibly expensive. A white paper explains:

In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.

Thus the basic strategy is precomputation: Spend a lot of time in advance computing a succinct signature or hash associated with each time series in the database; then quickly compare hash values when looking to match a submitted time series.

A few further miscellaneous notes:

Google Correlate evolved from earlier work on tracking influenza outbreaks by monitoring search-engine queries. Initially this required a batch computation lasting hours, even when run on hundreds of computers. The new hash-based search takes less than a second. (Algorithms and data structures still count for more than hardware.)

Google Correlate includes a geographic component alongside the temporal database. If you have data distributed over the 50 U.S. states, you can retrieve Google queries that exhibit a similar spatial pattern. (I have not experimented with this system.)

Even if you don’t have a time series or a geographic data set of your own, you can play with the new service by cross-correlating one search query against others. For example, enter the term “solstice” in the search box, and you’ll see a graph with exactly the pattern of twice-a-year spikes that you might expect. You also get a list of other search terms whose temporal pattern has similar features. One of those correlated terms is “italian seafood salad.” A glance at the corresponding graph suggests there’s only half a correlation in this case:

correlation of 'solstice' and 'italian seafood salad'

I didn’t know until just a few minutes ago that frutti di mare was a dish to be eaten at the winter solstice.

Oh, the places I’ve been!

11 May 2011

When I heard the rumors that my iPhone was tracking my movements and keeping a log of location data, I was annoyed. Now that Apple has fixed this bug/feature, I’m even more annoyed. There’s just no pleasing me.

iPhone location map for a trip through north California and southern Oregon

The fuss began three weeks ago, when Alasdair Allan and Pete Warden announced at a conference that iPhones record a location fix every few seconds, based on position with respect to cell-phone towers and wifi access points. The log file is saved on the phone, they said, and also transferred to any computer the phone syncs with, in the form of an SQLite database named “consolidated.db.” Allan and Warden wrote a handy Macintosh application, iPhoneTracker, that ex­tracts the geo­graphical markers and displays them on a map. At left is the record of a trip I took last summer through northern California and southern Oregon, as traced by cell phone towers along the way. There are lots of mysteries and spurious details among those dots, but the broad outline of the route is displayed clearly enough: a counterclockwise loop up I-5, across the Cascades to Coos Bay, and back down to San Francisco on U.S. 101.

Here’s another map-pin travel diary, a memento of a brief visit to Pittsburgh for a meeting at Carnegie Mellon:

Pittsburgh

In this case the dots represent wifi hotspots rather than cell-phone towers. Lots of hotspots! They look like a swarm of bees. The cluster at lower left is the CMU campus, where the meeting was held. Moving northeast, the nearest clump of dots is the vicinity of my hotel. The rest of the dots, further north and east, trace the routes of a couple of walks I took, out looking for dinner. Curiously, a third long walk doesn’t show up at all, even though I had the phone with me, and indeed used the Maps app to get unlost a couple of times.

(I should mention that the version of iPhoneTracker distributed by Allan and Warden does not show wifi locations, and it plots cell towers on a rather coarse grid. But the software is open-source, and those limitations are undone with a couple of easy edits.)

Allan and Warden’s discovery of the iPhone location database wasn’t exactly new. Alex Levinson reported some months ago on an earlier version of the location log. And last July Apple explained its “location services” privacy policies in considerable detail in response to an inquiry from two Senators. No one took much notice of those earlier reports, but the new one caused a ruckus. It soon emerged that Android devices are collecting similar information and sharing it with Google. Yesterday, both Apple and Google were grilled in Congress.

The ruckus has mostly been about quaint 20th-century notions like personal privacy. I have my own worries on that score, but what irks me most is not that my phone is storing this information but that Apple gives me no access to it. If I’m going to help them build an immense database of cell towers and wifi beacons, then surely I should at least be able to retrieve and display my own coordinates, no?

The Allan and Warden program fills this need to some extent. And last week the New York Times bits blog announced a cloud-based approach that might be even better. They are inviting iPhone users to upload their location information to a service called OpenPaths, where you can build animated maps of your own peregrinations and, perhaps, if you choose, share the data for research purposes.

There’s just one problem. Apple’s response to the invasion-of-privacy complaints was to issue an operating system update that will make it even harder—probably impossible—for me to get access to my own data. After I install the update, my phone will not stop collecting geographic information, nor will it stop reporting location fixes to Cupertino, but it will encrypt the file so that I can’t read it. Maximally annoying. Geolocation wthout representation.

As far as I can tell, Apple is telling the truth about the nature and source of the information in consolidated.db. When the story first broke, I assumed—along with many others—that the database was recording the cell sites and wifi networks that my phone detected as I wandered around, carrying the device in my pocket. In other words, the database was a local copy of a location log that was also, presumably, being uploaded to Apple. An Apple press release from April 27 insists that I had it backwards. This is not information gathered by my phone. Instead it is a “crowd-sourced database” downloaded from Apple to my phone.

3. Why is my iPhone logging my location?

The iPhone is not logging your location. Rather, it’s maintaining a database of Wi-Fi hotspots and cell towers around your current location, some of which may be located more than one hundred miles away from your iPhone, to help your iPhone rapidly and accurately calculate its location when requested….

4. Is this crowd-sourced database stored on the iPhone?

The entire crowd-sourced database is too big to store on an iPhone, so we download an appropriate subset (cache) onto each iPhone….

6. People have identified up to a year’s worth of location data being stored on the iPhone. Why does my iPhone need so much data in order to assist it in finding my location today?

This data is not the iPhone’s location data—it is a subset (cache) of the crowd-sourced Wi-Fi hotspot and cell tower database which is downloaded from Apple into the iPhone to assist the iPhone in rapidly and accurately calculating location. The reason the iPhone stores so much data is a bug we uncovered and plan to fix shortly….

Why do I believe this self-serving story? Basically because a true log of the phone’s trajectory through time and space would look rather different from the list of entries I find in consolidated.db. My phone spends much of its time in one place, talking all day to the same wifi links and the same cell towers. A log that recorded my moment-by-moment position over a period of months would include many, many repeated contacts with these few nearby sites. But in fact consolidated.db has exactly one entry for each such site. (The structure of the database guarantees this: The wifi MAC address and the set of identifiers for cell towers are primary keys in the data tables, which means they must be unique.) Another clue: Clumps of sites in the same neighborhood all have exactly the same time stamp. It appears they were all downloaded to the phone at the same time. That’s not the way I would have encountered the sites while walking the streets of the Shadyside neighborhood in Pittsburgh.

All the same, I still insist that I have an ownership stake in this database. I’m part of the crowd that sourced it. Without the unwitting participation of millions of iPhone owners, Apple’s database wouldn’t exist. And a piece of it is stored on my phone—some 24 megabytes’ worth (24,209 cell phone towers, 177,103 wifi routers). Finally, even if the database is not constructed as a direct tabulation of my movements, it provides a remarkably accurate record of the places I’ve been. That too makes the data mine.

Addendum 2011-05-12: Apple and Google argue that if you want the benefits of location-based services, then you have to be willing to share information about your whereabouts. Is this trade-off actually necessary? I think not. If the entire database were resident on the phone, then the phone itself could calculate its position, without any need to reveal that position to the outside world. If the global database is too large to put a copy on every phone, then installing larger pieces of it would at least raise the granularity of the information being leaked. There’s a difference between knowing I was in Pennsylvania and knowing I was at the intersection of Fifth Avenue and South Aiken Avenue in the Shadyside neighborhood of Pittsburgh.

The real need for sending my position information to Apple or Google is not so that I can get the benefit of the cell/wifi database but rather so that I can help them build that database. When Google first set out to compile this kind of information, they did so at their own expense, equipping their Streetview photography cars with wifi and cell antennas. The Skyhook database was created in a similar way. Using cell-phone customers to do the same work changes the terms of the transaction in a way I find unpleasant. I’m contributing to a proprietary database; I’m doing the work of drivers who would otherwise have to be hired to cruise the streets; but I’m not being compensated; on the contrary, I’m paying for the privilege. I can see an argument for a scheme in which we all voluntarily contribute data for a public good, but that’s not the nature of this transaction.

I should point out that there are efforts to build publicly accessible databases of cell and wifi coordinates. There’s Cellspotting, which looks interesting but works only with a few kinds of mobile phones. And there’s OpenBmap, which has some rough edges but even so provides an impressive amount of information. It’s the place to go if you want to learn about the cell towers in your neighborhood and figure out the numbering scheme that identifies them in the consolidated.db file.

Finally, a question: Can we imagine a “zero-knowledge” internet location service? GPS works this way: I can get a fix on my position simply by receiving signals from GPS satellites and doing some arithmetic on them; I don’t have to transmit anything at all. What the satellites are broadcasting is merely a time signal, and the arithmetic I have to do consists in finding a consistent time-of-flight solution for signals from three or four of the satellites. If we had internet beacons of known location broadcasting a continual stream of high-resolution time signals, we could do something similar. A complication is that the internet is a very inhomogeneous medium, where signals move at very different speeds. On the other hand, it would be easy to collect input from hundreds of beacons, rather than three or four satellites. Even without the beacons, the art of inferring latitude and longitude from IP number seems to be pretty highly developed; there’s a fascinating recent paper (PDF) on how it’s done, by Yong Wang and colleagues at Northwestern and Microsoft Research. (The GPS-beacons-on-the-internet idea must have been proposed a zillion times by now, but I don’t have a reference ready at hand.)

Zipfy n-grams

28 April 2011

In the 1930s and 40s George Kingsley Zipf studied word frequencies in several languages and came up with a general observation: If you sort all the words from commonest to rarest, the frequency of the word at rank r is proportional to 1/r. This law has a distinctive graphical interpretation: Plotting the logarithm of frequency against the logarithm of rank yields a straight line with a slope of –1.

Do the Google n-grams obey Zipf’s law? When commenter Nick Black raised this question, I didn’t know the answer. Now I do.

Here’s a log-log plot of the number of occurrences of a word w as a function of w’s rank among all words:

Zipf curve for n-gram abundance

We don’t have a straight line here, but it’s not too far off. You can get a good fit with two piecewise straight lines—one line for the top 10,000 ranks and the other for the rest of the 7.4 million words.

Zipf curve with piecewise linear fitted lines

The upper part of the distribution is quite close to Zipf’s prediction, with a slope of –1.007. The lower part is steeper, with a slope of –1.66. Thus the communal vocabulary seems to be split into two parts, with a nucleus of about 10,000 common words and a penumbra of millions of words that show up less often. The two classes have different usage statistics.

I don’t have any clear idea of why language should work this way, but very similar curves have been observed before in other corpora. For example, Ramon Ferrer-i-Cancho and Ricard V. Solé report slopes of –1.06 and –1.97 in the British National Corpus, with the breakpoint again at roughly the rank 10,000. (Preprint here.)

Incidentally, the Zipf plot is truncated at the bottom because the n-gram data set excludes all words that occur fewer than 40 times. Extrapolating the line of slope –1.66 yields an estimate of how large the data set would have been if everything had been included: 77 million distinct n-grams.

The Library of Babble

23 April 2011

The new issue of American Scientist is out, both on newsstands and on the web. My “Computing Science” column takes up a topic I’ve already written about here on bit-player: the huge corpus of “n-grams” extracted from the Google Books scanning project and released to the public by a team from Harvard and Google. (The earlier bit-player items are titled “Googling the Lexicon” and “3.14.”)

After two blog posts and a magazine column, my faithful readers may have had enough of this subject—but not me. I can’t seem to get it out of my system. So I’m going to take this opportunity to publish some of the overflow matter that wouldn’t fit in the column. (And even this won’t be the end of the story. Stay tuned for still more n-grams.)

For the benefit of readers who have not been doting on my every word, here’s a precis. Google aims to digitize all the world’s printed books, and so far they have scanned about 15 million volumes (which is probably about one-eighth of the total). At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, with a dozen collaborators, have been working with digitized text from a subset of 5,195,769 Google book scans. Because of copyright restrictions they cannot release the full text, but they have extracted lists of n-grams, or phrases of n words each, for values of n between 1 and 5. The 1-grams are individual words (or other character strings, such as numbers and punctuation marks); the 2-grams are two-word phrases, and so on. Each n-gram is accompanied by a time series giving the number of occurrences of that n-gram in each year from 1520 to 2008. (For more background and technical detail, see the Science article by Michel, et al.)

If you want to trace changes in the frequency of a specific word over time, Google has set up an online Ngram Viewer that makes this easy. But you can also download the data set, which allows for many other kinds of exploration. The cost is some heavy lifting of multigigabyte files. So far I have worked only with English text (six other languages are also covered) and only with 1-grams, which form the smallest part of the data set.

Here are some basic facts and figures on the English 1-grams:

uncompressed file size (bytes) 9,672,200,350
number of distinct 1-grams 7,380,256
total 1-gram occurrences 359,675,008,445
number of distinct character codes 484
total character occurrences 1,515,454,264,550

That’s a lot of verbiage.

It’s worth pausing for a comment on those 484 distinct character codes. We tend to think of English as being written with an alphabet of just 26 letters, or 52 if you count upper case and lower case separately. Then there are numbers and marks of punctuation, and miscellaneous symbols such as $ and +. The original ASCII code had 95 printable characters. How do you get up to 484? Well, even though these files are derived from English-language books, a fair amount of non-English turns up in them. There’s Greek and Cyrillic and a smattering of Asian languages, as well as all the accented versions of Latin characters seen in Romance and Germanic and Slavic languages. There’s even a little mathematical notation. It’s actually surprising that the data set spans only 484 symbols; this is a small subset of the full Unicode spectrum.

Here is the length distribution of the 7 million distinct 1-grams:

graph of abundance vs. word length for the 7 million distinct 1-grams

Note that the abundance of shorter words is combinatorially limited. With an “alphabet” of 484 symbols, there cannot possibly be more than 484 single-character words, or 4842 two-character words. But this constraint becomes unimportant beyond length 3; for words of five or six letters, only a tiny fraction of all possible combinations are actually observed.

The corresponding distribution for the 359 billion word occurrences looks rather different:

graph of abundance vs. word length for the 359 billion word occurrences

This is roughly what you’d expect to see for a language that encodes information efficiently. As in a Huffman coding, the shortest words are very common, and the sesquipedalian ones are rare. The overall trend is generally linear, and it is remarkably so in the range of lengths from 5 to 11 or 12. Is this a well-known fact? (It is not the shape I would have guessed.) Words of length three and four stand out above the linear trend line. And I should mention that the abundance of single-character words is boosted in this tabulation by the inclusion of punctuation marks, which would probably not be counted at all in other studies of word length.

The graph below is one that appears in my American Scientist column. I reproduce it here because I want to call attention to some curious features of the curves.

historical time series of distinct words and word occurrences

The graph shows the number of distinct words per year (blue) and the total word occurrences per year (red) over the past 200 years or so. I find it interesting that some major historical events are visible in this record. There are dips in both curves at the time of the American Civil War and during both World Wars of the 20th century. Presumably, book publishing languished during those years. There are also ripples that might be attributed to the crash of 1929 and the ensuing Great Depression, although they are less clear.

Other features of the curves don’t have such a ready-made historical explanation. I am particularly curious about the broad sag in the red curve (but not the blue one) from the late 1960s through the 1970s. Between 1967 and 1973, the number of word occurrences declined 12 percent, while the number of distinct words rose 1 percent. This is very strange: We continued to invent new words, but we didn’t make much use of them. Nowhere else do the two curves maintain opposite slopes for any extended period. I can’t explain it. I call it the great Nixonian slump.

I thought I might learn something about the slump by looking at a selection of specific words that exhibit this pattern of abundance—less frequent in the early 70s than in surrounding years. So I extracted about 70,000 of them and sorted them by overall abundance. In many ways it’s an interesting collection. At the very top of the list is Reagan, apparently in eclipse during the Nixon years. Then there’s a swarm of words connected with mechanical and automotive engineering: torque, rotor, stator, pinion, crankshaft, impeller, carburetor, alternator. All of these words might plausibly appear in the same books, so seeing them fade in and out together is not so surprising. Still, one would like to know what happened in book publishing to cause the decline. The 70s were a tough time for the auto industry; is that enough to explain it?

Looking for patterns in these words is fun, but the truth is I don’t believe they have anything to do with the overall Nixonian swoon. Most words have large fluctuations in abundance over a time scale of a decade or two; my extraction procedure merely identified a subset of words that happened to enter a trough at the same time. I could find similar sets for other periods.

In another attempt to explain the slump I divided the data set into halves. One half consists of the 100 most frequent words, which happen to account for almost exactly half of all word occurrences. The other 7,380,156 words make up the second half. Maybe the slump afflicted only common words, or only rare ones? The result was remarkably uninformative:

times series for the top and bottom halves of the frequency range

The two curves trace almost exactly the same time course. Or maybe that’s not so uninformative after all: It tells us that whatever phenomenon causes the slump, the effect is spread out over the entire vocabulary, not just words in a certain frequency range.

Here’s another just-so story I’ve been telling myself in an attempt to understand the Nixonian swoon. The 1960s is when phototypesetting began to displace the older technology of metal type. Suppose that some characteristic of early phototypeset books causes trouble for optical character recognition (OCR) systems. On reading such a book, the OCR program would report an exaggerated number of unique words (since instances of a given word could be read in various different erroneous ways) but the total number of word occurrences would remain constant (since every word is still recognized as some word). This isn’t exactly the pattern we’re seeing, but there’s one more factor to take into account. In the Harvard-Google data set, no word is included unless it appears at least 40 times. If the phototypeset text caused the OCR system to misread the same word in many different ways, some of those errors would fail to reach the 40-occurrence threshold and would just disappear. Thus the total occurrence count could decline even as the number of distinct words increased.

Lying awake in the middle of the night, I thought this story sounded really good. When I got up in the morning, I ran a test. If an unusual number of words are falling off the bottom of the distribution during the Nixon years, then we should also see a bulge just above the bottom, consisting of those words that just barely reached the threshold. But here’s the time series for the 15,518 words with exactly 40 occurrences:

time series for words with exactly 40 occurrences

There’s no hump in the late 60s. On the contrary, the curve looks much like all the rest, with the same sorry sag. Isn’t it annoying when mere fact overturns a perfectly lovely theory?

But enough of the Watergate era. I have a few more loose ends to tie up.

A graph in the magazine illustrates our collective fondness for round numbers—those divisible by 5 or 10. I remark in the text: “Dollar amounts are even more dramatically biased in favor of well-rounded numbers.” Here’s the evidence:

frequency of dollar amounts for numbers from $1 to $100

Dollar amounts mod $1 tell the same story:

prevalence of cents values between 1 and 99

I was not surprised at the prominence of $X.25, $X.50 and $X.75, but in this graph I had expected also to see a strong signal from prices a penny less than a dollar. That signal is detectable but not conspicuous. Apparently, items mentioned in books—perhaps the books themselves—are more commonly priced at $X.95.

Finally, I want to mention two small but troublesome anomalies in the downloadable 1-gram files. The Google OCR algorithms treat most marks of punctuation as separate 1-grams. Thus we can count how many periods, colons and question marks appeared in printed books over the years. But the encoding of two of these symbols was garbled somewhere along the way. The most abundant 1-gram in the entire data set—with 21,396,850,115 occurrences, or about 6 percent of the total—is listed in the files as the double quotation mark. In fact it should be identified as the comma. (In the web interface to the online Ngram Viewer, the comma is a separator, which may have something to do with the confusion in the downloadable files. The character is correctly given as a comma in the private files of Aiden and Michel.)

The second problem entry is even weirder. Loaded into a text editor, it looks like this:

                          """ "

Closer examination with a binary editor shows that the space between the third and fourth quote marks consists of two control characters, with hexadecimal values 0×15 (NAK) and 0×12 (DC2). I make no sense of this, and Aiden and Michel have not yet been able to help. All the same, I think I know how to fix it. If the entry for the double quotation mark is actually a comma, then something else has to be the double quote. This bizarre character string looks like a good candidate.

Rashid’s bits

10 April 2011

I’m in Pittsburgh this weekend, attending a conference at Carnegie Mellon University. The sessions are being held in the Rashid Auditorium of the Gates Building.

Gates bldg CMU

The seats in the auditorium are upholstered with a pattern that seems apt for a computer science department, but it also begs for elucidation:

Rashid seat bits small 0769

What is the patterning principle in these bits?

I’ve asked half a dozen people in the department—faculty, grad students, staff—but they all profess ignorance. Perhaps it’s a secret that’s never divulged to outsiders.

A quick web search turned up only one pertinent reference—a publication of the CMU chapter of the ACM. A photo caption on the last page includes this remark:

Do the binary codes written on the seats in Rashid Auditorium mean anything? After further investigation, we have concluded that while the binary patterns seem not to repeat, they do not map to any meaningful ASCII words.

Could it be true that the sequence has no repetition anywhere in the auditorium? I guess that depends on what you mean by repetition. Obviously short subsequences are repeated, and Axel Thue proved a century ago that no binary sequence can go on indefinitely without repeating some block of bits. But I have not tried to survey all 247 seats looking for the longest repeated block. (Most of the seats are occupied at the moment.)

I have a sneaking suspicion about what lies behind this pattern, but I’m trying to pay attention to a series of talks as I write this, so I’m not going to investigate further until I get home. In the meantime, I thought I’d throw open the question to anyone else who wants to speculate. To make the job a little easier, I’ve transcribed a few hundred bits:

1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1

Any ideas?

Update 2011-04-12: The “sneaking suspicion” I had in mind the other day was ridiculously wrong. I had wanted the pattern to be a segment of the Thue-Morse sequence, which I obliquely alluded to above. (Obliquely and also clumsily, as one commenter rightly pointed out.) The Thue-Morse sequence is one of the little gems of discrete mathematics, and it seemed like a perfect choice for a temple of high nerdiness like the CMU CS department. Here’s how the sequence begins:

0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1

At first glance—just taking in the optical texture of 1s and 0s—it seemed a good match for the fabric in Rashid Auditorium. But a closer look at the pattern on the chairs shows that the bits there cannot possibly be drawn from the Thue-Morse sequence. The key property of the sequence is that it is “cube-free”: No block of bits w is repeated three times in succession, as in www. The block w can be of any length, including a length of just one bit. The fabric pattern includes subsequences of 000 and 111, both of which violate the cube-free constraint. (I should have noticed this before I even got up out of my seat, but I didn’t.)

If not Thue-Morse, then what?

Several commenters note that it cannot simply be bits chosen uniformly at random, and I totally agree. Here are the frequencies of subsequences from one to four bits long:

Rashid 1 bit freqs

Rashid 2 bit freqs

Rashid 3 bit freqs

Rashid 4 bit freqs

For random bits, all of these distributions are flat: Each k-bit subsequence occurs with expected frequency 2k. The pattern at CMU is anything but flat. It has prominent peaks and valleys, with a notable excess of alternating 1s and 0s, and a corresponding deficit of subsequences that repeat the same symbol.

Another way to look at the bit sequences is through the autocorrelation function, which measures how well the sequence matches up with itself when shifted along its length by some fixed distance. An autocorrelation of 1.0 indicates a perfect match, and –1.0 is perfect anticorrelation—all the bits are different. For any series, the autocorrelation has to be 1.0 at a shift of 0. As the graph below shows, the autocorrelation function for the Rashid sequence is strongly negative at a shift of 1, and positive at shift 2. These results are consistent with the observed prevalence of alternating bits. Interestingly, the correlations continue to oscillate at larger shifts, with little sign of damping. What’s even more curious, for shifts greater than 4, the sign of the correlation is the opposite of what you’ve expect for an alternating 01 sequence.

Autocorrelation rashid random thue

For comparison, the graph also shows the autocorrelation function of the Thue-Morse sequence, which has a fascinating structure of its own, and that of a random sequence of bits. Except at shift 0, the random sequence should converge on 0 autocorrelation everywhere.

The results above are based on a somewhat larger sample than the one I gave on Sunday. When I got home from Pittsburgh, I transcribed 35 rows of 79 digits each, for a total of 2,765 bits. I’ve treated the rows as 35 independent samples, since I don’t know how to join them without creating spurious conjunctions.

If you’d like to play with the bits for yourself, they are all here in a PDF that includes both an image of the fabric and my transcription of it. I should mention that typing 2,765 binary digits is not a lot of fun. Before embarking on that project, I struggled to get useful output from OCR services such as Online OCR, but the character-recognition algorithms try desperately to turn numerical data into English, producing stuff like this:

LOLOL010101.11.0110101.1.1010001010111.01.LoLotootoLoc t tot totot IMO /0t 100101 L LO LOLOt L tO1O 101000 OIOtOOlOLOtOlttOttOtOlttOOlOLOttOC

LOL indeed.

At this point, I admit I’m stumped. I just don’t know where all those bits came from.

The CMU-ACM magazine I mention above says the bit stream is not ASCII text. I don’t know how the author determined that, but I tend to agree. ASCII and most other byte-oriented encodings would show an autocorrelation spike at a shift of 8, and there’s no evidence of that.

I’m also sure the sequence is not Morse code. When you smush together letters and words in dot-dash encoding, the result has many more long runs than we see here.

Could it be the binary object code for some program? Rick Rashid was the principal author of the Mach operating system, created at CMU, so a fitting tribute would be to sew his work into the seats of the auditorium he funded. But I’ve never seen a core dump that looked anything like this. Most machine code has blocks of highly repetitive bit patterns.

Maybe the underlying message is text or an image that’s been run through a compression algorithm, such as gzip? That would raise the entropy of the signal and flatten the distribution of subsequences to some extent. But would it yield the specific pattern we see here? In particular, would it favor alternation of 0s and 1s? I don’t see why.

In the comments, Zvika suggests that the sequence looks like the product of a human operator trying to type randomly. I resist this idea. It’s just such an inhuman waste of human resources; I can’t stand to think of some poor schlub pecking away at a keyboard for hours, struggling to be mindless about it. But I suppose it could be true.

I came up with a slightly less barbaric alternative scenario. Suppose a fabric designer says to his programmer friend, “I need some random bits. Can you help me out?” The programmer goes away and fires up her RNG, and comes back with a million bits that pass all the usual tests of randomness. The designer says thanks, but later he complains, “These bits don’t look random. They’ve got all these long runs of 0s and 1s.” The programmer explains about independent choices from a uniform distribution, and the 1/2k spectrum of run lengths, but the designer will have none of it. “I don’t care about the math; give me something that looks right,” he says. So the programmer goes away and creates some new bits, meant to satisfy people who believe that when a coin has come up heads three times in a row, it’s more likely to show tails on the next toss.

I made an attempt to create such a function myself. It’s a simple Markov process, in which each appearance of a 1 in the sequence reduces the probability p that the next symbol will also be a 1. Specifically, I start with p = 1/2. If the next bit is a 1, p is adjusted to p/k, where k is a constant greater than 1. If the new bit is instead a 0, p becomes 1 — p/k. When I set k = 2, this process generated a stream of bits that looked much like the fabric pattern. Here’s a small sample:

1 0 1 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

And the subsequence frequencies also seemed broadly similar to those of the Rashid sequence:

Markov 3 bit freqs with lollies

(The orange bars are the results for the Markov process; the green lollipops show the corresponding frequencies for the Rashid bits.) This outcome was at least mildly encouraging, but then I looked at the autocorrelations in the Markov-generated stream of bits:

Autocorrelation markov

At shifts of 1 and 2, the Markov autocorrelations match those of the Rashid sequence almost exactly, but at longer distances the Markov correlations fade away, as they must in any random process. The long-range structure in the Rashid sequence seems all the more peculiar when displayed in this context.

Still another question is whether the fabric pattern can be truly aperiodic at large scales. If it were a printed fabric, I would be highly skeptical of this proposition. The printing process is inherently repetitive, with the same pattern recurring at intervals determined by the size of the printing plate or cylinder. (Think of wallpaper patterns.) ButRashid fabric detail 0792 the pattern on the upholstery cloth is not printed; it is woven into the fibers, and that changes everything. Looms have been programmable since the time of Jacquard, and there is no fundamental reason a modern loom could not be driven by a computer that supplies an arbitrarily long sequence of nonrepetitive instructions. The Thue-Morse sequence would do nicely for this purpose. So I can easily imagine an aperiodic upholstery pattern; I just don’t know if that’s what we’re seeing here.

Later this week I’ll go take a look at the Harvard version. (Thanks to Max Lin for discovering this second instance.)

Update 2011-04-13: I note here for the benefit of those who don’t read the comments: rouli has swept away all this speculation about looms driven by computers generating infinite aperiodic sequences. In the section of the pattern I transcribed, the last nine lines are identical to the first nine. I’ve checked a couple of other photos (of other seat backs) and confirmed that the whole pattern appears to repeat on a cycle of 26 lines.

How appalling that I could type more than 700 repeated characters and never notice the resemblance.

Update 2011-04-13, later:Yikes! And now Barry Cipra reveals that the pattern also repeats horizontally, with period 60.

3.14

14 March 2011

All those books that Google has been scanning for the past ten years are surprisingly rich in numbers as well as words. The Google Books data set released last December by a Harvard-Google team includes (by my count) 9,620,835,344 occurrences of 458,794 distinct numbers. (Plus another 31,293 numeric values that have dollar signs attached.)

In recognition of pi day, I want to zero in on some successive approximations to the world’s favorite irrational:

Pops 3

Pops 3 point 1

Pops 3 point 14

Pops 3 point 141

In tabular form here are the closest approximations found in the files, along with the abundance of each value:

3.141592 704
3.1415923 80
3.1415926 1141
3.14159265 1300
3.141592653 143
3.1415926535 286
3.141592653589 54
3.14159265358979 338
3.141592653589793 453
3.1415926535898 65
3.14159265359 177
3.1415926536 289
3.141592654 512
3.1415927 776
3.1415928 109
3.1415929 133
3.141593 843

The data set includes only items that appear at least 40 times in the collection of scanned volumes. Closer approximations to pi evidently fell below that threshold. In particular there is no sign of William Shanks’s famous 707-digit calculation, which was published in 1873. So, just for the sake of celebrating 3.14, here are 707 digits of pi—but unlike the product of Shanks’s many years of labor, I think these digits may be correct:

3.1415926535897932384626433832795028841971693993751058209
74944592307816406286208998628034825342117067982148086513
282306647093844609550582231725359408128481117450284102701
938521105559644622948954930381964428810975665933446128475
648233786783165271201909145648566923460348610454326648213
393607260249141273724587006606315588174881520920962829254
0917153643678925903600113305305488204665213841469519415116
09433057270365759591953092186117381932611793105118548074462
3799627495673518857527248912279381830119491298336733624406
5664308602139494639522473719070217986094370277053921717629
3176752384674818467669405132000568127145263560827785771342
7577896091736371787214684409012249534301465495853710507922
796892589235420200