Archive for the ‘statistics’ Category

The ormat game

Monday, August 16th, 2010

Here’s the deal. I’m going to give you a square grid, with some of the cells colored and others possibly left blank. We’ll call this a template. Perhaps the grid will be one of these 3×3 templates:

colored 3x3 ormat grids

You have a supply of transparent plastic overlays that match the grid in size and shape and that also bear patterns of black dots:

dot patterns for the six 3x3 permutation matrices

Note that each of these patterns has exactly three dots, with one dot in each row and each column. The six overlays shown are the only 3×3 grids that have this property.

Your task is to assemble a subset of the overlays and lay them on the template in such a way that dots cover all the colored squares but none of the blank squares. You are welcome to superimpose multiple dots on any colored square, but overall you want to use as few overlays as possible. To make things interesting, I’ll suggest a wager. I’ll pay you $3 for a correct covering of a 3×3 template, but you have to pay me $1 for each overlay you use. Is this a good bet?

Before going further, I should mention that not every conceivable template can be covered under these rules. To take an obvious example, no 3×3 template with fewer than three colored squares can possibly be covered by any combination of the six overlays. But I promise to submit only templates that can be covered by some combination of the given dot patterns; if I err about this, I forfeit the bet.

How does the game play out? If I give you the template marked “1″ above, you can easily win; just choose permutations a and b, which together cover all the colored squares and no others. You pay $2 and get $3. Template 2, with all nine squares colored, looks like it might be the toughest challenge. Clearly, it cannot be covered with fewer than three overlays, since we need a total of nine dots; and it turns out that exactly three overlays are required. Indeed, there are two ways of covering the template with three overlays: a + d + e and b + c + f. Thus this template is a breakeven proposition: You earn $3 and pay $3.

Now we come to template 3, which has eight colored squares and one blank. Surely if you can cover the full nine squares with just three overlays, then you should also be able to cover eight squares—no? I invite you to try it. In fact the only covering that works requires four overlays: b + d + e + f. Thus you shouldn’t take my bet, since I can always give you a template with just one blank, and you’ll have a net loss of $1.

Some background. I’ll return to the gaming table momentarily, but first let me explain what this is all about and where it came from. A few weeks ago, I was writing about “ranges of rankings,” which led me into the topic of permutation matrices. To recapitulate:

  • A permutation matrix is a square matrix with a single 1 in each column and each row, and all the rest of the elements 0.
  • An ormat is a superposition of permutation matrices, formed by applying the Boolean OR function to corresponding elements of the permutation matrices. For example: matrix-or-sum.png
  • Not all square matrices with (0,1) entries can be formed by OR-ing permutation matrices, but there’s an efficient algorithm for deciding whether or not a given matrix is an ormat. (I thank some helpful commenters for enlightening me on this point.)
  • Given an ormat, the total number of distinct permutation paths that can be threaded through the 1 entries of the matrix is equal to the permanent of the matrix. Calculating the permanent is known to be a hard computational problem.

In a comment, Barry Cipra posed the following query:

The permanent tells us the maximum number of different permutations that can be OR-summed to produce a given ormat, but what is the corresponding minimum number? Also, in how many different ways can the minimum be achieved?

The connection between ormats and my little game is probably apparent by now. The template of colored and blank squares is an ormat; the dotted overlays represent permutation matrices; to maximize your payoff in the game (or to minimize your loss), you need to answer Barry’s first question, finding the minimum number of permutations that can be combined to yield the given ormat.

For 3×3 matrices, we can solve this problem by exhaustive search, calculating the OR-sums of all possible combinations of the six 3×3 permutation matrices taken 1, 2, 3, …, 6 at a time. I did this with pencil and paper on a recent airplane trip. Here is a summary of the results:

number of ormats generated by various combinations of permutation matrices

Some of the numbers on this card are easy to explain. The six ormats with just three 1 entries are the permutation matrices themselves. There are six of them because there are 3! = 6 permutations of three things. There are no ormats with four 1 entries for a reason that bears thinking about: There can be no permutations that differ from one another in just one element. When you superimpose any of the six overlays shown above, you can wind up with three, five or six dots, but never four.

At the other end of the scale, it’s no surprise that there’s exactly one ormat with nine 1 entries, and that it takes three permutations to produce it. And then there are the nine ormats with eight 1 entries, which each require four permutations to be OR-ed. These are the single-blank patterns like template 3 above.

Based on these results, I began speculating about what I would see in a tabulation of all 4×4 ormats.

guesses about stats for 4x4 ormats

There would have to be 4! = 24 patterns with four 1 entries, and just one pattern with all 1s, generated by OR-ing four permutations. And there should be 16 ormats that require five permutations, namely the 16 matrices with a single 0 element. This last prediction seemed a little less self-evident than the others.

Pocket change and Cheerios. My thoughts about the single-zero (or single-blank) case went something like this. To cover 15 squares with sets of four dots each, we need at least four sets, or else we simply won’t have enough dots. So a useful starting point is one of the optimal arrangements that cover all 16 squares without gaps or overlaps. By this time I had grown tired of drawing zillions of dots, and so I started working with sets of coins.

initial configuration of four permutations of coins

In this arrangement each coin denomination forms a permutation, with no two pennies, nickels, dimes or quarters in the same row or the same column. We have successfully covered all the colored squares, but unfortunately we’ve also covered the blank at the lower right. Thus this pattern of coins is not an acceptable solution, but maybe we can fix it up somehow?

adjusted configuration after one coin is moved

Moving the penny from the blank square to another square in the same column solves one problem but creates another: Now the arrangement of pennies is no longer a permutation. There are two pennies in the third row.

coins after second adjustment to restore permutation

So now we have to shift another penny to restore the one-per-row-and-column property. Inevitably, this leaves a colored square uncovered. The only way we can cover that exposed square is to introduce a fifth permutation. Since I had run out of coin denominations, I chose a popular brand of breakfast toroids. Voila:

coins and cheerios -- the five-permutation solution

There’s nothing special about the particular moves I chose in this sequence. If you try some alternatives, you should be able to persuade yourself that moving the penny that covers the blank to any other square in the fourth column (or in the fourth row) would lead to essentially the same situation. Likewise the game would come out the same if the single blank square were placed anywhere else in the grid. And you could also start with a different set of initial permutations (provided they cover all the squares).

This coin-shuffling exercise demonstrates that we can cover any 4×4 template that has a single blank by combining no more than five permutations, but how do we know that five are actually needed? Maybe there’s some totally different arrangement that would do the job with just four permutations? Well, think about what such an arrangement would look like. It would have to differ at exactly one position from some other layout of permutations that covers the full 16-square grid. But no two permutations can differ at one and only one place. Thus the reason there can be no four-permutation cover of 15 squares is essentially the same as the reason no 4×4 ormat pattern can cover just five squares.

This argument generalizes to k×k matrices: For any integer k, there must be at least k ormat patterns that cannot be covered with fewer than k+1 permutations. But then comes the bigger speculative leap: Perhaps k+1 is an upper bound. Perhaps part of the answer to Barry’s question is that no k×k ormat pattern requires more than k+1 permutations. At one point I even had a “proof” of this conjecture. Then I wrote a program to check it, doing much the same thing I did with the dots on the airplane.

Out of bounds. My program found the expected 16 ormat patterns that require five permutations—and it found many more as well. In all it identified 2,032 4×4 ormats that can’t be composed from fewer than five permutations. And then came a bigger surprise: The program also found 480 patterns that require six permutations. So much for my proposed upper bound.

One of those problematic 480 ormats takes this form:

((1 1 1 1) (1 1 1 1) (1 1 1 1) (1 1 0 0))

Looking over this pattern, I thought I understood where my earlier reasoning had gone awry. This matrix is just like the single-zero pattern, but with two zeros! (I do mean for that statement to make sense. Bear with me.) Suppose we start again with a set of four permutations that completely cover the grid, including the two blanks.

starting configuration of 16 coins on 4x4 template with two blanks

Then we can uncover each blank just as we did in the coin-shuffling procedure above, although we have to be careful the two sets of movements don’t interfere with each other. (Not much point in removing the penny from a blank square, then putting the nickel there.) Here is a strategy for clearing both blank squares while maintaining the one-per-column-and-row permutation property:

four coins and two blanks: first solution

Inevitably, when we uncover the two blank squares, we also remove coins from two colored squares, which now have to be filled in again. The key point is that no single permutation can repair that damage, because the two open colored squares are in the same row. To cover both of those squares we need two additional permutations.

Other ways of reshuffling the coins avoid putting the two open squares in the same row or column, but they still foil all attempts to complete the covering with just five permutations. Try adding four Cheerios to the diagram below. If you cover both of the open blue squares, then either you also cover one of the blank squares or you wind up with two Cheerios in the same row.

four coins and two blanks: second solution

So now it’s clear we need as many as six permutations to cover a 4×4 ormat. Does that suggest that the general upper bound might be k+2 rather than k+1? Or perhaps the appropriate formula is 2k–2? In support of this latter possibility I offer these two ormats, which require 8 and 10 permutations respectively:

2k-2ormats.png

Another wager. Having fooled myself several times about the upper bound on minimal ormat coverings, I feel I should build in a little margin for error before I invite you to make a further wager. We already have direct evidence that covering a k×k ormat can take as many as 2k–2 permutations. So I’ll be generous and offer a full $2k for a proper covering, while charging $1 per permutation. If k=3 or k=4, you can definitely make money on this deal. But is it a good bet for larger k? (Hint: I’d be willing to play the game on these terms for real money.)

•     •     •

Update 2010-08-19: No takers for my bet, eh? Too bad; I had already spent my winnings.

Barry Cipra, who raised the question about minimal ormat covers in the first place, sends this illuminating letter:

I’m going to tiptoe a short ways out on a long long limb and conjecture (really just guess) that the “worst case” behavior, in terms of the minimum number of permutations it takes to produce a given ormat, occurs for ormats of the following form, shown here for k=7:

((1 1 1 1 1 1 1) (1 1 1 1 1 1 1) (0 1 1 1 1 1 1) (0 0 1 1 1 1 1) (0 0 0 1 1 1 1) (0 0 0 0 1 1 1) (0 0 0 0 0 1 1))

So as not to abuse existing matrix terminology, I’ll call any (square) matrix of this type—i.e., whose entries below the main subdiagonal are all 0—”uppity triangular.” I can (and will!) show that this uppity triangular ormat for k=7 requires (at least) 16 permutations—and the number appears to grow concavely upwards from that, so I, for one, will definitely not take you up on your $2k wager.

The trick, I realized, is to view each ormat as the “shadow” of what I’ll call an “addmat.” If you let P1, P2, …, Pr be k×k permutation matrices, their addmat is simply the ordinary result of addition: S = P1 + P2 + … + Pr, whose entries are positive integers wherever one or more of the constituent permutations has a 1 and otherwise 0. The associated ormat is obtained by changing each of these entries to a 1, while leaving the 0’s alone. In this sense, the ormat’s 1’s are the “shadows” of the addmat’s positive entries.

What’s crucial is that addmats have a lovely little property not shared with their shadows: the row and columns sums of the entries of an addmat all equal the number of permutations that produce them, r.

Come now, let us reason together…. The uppity triangular ormat example above (for k=7) must come from an addmat of the form

{{

where a, b, and all the *’s are positive integers. In particular, each * is at least 1. Since all row and column sums must be equal, the sum a+b must equal the sum of b and all 6 *’s above it. Hence a is at least 6. Likewise a+b must equal the sum of a and all 6 *’s above it, so b is also at least 6. Hence a+b is at least 12, which means the OR-sum that produced the given ormat involves at least 12 permutations.

This clearly generalizes to arbitrary k, which is more than “direct evidence” that covering a k×k ormat can take as many as 2k–2 permutations, it’s rigorous proof! But we can immediately do better, at least on a case-by-case basis. If we try to get by with just 12 permutations for this uppity triangular ormat, we quickly run into trouble. We obviously must have a = b = 6, and it follows that all the *’s above them are 1’s (to make those column sums 12). That is, we have the addmat

{{

where I now wish to call your attention to the entry labeled “@”. To make its row-sum equal 12, we need @ = 10. But that means its column sum (with the 5 *’s above it) is at least 15, which cannot be! So we are forced to try larger values of a and/or b—which is to say, we need more permutation matrices to produce this addmat.

It turns out you can’t satisfy the row and column sum condition until you get to a = b = 8. I won’t take you through all the steps, but just give you a taste with the penultimate possibility, a = 7, b = 8. The best you can hope for in this case is

{{

Note that I put as much of the “weight” in the last two columns as close to the 7 and 8 as possible, so that I could use the smallest possible value (10) as the entry with 5 positive entries above it. This makes the last three rows, and the right three columns all have the same sum, 15, but now we see a problem in the 12’s column: Its sum is at least 16. So once again, we’re screwed. It’s only with the next attempt that we avoid contradiction:

{{8, 3, 1, 1, 1, 1, 1}, {8, 3, 1, 1, 1, 1, 1}, {0, 10, 2, 1, 1, 1,<br />
  1}, {0, 0, 12, 1, 1, 1, 1}, {0, 0, 0, 12, 2, 1, 1}, {0, 0, 0, 0, 10, 3, 3}, {0, 0, 0, 0, 0, 8, 8}}

This matrix finally has all its row and column sums equal. Please note, this may or may not be an actual addmat of a set of 16 permutation matrices—I suspect it probably is, but I haven’t bothered to check. All we know is that it satisfies a necessary condition of being an addmat, namely that its row and column sums are all equal. (It’d be nice if that were also a sufficient condition, but something tells me it isn’t.)

This example, which can clearly be played out for larger values of k, suggests that not only are you safe with a $2k wager, but with a $(2k+2) wager and higher I’ve played around with this a bit, and persuaded myself that the number of permutation matrices will go to 2k + a lot—for k=10, if I did things correctly, you need 24 permutations (or possibly more, if the uppity triangular matrix the analysis leads to is not an actual addmat). I am entirely convinced that some additional careful thought can streamline the analysis into a nice, slick proof. I’m just not sure I haven’t already made a mistake, and built an elaborate house of cards….

Does any of this jibe with what you’ve already found to be the case?

It does indeed jibe.

First of all, to answer a small question Barry left open, here is a set of 16 permutations that will successfully cover his 7×7 “uppity triangular” matrix:

{1,2,3,4,5,6,7} {2,1,4,3,6,7,5} {1,3,2,5,4,7,6} {1,3,4,2,6,5,7}
{2,3,1,5,6,4,7} {1,2,3,5,6,7,4} {1,2,4,5,3,6,7} {1,2,4,5,6,3,7}
{1,2,4,5,6,7,3} {1,3,4,5,2,6,7} {1,3,4,5,6,2,7} {1,3,4,5,6,7,2}
{2,3,4,1,5,6,7} {2,3,4,5,1,6,7} {2,3,4,5,6,1,7} {2,3,4,5,6,7,1}

This was found with a simple greedy search.

My own attempts to find an upper bound have focused not on uppity triangular matrices but on matrices I’ve been calling “flags,” like this 7×7 case:

{{0, 0, 0, 1, 1, 1, 1}, {0, 0, 0, 1, 1, 1, 1}, {0, 0, 0, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}}

This matrix also requires 16 permutations for a proper covering. To see why, try threading permutations through the columns of the matrix, starting at the left edge and in each column choosing a 1 element (never a 0) from a different row. Because of the block of zeros at the upper left, the first three elements of every permutation must lie in rows 4 through 7. Thus each permutation “uses up” three of the last four rows in the first three columns, and the rest of the permutation can revisit this range of rows only once. It follows that each permutation can touch only one element in the 4×4 block of 1s in the lower right corner of the matrix, and at least 16 permutations are needed to cover all the 1s in the matrix. Showing that 16 are sufficient is not hard.

This kind of analysis works for any odd k, and thus we know that such matrices can require as many as

{\biggl\lceil\frac{k}{2}\biggr\rceil}^2

permutations. (For even k the situation is a little less symmetrical, and I haven’t worked out the exact details.)

These results give us a lower bound on the upper bound on the number of permutations that may be needed to cover a k×k ormat. But we haven’t proved it’s the true upper bound. Are there other ormats that require even more permutations? My guess is no, but keep in mind that almost all my conjectures along these lines have turned out to be wrong.

Four questions about fuzzy rankings

Saturday, July 24th, 2010

The National Research Council is getting ready to release a new assessment of graduate-education programs in the U.S. The previous study, published in 1995, gave each Ph.D.-granting department a numerical score between 0 and 5, then listed all the programs in each discipline in rank order. For example, here’s the top-10 list for doctoral programs in mathematics (as presented by H. J. Newton of Texas A&M University):

 rank    school                 score   
    1    Princeton               4.94
    2    Cal Berkeley            4.94
    3    MIT                     4.92
    4    Harvard                 4.90
    5    Chicago                 4.69
    6    Stanford                4.68
    7    Yale                    4.55
    8    NYU                     4.49
    9    Michigan                4.23
   10    Columbia                4.23

Note that the scores of the first two schools are identical (to two decimal places), and the first four scores differ by less than 1 percent. Given the uncertainties in the data, it seems reasonable to suppose that the ranking could have turned out differently. If the whole survey had been repeated, the first few schools might have appeared in a different order. Doctoral candidates in mathematics are presumably sophisticated enough to understand this point. Nevertheless, the spot at top of the list still carries undeniable prestige, even when you know that the distinction could be merely an artifact of statistical noise.

The committee appointed by the NRC to conduct the new graduate-school study wants to avoid this “spurious precision problem.” They’ve adopted some jazzy statistical methods—mainly a technique called resampling—to model the uncertainty in the data, and they’ve also decreed that the results will be presented differently. There will be no sorted master list showing overall ranks in descending order. Instead the programs in each discipline will be listed alphabetically, and each program will be given a range of possible ranks. For example, a program might be estimated to rank between fifth place and ninth place. Let’s call such a range of ranks a rank-interval, and denote it {5, 6, 7, 8, 9} or {5–9}.

For a hypothetical set of 10 institutions, A through J, here’s what a set of rank-intervals might look like.

bar graph showing ranges of rankings for schools A through J.png

Acknowledging the uncertainty in your findings is commendable. But let’s be realistic. If you actually want to make use of these results—for example, if you’re a student choosing a grad-school program—the first thing you’re going to do is sort those bars into some sort of rank order, trying to figure out which school is best and how they all stack up against one another. In other words, you’re going to undo all the elaborate efforts the NRC committee has put into obscuring that information.

Below is one possible ordering of the bars. I have sorted first on the top of the rank-intervals, then, if two columns have the same top rank, I’ve sorted on the bottom rank. Other sorting rules give similar but not identical results. For example, sorting on the midpoints of the intervals would interchange columns B and F.

bar graphs showing rank-ranges sorted into one canonical order.png

Question 1. Does sorting a set of rank-intervals by one of these simple rules yield a consistent and meaningful total ordering of the data? To put it another way, can you trust this attempt to reconstruct a ranking?

I hasten to add that this is not really a practical question about finding the best grad school. If you’re facing such a choice in real life, the NRC rank-intervals are not the only available source of information. But, for the sake of the mathematical puzzle, let’s pretend that all we know about schools A through J is embodied in those ranges of rankings.

It turns out that rank-intervals have some fairly peculiar behavior. Ranges of ratings are not a problem. If the NRC merely gave each school a fuzzy rating on the 0-to-5 scale, no one would have much trouble interpreting the results. But when you turn fuzzy ratings into fuzzy rankings, there are hidden constraints. For example, not all sets of rank-intervals are well-formed.

two impossible sets of rank-ranges

The set at left is impossible because there’s no one in last place. (We can’t all be above average.) The example at right is also nonsensical because D has no ranking at all. For a set of rank-intervals to be valid, there has to be at least one entry in each row and each column.

That’s a necessary condition, but not a sufficient one, as the two graphs below illustrate.

two more impossible rank-intervals

Do you see the problem with the example at left? Column B has a rank-interval of {1–2}, but in fact B can never rank first because A has no alternative to being first. The case at right is conceptually similar but a little subtler: If B is ranked third, then either first place or second place will have to remain vacant.

The underlying issue here is the presence of constraints or linkages within a set of rankings. Suppose you have calculated ratings and rankings of several schools, and then some new information turns up about one school. You can change the rating of that school without any need to adjust other ratings, but not so the ranking. If a school goes from third place to fourth place, the old fourth-place school has to move to some other rung of the ladder, and somebody has to fill the vacancy in third place. These interdependencies are obvious in a non-fuzzy ranking, but they also exist in the fuzzy case. You can’t just assign arbitrary rank-intervals to the items in a set and assume they’ll all fit together. This observation leads to a second question:

Question 2. What are the admissible sets of rank-intervals? How do we characterize them?

I have a partial answer to this question. It goes like this. Any ranking of k things must be a permutation of the integers from 1 through k. A permutation can be embodied in a permutation matrix—a square k × k matrix in which every row has a single 1, every column has a single 1, and all the other entries are 0. For example, here are the six possible 3 × 3 permutation matrices:

3x3-permutation-matrices.png

They correspond to the rankings (1, 2, 3), (1, 3, 2), (2, 1, 3), (3, 1, 2), (2, 3, 1) and (3, 2, 1).

Since a permutation matrix represents a specific (non-fuzzy) ranking, we can build up a set of rank-intervals by taking the OR-sum of two or more permutation matrices. What do I mean by an OR-sum? It’s just the element-by-element sum of the matrices using the boolean OR operator, ∨, instead of ordinary addition. OR has the following addition table:

                      0 ∨ 0 = 0
                      0 ∨ 1 = 1
                      1 ∨ 0 = 1
                      1 ∨ 1 = 1

For the first two 3 × 3 matrices shown above the arithmetic sum is:

matrix-addition.png

whereas the OR-sum looks like this:

matrix-or-sum.png

Every valid set of rank-intervals must correspond to an OR-sum of permutation matrices, simply because a set of rank-intervals is in fact a collection of permutations. The converse also holds: Any OR-sum of permutation matrices yields an admissible set of rank-intervals. Thus the OR-sums of permutation matrices—let’s call them ormats for brevity—are in one-to-one correspondence with the admissible sets of rank-intervals. (There’s just one catch when applying this idea to the NRC study. The columns of an ormat may well have “gaps,” as in the column pattern (0 1 1 0 0 1 1), which corresponds to the rank-interval {2–3, 6–7}. Will the NRC allow such discontinuous ranges in their grad-school assessments? Perhaps the issue will never come up in practice. In any case, I’m ignoring it here.)

Arithmetic sums of permutation matrices form an open-ended, infinite series; in contrast, there are only finitely many distinguishable OR-sums. The reason is easy to see: Ormats have k2 entries, each of which can take on only two possible values, and so there can’t be more than \(2^{k^{2}}\) distinct matrices. Because of the various constraints on the arrangement of the entries, the actual number of ormats is smaller. For example, at k = 3 the \(2^{k^{2}}\) upper bound allows for 512 ormats, but there are only 49:

the-49-3-by-3-or-sums.png

Thus we come to the next question.

Question 3. For each k ≥ 1, how many distinct ormats can we build by OR-ing subsets of k × k permutation matrices? Is there a closed-form expression for this number?

I have answers only for puny values of k.

   k       upper bound        # of ormats  
   1                 1                  1
   2                16                  3
   3               512                 49
   4            65,536              7,443
   5        33,554,432          6,092,721
   6    68,719,476,736                  ?

The tallies of ormats were calculated by direct enumeration, which is not a promising approach for larger k. (I note—to spare folks the bother of looking—that the sequence 1, 3, 49, 7443, 6092721 does not yet appear in the OEIS.)

To extend this series, we might try to exploit the internal structure and symmetries of the ormats. By sorting the columns and rows of the matrices, we can reduce the 49 3×3 ormats to just six equivalence classes, with the following exemplars:

exemplars of six ormat equivalence classes

Enumerating just these reduced sets of matrices should make it possible to reach larger values of k, but I have not pursued this idea. (Furthermore, the two-dimensional sorting of matrices looks to be a curiously challenging task in itself.)

By the way, I think the number of ormats will approach the \(2^{k^{2}}\) upper bound asymptotically as k increases. Many of the features that disqualify a matrix from ormathood—such as all-zero rows or columns—become rarer when k is large. I have tested this conjecture by generating random (0,1) matrices and then counting how many of them turn out to be ormats.

fraction-of-ormats.png

For k = 1 through 5 the results are in close agreement with the actual counts of ormats, and up to k = 10 the trend is clearly upward. But continuing this inquiry to larger values of k will depend on a positive answer to the next question.

Question 4. Given a square matrix with (0,1) entries, is there an efficient algorithm for deciding whether or not it is an OR-sum of permutation matrices, and thus an admissible set of rank-intervals?

The question asks for a recognition predicate—a procedure that will return true if a matrix is an ormat and otherwise false. If efficiency doesn’t matter, there’s no question such an algorithm exists. At worst, we can generate all the k × k ormats and see if a given matrix is among them. But that’s like saying we can factor integers by producing a complete multiplication table. It just won’t do in practice. Isn’t there a quick and easy shortcut, some distinctive property of ormats that will let us recognize them at a glance?

If we could replace the OR-sum with the ordinary arithmetic sum, the answer would be yes. Permutation matrices have the handy property that all rows and columns sum to 1. An arithmetic sum of r permutation matrices has rows and columns that all sum to r. (It is a semi-magic square.) The converse is also true (though harder to prove): If a matrix of nonnegative integers has rows and columns that all sum to r, it is a sum of r permutation matrices. This fact yields a simple test: Sum the rows and the columns and check for equality.

Unfortunately, the trick won’t work for ormats, because the boolean OR operation throws away even more information than summing does. Because 0 ∨ 1 = 1 ∨ 0 = 1 ∨ 1, infinitely many sets of operands map into the same result, and there’s no obvious way to recover the operands or even to determine how many permutation matrices entered into the OR-sum.

Maybe there’s some other clever trick for recognizing ormats, but I haven’t found it. Let me make the question more concrete. Below are three (0,1) square matrices. Two of them are ormats but the third is not. Can you tell the difference?

three-puzzle-matrices.png

If it’s so hard to recognize an ormat, how did I count the ormats among a bunch of randomly generated (o,1) matrices? By hard work: I reconstructed the set of permutations allowed by each matrix. Visualize a permutation as a path threading its way through the matrix from left to right, connecting only non-zero elements and touching each column and each row just once. When you have drawn all possible permutation paths, check to see if every non-zero element is included in at least one path; if so, then the matrix is an ormat. Note that this is not an efficient recognition procedure. In the worst case (namely, an all-ones matrix), there are k! permutations, so this method has exponential running time. But k! is better than \(2^{k^2}\); and, besides, for sparse matrices the number of permutations is much smaller than k!. The 10 × 10 matrix presented as an example at the start of this post gives rise to 580 permutations, a manageable number. Here’s what they look like, plotted as a spider web of red paths across the bar chart.

ranges-with-paths

Every nonzero site is visited by at least one permutation path, so this set of rank-intervals is indeed valid.

This process of lacing permutations through a matrix finally brings me back to Question 1, about how to make sense of the NRC’s fuzzy ranking scheme. Let’s take a small example:

probability-example-1.png

Examining the graph above shows that A must rank either first or second—but which is more likely? In the absence of more-detailed information, it seems reasonable to assume the two cases are equally likely; we assign them each a probability of 1/2. Similarly, B has the rank-interval {1–3}, and so we might suppose that each of these three cases has probability 1/3. Continuing in the same way, we assign probabilities to every element of the matrix.

probability-example-2.png

But wait! This can’t be right; our probabilities have sprung a leak. Any proper set of probabilities has to sum to 1. Our procedure assures that each column obeys this rule, but there is no such guarantee for the rows. In row 1, we’re missing one-sixth of our probability, and in row 2 we have an excess of 1/2; row 4 comes up short by 1/3.

Is there any self-consistent assignment of probabilities for the elements of this matrix? Sure. As a matter of fact, there are infinitely many such assignments, including this one:

probability-example-3.png

I’ll return in a moment to the question of how I plucked those particular numbers out of the air, but note first what they imply about the ranking of items A through D. For item A, with the rank-interval {1–2}, the odds are two-to-one that it ranks first rather than second. B has the behavior we expected from the outset, with probability uniformly distributed over the three cases. But if you pick either C or D, each with the rank-interval {2–4}, your chance of getting second place is only 1/6, and half the time you’ll be in last place.

Where do these numbers come from? Instead of starting with the assumption that probability is uniformly distributed over each rank-interval, assume that each possible permutation of the ranks is equiprobable. For this matrix there are six allowed permutations: (1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 2, 4), (1, 3, 4, 2), (2, 1, 3, 4) and (2, 1, 4, 3). Observe that four of the six ordering put A first, and only two permutations place A second. We can also tally up such “occupation numbers” for all the other matrix elements:

probability-example-4.png

Dividing these numbers by the total number of permutations, 6, yields the probabilities given above.

We can do the same computation for the 10 × 10 example matrix, which turns out to allow 580 permutations:

ranges-with-path-weights.png

If you care to check, you’ll find that each column and each row sums to 580; dividing all the entries by this number yields a probability matrix with columns and rows that sum to 1 (also known as a doubly stochastic matrix).

This process of tabulating permutation paths recovers some of the information we would have gotten from the arithmetic sum of the permutation matrices—information that was lost in the OR-ing operation. But we get back only some of the information because we have to assume that each permutation included in the OR-sum appears only once. (This is just another way of saying that the allowed permutations are equiprobable.) There’s no particularly good reason to make this assumption, but at least it leads to a feasible probability matrix.

Is there any way of calculating the entries in the doubly stochastic matrix without explicitly tracing out all the permutation paths? I’m sure there is. I think the construction of the matrix can be approached as an integer-programming problem, and perhaps through other kinds of optimization technology. What seems less likely is that there’s some simple and efficient shortcut algorithm. But I could be wrong about that; there’s a lot of mathematics connected with this subject that I don’t understand well enough to write about (e.g., the Birkhoff polytope). I hope others will fill in the gaps.

Getting back to the assessment of grad schools—have we finally found the right way to understand those rank-intervals that the NRC promises to publish any day now? My sense is that a semi-magic square (or, equivalently, a doubly stochastic matrix) will give a less-misleading impression than a simple eyeball sorting on the spans or midpoints of the rank-intervals. But what a lot of bother to get to that point! How many prospective grad students are going to repeat this analysis?

Acknowledgment: Thanks to Geoff Davis of PhDs.org for introducing me to this story. PhDs.org will have the new ratings as soon as the NRC releases them, and may even find a way to make them intelligible! Disclaimer: I’ve done paid work for the PhDs.org web site (but this is not a paid endorsement).

Update 2010-07-27: If you’ve gotten this far, please read the comments as well. A number of commenters have provided important insights and context, which have helped me understand what’s going on in the matrices I’ve been calling ormats. But I’m still a bit murky about the best way to recognize and count them. I’m not sure that publishing my still-murky thoughts is terribly helpful, but maybe someone else will read what follows and give us a dazzling, gemlike synthesis.

For the ormat-recognition problem (Question 4 above), three basic approaches have been mentioned: enumerating the permutation paths through the matrix, examining matrix minors, and looking for perfect matchings in a bipartite graph defined by the matrix. It seems to me that all of these methods are doing the same thing.

Start with Barry Cipra’s method of minors. The basic operation is to choose a nonzero matrix element, then delete the row and the column in which that element occurs. You then apply the same operation to the remaining, smaller matrix.

In tracing permutation paths, we’re looking for sequences of nonzero elements, drawing one element from each column and each row. A way of organizing this search is to choose a nonzero element and then, after recording its location, delete the corresponding column and row, so that no other elements can be chosen from that column or row.

In the method based on Hall’s theorem, as explained by John R., we view the ormat as the adjacency matrix of a bipartite graph, where every nonzero element designates an edge connecting a row vertex to a column vertex. To find a matching, we delete an edge, along with the two vertices it connects (and also all the other edges incident on those vertices). Then we recurse on the smaller remaining graph. (See further update below.) If you translate this operation on the graph back into the language of matrices, deleting an edge and its endpoints amounts to deleting a row and a column of the adjacency matrix.

I am not asserting that these three algorithms are all identical, but they all rely on the same underlying operation. To say more, we would need to consider the control structure of the algorithms—how the basic operations are organized, how the recursion works, all the details of the bookkeeping. I don’t trust myself to make those comparisons without trying to implement the three methods, which I have not yet done. However, at this point I just don’t see how any method can guarantee correct results without something resembling backtracking (or else exhaustive search through an exponential space). After all, we’re not looking for just one matching in the graph, or one decomposition into matrix minors, or one permutation path; we have to examine them all.

Here’s a further hand-wavy argument for the essential difficulty of the task. For a (0,1) matrix, the number of permutation paths that avoid all zero entries is equal to the permanent of the matrix. Computing the permanent of such a matrix is known to be #P-complete.

Update 2010-07-31: With lots of help from my friends, I think I finally get it. Although there could be as many as k! permutation paths in a k × k matrix, you don’t need to examine all of the paths to decide whether or not the matrix is an ormat. It’s enough to establish that one such path passes through each nonzero element. This is what the algorithm based on Hall’s theorem does. As Frans points out in a comment below, I misunderstood the essential nature of that algorithm (in spite of having it explained to me several times). There is no recursive deconstruction into progressively smaller matrix minors; instead, we just loop over all the nonzero elements of the matrix, find the minor associated with each such element, then check for a perfect matching in the minor. (Still more refinements are possible—but already we have a polynomial algorithm.)

With this efficient recognizer predicate, it’s easy to measure the proportion of ormats in random matrices at larger values of k:

fraction-of-ormats-k25.png

As expected, the fraction of ormats approaches 1 beyond about k = 20.

So much for identifying ormats. I am still unable to extend the series of exact counts beyond k = 5. The tabulations for random (0,1) matrices suggest that for k = 6 there should be about 20 billion ormats, and counting that high is just too painful. I need to work out the symmetries of the problem.

As far as I can tell, assigning exact probabilities to the nonzero matrix elements requires a full enumeration of all the permutation paths, and thus a calculation equivalent to the permanent. There may be a useful approximation.

Barry Cipra asks a really good question: The permanent tells us the maximum number of permutations that could possibly be included in a given ormat, but what is the minimum number? A naive upper bound is the number of 1s in the matrix, but I don’t see an easy path to an exact count. But enough for now.

The thrill of the chase

Sunday, July 11th, 2010

How I love to go out hunting on a bright Sunday morning—though it’s not my style to shoot furry/feathery/finny animals. My game is to get up early and stalk a wily factoid.

A posting from Mat Roberts, whose blog I’ve recently discovered, sent me out this morning to chase down a passage in How Long Is a Piece of String, a book by Rob Eastaway and Jeremy Wyndham:

passage from Eastaway-Wyndham, page 160

The concept here seemed familiar, but the term “Lincoln Index” was new to me. Lincoln who? What index?

Google offered some useful clues. (Also a generous helping of false scents—books about Honest Abe that happen to have an index.) Without even clicking on a link I had the general context:

The Lincoln Index provides a way to measure population sizes of individual animal species. It is based on a capture/mark/ recapture method…

So we’re talking ecology and population biology. The original idea was not to catch the same typo twice but to catch the same furry/feathery/finny creature twice. Interesting. However, the first couple of web pages that Google sent me to (here and here) told me nothing about Lincoln. And, oddly, I found no Wikipedia entry for “Lincoln Index.” If it’s not in Wikipedia, does it exist?

With a little more poking around, I stumbled upon another clue that seemed promising: a mention of “the Lincoln-Pearson equation for estimating population size.” I was still in the dark about Lincoln, but Pearson is quite a familiar figure. Surely that’s Karl Pearson, the pioneering statistician, who did much of his work in the biological sciences and might very well have come up with a scheme for estimating population sizes.

Back at Google, though, searching for “Lincoln-Pearson” turned up nothing pertinent other than the page I’d come from (though I did learn that Karl Pearson “read in chambers in Lincoln’s Inn” during his early years studying law).

More beating the bushes. Eventually I realized I had wandered into a blind alley. Somebody needs to hire a pair of proofreaders: The formula is not “Lincoln-Pearson” but “Lincoln-Petersen.” Try those names at Google and you’ll get an abundance of useful pointers. (You’ll also learn that Abraham Lincoln died in Petersen’s Boarding House, across the street from Ford’s Theater. Google is not just a search engine but also a coincidence engine.)

The particular web page where I finally got the correct names (notes for a course at North Carolina State University) explains that capture-mark-recapture methods

are used extensively to estimate populations of fish, game animals, and many non-game animals. The approach was first used by Petersen (1896) to study European plaice in the Baltic Sea and later proposed by Lincoln (1930) to estimate numbers of ducks. Petersen’s and Lincoln’s method is often referred to as the Lincoln-Petersen Index, even though it is not an index but a method to estimate actual population sizes. (Should it not be the Petersen-Lincoln Estimate?)

I decided to pursue Petersen first—and immediately ran into a few further bibliographic brambles. Some citations spell the name “Petersen” and others “Peterson.” Some give the initials “C. G. T.” and others “C. G. J.” or “C. J. G.” The date might be 1895 or 1896 or 1897. Here’s what I believe to be a correct citation:

Petersen, C. G. J. 1896. The yearly immigration of young plaice into the Limfjord from the German Sea. Report of the Danish Biological Station to the Home Department 6:1–48.

Wikipedia identifies our elusive author as Carl Georg Johannes Petersen (1860-1928). He was a founder of the Danish Biological Station, which was not in fact a station but a mobile laboratory—a decommissioned naval vessel that was moved around from year to year. In 1895, Petersen took the station to the Limfjord, a chain of bays, lakes and channels cutting across the Jutland peninsula in northern Denmark. There he studied the plaice fishery. (Back to Wikipedia: “The European plaice is a right-eyed flounder belonging to the Pleuronectidae family.” But let’s not get started on right-eyed and left-eyed flatfish, or we’ll never get to the end of this.)

Petersen’s report is available online, scanned from a copy belonging to the library of the Marine Biological Laboratory and Woods Hole Oceanographic Institution, and hosted by the Biodiversity Heritage Library of the Internet Archive. A second surprise: The report is written in English. But on reading through it I find only vague and murky connections between the work Petersen reports and the mark-recapture method of estimating populations. There’s nothing resembling the E1E2/S formula.

Petersen does describe a series of capture/mark/recapture experiments. A few hundred plaice were caught and marked by attaching numbered buttons, then put back in the water. Fishermen who recaught the labeled fish in later months were asked to report them. But the purpose of this study was not to estimate the total population; instead, Petersen used before-and-after measurements of the marked fish to estimate their growth rate.

In a much larger experiment, some 82,580 plaice (somebody must have counted them!) were transplanted into the fjord, and 10,900 of the fish were marked by having a hole punched in their dorsal fin. The number of marked fish was recorded as the plaice were caught during the coming year. It’s not clear whether the aim of this project was to estimate the total population, but in any case it didn’t work. The fraction of marked fish in the transplanted batch was about 1/7, but the marked fraction in the subsequent catches was 1/5. Petersen remarks, “This result is very strange,” and I have to agree.

When Petersen did try to estimate the plaice population, he didn’t rely on a recapture scheme. He went out with seine nets designed to dredge up every bottom fish in a measured plot, then extrapolated from the density of fish per unit area.

The whole report is fascinating fishy stuff, but it leaves me wondering just how Petersen came to be given credit for the resampling idea. As far as I can tell, it’s not to be found in this paper.

Having chased down Petersen, I turned back to Mr. Lincoln. Without much trouble I was able to identify the work in question:

Lincoln, F. C. 1930. Calculating waterfowl abundance on the basis of banding returns. United States Department of Agriculture Circular 118:1–4.

portrait of Frederick C. Lincoln in his office, with stuffed duck.The author was Frederick C. Lincoln, who was bird-bander-in-chief in the U.S. for some 25 years. The agency he founded has since migrated from the Department of Agriculture to the U.S. Geological Survey and become the Bird Banding Laboratory.

Google returns hundreds of works that cite Lincoln’s paper (including some quite far afield from population biology). But tracking down the USDA document itself was not so easy. If the USDA has it online, I wasn’t able to locate it. But a search of WorldCat eventually turned up an archive in the Hathi Trust Digital Library where you can page through Lincoln’s pamphlet in a copy scanned by Google at the University of Minnesota library.

Lincoln gives only a brief and informal account of the recapture idea, but the basic principle is stated clearly enough:

If in one season 5,000 ducks were banded and yielded 600 first-season returns, or 12 percent, and if during that same season the total number of ducks killed and reported by sportsmen was about 5,000,000, then this number would be equivalent to approximately 12 per cent of the waterfowl population for that year, which would be about 42,000,000.

It’s not hard to translate this formula from the language of duck hunters into the language of proofreaders. The first reader finds 5,000 typos and the second spots 5 million; 600 of these errors are common to both lists, and so the total number of typos is:

\frac{5\,000 \times 5\,000\,000}{600} = 41\,666\,667

So that’s my reward for a morning spent out hunting: 42 million typos.

Does Frederick Lincoln deserve credit for the Lincoln Index? I’d say he has a good claim, except that Pierre Simon de Laplace had the same idea more than a century earlier. In 1802 Laplace applied his method to estimating the (human) population of France. But maybe that’s a story for another Sunday morning.

Epilogue. This is not really a story about typos, or about fish and ducks. It’s about finding things—about the phenomenal ease of chasing facts on the world wide web. Does a marked fish have any hope of escaping recapture there?

 

Fake fits

Tuesday, March 30th, 2010

A few weeks ago I committed an act of deception. I needed a graph showing a few sets of data points, with curves fitted to them. This was strictly for show; neither the points nor the curves would mean anything; it just had to look good. So, out of laziness, I did it backwards: I drew a few arbitrary curves, chose some points on the curves, and then perturbed those points by small random displacements. The result is Figure 2 in Home Baked Graphics.

Apparently I have gotten away with this fakery, since no one has complained. But it nags at my conscience. The curves shown are not fitted to the given data points, but the other way around. And barring an extraordinary coincidence, we can’t expect the curves to give the best possible trend through the points.

I began to wonder: What if you iterate this process? In other words, you start out by generating a curve, say a simple quadratic:

wandering-lines-stages-1.png

You select a few points on the curve, in this case the five points with x coordinates equal to {1, 2, 3, 4, 5}. You perturb the y coordinates of these points by choosing random variates from a normal distribution with unit standard deviation:

wandering-lines-stages-2.png

Now you do a least-squares fit to the five selected points, generating a new quadratic:

wandering-lines-stages-3.png

At this stage you can start over, selecting five points on the new curve and randomly displacing them:

wandering-lines-stages-4.png

And a least-squares fit to those five purple points yields yet another quadratic curve:

wandering-lines-stages-5.png

So what happens if we continue in this way, repeatedly jiggling a few points on a curve and then fitting a new curve to the jiggled points? There’s a seeming hint of convergence in the three curves we’ve seen so far. They look like they might be approaching some fixed trajectory. Could that possibly be true?

I’ll post an update in a few days to say what more I’ve been able to learn about this question. If you just can’t wait, click here for a sneak peek.

The promised update (2010-04-02): You can’t fool anybody in this crowd, not even circa April 1. Of course there is no convergence to a limiting curve.

badhairlarge.png

The butterfly above shows 100 quadratic curves, selected from 100,000 iterations of the least-squares fitting process, with that process applied at points whose x values are in the set {–5, –4, –3, –2, –1, 0, 1, 2, 3, 4, 5}. Clearly enough, the coefficients of the quadratic curves are wandering widely; in particular note that the x2 term can be either positive or negative, yielding curves that are concave upward or downward.

On the other hand, these are not just any old quadratic curves. They all have their apex (the point where f′(x) = 0) fairly close to the origin. Here’s a closer view:

badhairdetail.png

(When Ros glanced at this graph on the screen, her comment was: “Bad hair day.”) And here’s a tracing of the wandering of the apex (taken from a different, longer, run of the program):

wandering-apex.png

The path has the look of a classic random walk, but note that it’s confined to a narrow range of x values. What attractive force holds the apex in this region? It’s the decision to fit the curves to points in the interval –5 ≤ x ≤ 5. Here’s what happens if we instead fit to points in the range 100 ≤ x ≤ 110:

skewedcurves.png

And note that without the symmetry-breaking x1 term, the curve would always be a parabola with its apex on the y axis; only the curve’s shape (or scale) and the height of the y intercept could vary.

This whole process of repeatedly fitting curves to data and then deriving new data from the fitted curves is weird and useless enough that it seems unlikely to appear anywhere in the universe outside my own idle mind. And yet…. Think of the Dow-Jones index, which is calculated as a weighted average of the prices of 30 selected stocks. They are not always the same 30 stocks. Indeed, recent years have seen quite a lot of turnover in the components of the Dow: AIG was replaced by Kraft Foods; General Motors was dropped in favor of Cisco Systems; Honeywell lost out to Chevron. Would it be too utterly cynical of me to suggest that what’s going on here resembles a process of fitting a curve to randomly chosen points, then choosing more points that lie near the curve?

Statistical error

Monday, March 22nd, 2010

Tom Siegfried, the editor of Science News, has published a blistering indictment of statistical methods in science and medicine. I am moved to speak for the defense.

Siegfried discusses a number of specific cases, mainly drawn from the biomedical literature, where faulty statistical reasoning has led to unreliable or erroneous conclusions. I don’t want to quibble over the particulars of those cases; I’ll concede that science provides plentiful examples of statistical analyses gone wrong. Indeed, I could add to Siegfried’s list. But I see these events mainly as failures to use the tools of statistics properly; Siegfried suggests that the problem goes deeper. If I understand him correctly, he believes that the tools themselves are defective and that science would be better off without statistics. Here is how he begins his essay:

For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation.

This argument strikes me as so totally wrong-headed that I have a hard time believing Siegfried is serious about it.

To begin with, the snide remarks about Las Vegas and crapshoots are off-target. The branch of mathematics with roots in the study of gambling is not statistics but the theory of probability. The two fields are closely allied, but they’re not identical. Early statistical ideas came out of astronomy and geodesy, with later developments in the social sciences, genetics and agriculture. If you really must find some vaguely disreputable locale for statistics, the apt choice is not the casino but the brewery (a notable Student of statistics worked for Guinness).

More disturbing than this minor historical flub is Siegfried’s vision of a lost golden age of “rigorous mathematical methods,” debased by the seductive wiles of statistics. I don’t believe there was any such fall from grace. Siegfried doesn’t tell us much about the nature of his prestatistical mathematical paradise, but since he mentions Galileo and Newton, I suppose he may be thinking of classical mechanics as an exemplar of lost innocence. It’s true that the study of planetary orbits and ballistic trajectories does offer up some pithy mathematical laws that purport to be exact descriptions of nature:

mechanics-eqns.png

We don’t usually attach error bars to these expressions, or hedge our bets by saying “Force is equal to mass times acceleration within one standard deviation.” But where do such “exact” laws come from? When Galileo performed his experiments with balls rolling down an inclined plane, the measured data did not exactly conform to a parabolic trajectory. Likewise with Newton’s inverse-square law: No real-world observations precisely follow the form 1/r2–not unless the experiment has been fudged. Making the leap from experimental data to mathematical law requires a process of statistical inference, where we extract some plausible model from the data and attribute any departures from the model to measurement error.

In the time of Galileo and Newton, tools for statistical inference were crude; by the time of Gauss, they were much sharper. In 1801 the newly discovered planetoid Ceres was observed for 41 days before it was lost in the glare of the sun. Astronomers hustled to predict where and when it would reappear in the sky. Among all the attempts, the clear winner was the prediction of Gauss, whose advantage in this competition was not so much superior astronomy as superior statistics. His secret was the method of least squares, which he later backed up with a comprehensive theory of measurement error, introducing the idea of the normal distribution.

Later still, statistics had a role in showing that the “exact” mathematics of Newtonian celestial mechanics is not exact after all. It took careful observations–and careful statistical analysis of those observations–to quantify a tiny anomalous precession in the perihelion of Mercury, explained by general relativity but not by classical gravitation.

Statistics is no “mutant form of math”; it’s the way that science answers the fundamental and inescapable question, “How do we know what is true?” I really can’t imagine how science could survive without statistics. What would replace it? Divination?

Siegfried complains that statistical tools offer no certainty–that when a result is reported as statistically significant at the 1-sigma 2-sigma level (or in other words with a P value of 0.05), there’s still a 1-in-20 chance that it’s a meaningless fluke. Quite so; that’s essentially the definition of a 2-sigma P value. But the uncertainty is not some methodological malfunction. It reflects the true limits of our knowledge. The strength of statistical reasoning is that it makes those limits explicit.

Again, I’ll readily agree that standards of statistical practice should be strengthened, and that weak or faulty conclusions are too common in some areas of the published literature. But the claim that “any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical” is, in essence, illogical. Yes, we have lies, damn lies and statistics. But we also have lies and damn lies about statistics.

Spam by the numbers

Saturday, October 4th, 2008

Reviewing this month’s batch of incoming junk mail, I stumbled upon the following message:

numberspam440.png

In case that image is too tiny to read, here is the first word in source-code form:

     28    47   34
     74    33
      85  42
      16  43    25    5048     08124   8813    2714
      34  02    25       66   50  31   855        05
       3404     65    88362   00  25   72      01651
       8008     36   42  77   27  81   06     04  40
        72      83   02  32   47  12   24     87  33
        78      03    87100    83844   18      21813
                                  08
                              73634

The basic technique is anything but novel. I can remember green-and-white-striped printouts that had my name emblazoned in the same kind of two-inch-high characters. But why are the characters here formed entirely out of numbers, rather than other ASCII glyphs? And do the numbers themselves mean anything?

I think I know the answer to the first question: The spammer thought a message composed of nothing but numerals might slip through the spam filters. (In my case, at least, it didn’t work. I fished this message out of the garbage pail.)

As for the second question, my immediate guess was that the digits are the output of some simple pseudo-random number generator. That would be an easy way to produce them, and it would also allow the spammer to make each individual message unique. On taking a closer look, however, I realized there was something quite nonrandom about the numbers in the message.

Here is the full list of digits. There are exactly 900 of them. Do you see what’s missing?

284734807433341016202332628542642574418481303116432550480812488
132714721846667434022566503185505580464271163634046588362002572
016511712427000046735580083642772781060440148383627872830232471
224873301464000807803871008384418218130077346262602008225346571
155727363470732323181618223162744253246331737038301533254837881
148802160371074555632302255640217448457046416116253484658726108
147181540061231788804563557807254177278106044014838362787283023
247122487330146400080780387100838462042135220046847482422143746
770236783058460185444521283134537306537546855305024142275437615
010235002438258320577785451436776143066166025853832747551576004
831136831376228235381112678466011047530048032816623514158481030
413446024450055236762111281250031205166204213522004684748242214
374677023678305846018544452128313453730653754685530502414227543
761501023500243825832057778545143677614306616602585383274755157
600483113683137622

There’s nary a 9 in the bunch. And in other respects too the digit distribution looks slightly off-kilter:

digitdist.png

When I tabulated all the correlations between successive digits, that too looked a little fishy, although the sample is too small for any reliable conclusions.

                   s e c o n d
           0  1  2  3  4  5  6  7  8  9
      0   23 12 20  9 17  9  7  7  8  0
      1   11 13 11 12 16  8 13  5 10  0
   f  2   11 11 13 15 14 15  6 14  9  0
   i  3   18 13 15  7 11  8 13 13 12  0
   r  4    8  9 12 13 12 10 22 11 18  0
   s  5   11  7  5 14 12 14  4 10 11  0
   t  6   12 10 15  6  8  7 10 10  6  0
      7    6 10  9 10 12  7  9 11 14  0
      8   12 14  8 24 13 10  0  7  6  0
      9    0  0  0  0  0  0  0  0  0  0

So what’s going on here? I think the pseudo-random generator is still a leading candidate, though it would have to be a badly implemented RNG. The absence of 9s isn’t hard to explain: We only have to suppose that the spammer was working in C and wrote the plausible-looking expression random(9), thinking that would generate integers between 0 and 9.

On the other hand, maybe it isn’t random. Maybe there’s a secret message-within-the-message. Anybody see a pattern?

While I’m talking spam, I’ll update my ongoing tally of my inbox contents. I can report that September was a good, strong month for spam, with further steady growth continuing the summer-long trend. The stock market is in retreat and credit is tight, but the purveyors of replica watches are undeterred. My receipts have crossed the 5,000-per-month threshold for the first time:

spamcounts.png

And another threshold has also been left behind: For the first time this month, more than half of my spam is written in Russian. (Based on character-set declarations, 2,858 messages out of 5,021 were in Cyrllic scripts, or about 57 percent.)

Update 2008-10-12: In response to a request in the comments, I’ve uploaded the full text (including headers) of the original email. The file is here. Incidentally, I’ve searched my spam archive for other messages like this one, without success. That in itself makes this a peculiar spam. Usually, if I get a spam once, I see dozens of copies or variants within a few days.

Life Curves

Sunday, August 24th, 2008

J. John Sepkoski, Jr., was a fossil-hunter who did most of his digging in the library, sifting through the literature of paleontology to build a detailed, quantitative timeline of life on earth. Focusing on marine animals, he recorded the earliest and the latest known appearances of thousands of ancient organisms. The final edition of his compendium, published in 2002 (three years after his death at age 50), lists dates for more than 36,000 genera.

A few years ago I had a chance to get closely acquainted with Sepkoski’s compendium, when I needed a machine-readable version of the timeline. The listings were published on CD-ROM (remember those?), but the files were merely unstructured plain text. I needed something I could compute with, and so I spent a week or two reformatting the records and importing them into a database. (Others have done the same thing. Shanan Peters of the University of Wisconsin–Madison maintains an online version.)

Here is the summary graph that was the goal of my data-conversion project; it shows the number of extant genera as a function of time, according to Sepkoski’s tally of comings and goings:

Spekoski.png

My brief hands-on experience with Sepkoski’s compilation gave me a sense of how much care went into its preparation. Getting any large data collection into a computer tends to be a fiddly process. Irregularities that a human reader would hardly notice are sand in the gears of automated text processing. Sepkoski’s data files caused less trouble than I expected. The problems I encountered were mainly trivial typographic anomalies—missing punctuation, erratic spacing—and even those were surprisingly rare. The only hints of potentially meaningful errors were a dozen pairs of duplicated entries, where the same genus appeared twice in the listings. It’s easy to see how that would happen in a project that went on for almost three decades; indeed, it’s amazing there weren’t more duplicates.

In any case, I came away from this project with great respect for Sepkoski’s accomplishment, but that doesn’t mean that the curve reproduced above represents the final word on the history of life. It’s not even clear that the main features of the curve and its overall shape give an accurate portrait of changes in global biodiversity.

In constructing any such historical time series, certain biases and distortions are hard to overcome. Of particular importance in this case, fossils from more recent intervals are more likely to survive and to be discovered than those from more ancient times. This “pull of the recent” effect raises questions about the steep upward trend that dominates the Sepkoski curve from the Cretaceous to the present. Has evolution really been going crazy with innovation throughout the past 150 million years, or is that hockey-stick curve an artifact of preservational and sampling bias?

A newly completed analysis of another big fossil database addresses this question (and others). The data source for the new analysis is the Paleobiology Database, a large collaborative project coordinated by John Alroy of the University of California–Santa Barbara. The Paleobiology Database might be called a metacompilation: It brings together statistical and descriptive information from thousands of more-specialized fossil collections (83,444 at the latest count). Initial work on the database began a decade ago (Sepkoski was an early contributor), but it has shown a recent growth spurt.

Of course the new database is vulnerable to the same kinds of systematic bias that Sepkoski had to confront. There’s no avoiding the fact that, on the whole, younger geological strata are more accessible and better studied, and younger fossils are better preserved. But by organizing the data differently and retaining more information about each taxonomic group, Alroy and his colleagues see an opportunity to correct or compensate for some of the biases. Of particular note, whereas Sepkoski recorded only the first and last known appearance of each genus, Alroy et al. attempt to keep track of every occurrence of an organism. This extra information allows sampling bias to be estimated and corrected.

Consider these hypothetical fossil records, where each dot represents a single occurrence of a fossil organism in one of nine labeled intervals:

Alroy.png

In both cases Sepkoski’s protocol would merely indicate that the taxonomic group originated in period 3 and became extinct in or after period 8. The new database records each time unit in which the fossil was found and, whenever possible, the number of occurrences per interval. This data might seem like superfluous detail. After all, if an organism was alive in periods 3 and 8, we can safely infer that it must have existed in periods 4, 5, 6 and 7 as well, whether or not fossil evidence has come to light. But it turns out that recording occurrences rather than just chronological ranges allows for some helpful statistical magic.

As I understand it, the scheme works something like this. Suppose we could gather together all the fossils ever collected by paleontologists, and sort them into bins according to age. Because of the various sampling and preservational biases, the bins for fairly recent periods (say 50 million years ago, in the Tertiary) would be much fuller than the bins for earlier times (say 400 million years ago, in the Devonian). Any bin with more specimens would be likely to exhibit more diversity as well, simply because rare organisms have a better chance of showing up at least once in a larger sample. But we can control for this bias through a simple subsampling procedure: Draw a fixed number of specimens from each bin, making each selection at random and with replacement. The counts of genera in the subsamples should reflect the true diversity of the biota in each bin.

In practice it gets more complicated than that, because we can’t actually sample the entire fossil record at the level of individual specimens; the best we can do is to randomly choose collections of fossils or the publications that describe them. And the publications vary greatly in how much quantitative data they include; some are just lists of species observed.

After many adjustments, refinements and calibrations, Alroy and 34 co-authors have published a diversity curve based on the subsampling technique:

Alroy.png

(Graph courtesy of John Alroy.)

Their article (subscription required) appeared last month in Science, along with 67 pages of supplementary material.

The Sepkoski and the Alroy graphs are twins separated at birth—widely separated. The overall upward trend still exists in the newer graph, but it is much less dramatic, especially in the past 100 million years. Some of the famous mass-extinction events, such as those at the end of the Permian (P) and at the end of the Cretaceous (K), are visible in the new graph but are altered in character; instead of a sudden crash after a sustained build-up, we see something more like a return to normal after a brief, sharp spike in diversity. (Alroy elaborates on the dynamics of mass extinctions in a second recent article, this one in PNAS.)

Looking at the two curves, I arrive at this question: How is the interested but nonexpert reader to evaluate these contrasting views of our planetary past? I want to emphasize that the question animating me is not “Who is right?” but “How can we know who is right?” Is there some way that the ordinary, scientifically literate outsider can form a reasoned judgment about such competing claims to truth?

It was questions like these that got me in trouble the last time I wandered into this area. In 2005 Richard A. Muller of the Lawrence Berkeley National Laboratory and Robert A. Rohde, a graduate student at UC Berkeley, published a report in Nature claiming to detect periodic cycles of rising and falling diversity in the Sepkoski data. Applying Fourier analysis to the time series, they reported finding a strong signal at a period of 62 million years and a weaker one at 140 million years. The claim was controversial from the start, and I decided to take a do-it-yourself approach to understanding the issue. I went back to the original data, reimplemented the analytic methods and tried to assess the robustness of the conclusion. I told the story in an American Scientist column.

The column pleased no one. It certainly didn’t please Muller and Rohde, who objected that I was out of my depth in my amateur attempt to replicate their work. It didn’t please the critics of the Muller-Rohde hypothesis, who thought my focus on certain narrow technical issues deflected attention from deeper conceptual flaws in the argument. And it didn’t please me, because I agreed with the criticisms from both sides.

I should also mention that my column had zero impact on the controversy, which not only continues to rage but has also been extended to the new database. Alroy writes in the PNAS article that some of the peaks and valleys forming the supposed cycles fail to materialize in the new data set. On the other hand, a preprint from Adrian L. Melott of the University of Kansas argues that cycles with periods of 62 and 150 million years emerge from the Paleobiology Database with higher statistical significance than they had in the Sepkoski collection.

All in all, I think I’ll sit this one out. I’ve been itching to get my hands on some records from the new database and implement the subsampling algorithm (which sounds both intriguing and readily accessible). It would be fun to play with these ideas. But I’ll let someone else have the fun this time.

Science builds its credibility on the bedrock idea that experiments and other kinds of results are subject to independent confirmation or refutation. And the advent of computational science has made this egalitarian ideal much more practical than it used to be. Although experiments in high-energy physics remain beyond the means of most amateurs, anything done with a computer rather than a particle accelerator is pretty much fair game these days. Still, there are bounds. If every reader set out to replicate every experiment, the world wouldn’t make much progress.

Big Money

Sunday, August 3rd, 2008

Zimbabwean bank notes, including a ZW$50,000,000,000 Special Agro-Check

(Photo courtesy ZeroOne.)

It’s a cruel irony: As the citizens of Zimbabwe sink into bitter poverty, they are becoming millionaires and billionaires. Inflation is eroding the value of the Zimbabwean dollar so rapidly that everyday transactions turn into lessons in the arithmetic of large numbers. When the photo above was made on July 17, the largest currency denomination in circulation was a note for ZW$50,000,000,000. Last week the nation’s central bank issued a ZW$100,000,000,000 bill. (I’ll spare you the trouble of counting zeroes: That’s 1011, or 100 billion by American reckoning.)

The Zimbabwean inflation is the worst in the world at the moment, but it is not (yet) setting all-time records. Probably the most famous episode of extreme inflation was that of the German Weimar Republic (a story told vividly in Erich Maria Remarque’s novel The Black Obelisk.) In 1921, German marks traded at about 60 to the U.S. dollar; two years later, in December of 1923, the exchange rate was 4.2×1012 per dollar. The Hungarian inflation following World War II reached even greater numerical heights. In a single year the exchange rate for the Hungarian pengo went from 100 per U.S. dollar to 4×1029. As Feynman said, astronomical numbers are dwarfed by economical ones.

Takayuki Mizuno, Misako Takayasu and Hideki Takayasu have analyzed the German and Hungarian episodes of “hyperinflation.” (Citation: Physica A 308 (2002) 411; there’s also an arXiv preprint.) Inflation at its worst, they find, proceeds at a doubly exponential rate. In other words, prices rise not just as an exponential function of time—exp(t)—but as an exponentiated exponential—exp(exp(t))—or:

doubleexpt.png

This growth law has a simple meaning in terms of everyday experience. With “ordinary,” single-exponential inflation, prices have a constant doubling time. If bus fare was 1 million last month and 2 million this month, it will be 4 million next month. Under double-exponential growth, the doubling time itself decreases exponentially. In the last months of the Hungarian inflation the doubling time fell from about 20 days to 15 hours.

On a logarithmic scale, a simple exponential function yields a straight-line graph. Here is the Mizuno-Takayasu evidence that the final phase of the Hungarian inflation was superexponential:

Mizunofg1.jpg

And here are the data for the final six months plotted as log(log(p(t))), showing a simple linear trend:

Mizunofg2.jpg

How does the Zimbabwean economy look when submitted to this kind of scrutiny? I don’t know of a reliable source of data on prices in Zimbabwe, but foreign exchange rates can serve as a rough proxy. Until three months ago, the official ZW$ rate was pegged at roughly 30,000 per US$, but on May 10 the currency was allowed to float free, and the rate immediately jumped to 190,000,000 ZW$ per US$. By July 31 the rate had reached 57,381,544,140. Thus the 50 billion ZW$ note in the photo above was worth a little less than a 1 US$ by the end of last month. And that’s at the official rate of exchange; the street value is reportedly about a tenth of the official quote.

Here’s how the official exchange rate has varied in the 84 days between May 10 and August 1, as plotted on a linear scale:

ZW-rates.png

And here’s the same data after a logarithmic transformation:

ZW-log-and-fit.png

Although there’s more bumpiness here than in the Mizuno-Takayasu data, the trend looks reasonably linear to me. The fitted line has slope 0.03358, which yields a doubling time of about nine days. I see no hint of superexponential growth. I’d like to think this is an encouraging sign, a glimmer of hope that Zimbabwe will be spared an even more pernicious phase, when even inflation has inflation.

Runaway inflation is usually blamed on the incompetence or malevolence of governments and the central banks that implement their policies. In the case of Zimbabwe, the government of Robert Mugabe certainly has a lot to answer for. The country was once the shining success story of southern Africa—I have friends who migrated across the continent to go to school there—but the nation is now a basket case, and inflation is only one of many urgent crises. (The unemployment rate is reported to be 80 percent.) The Mugabe regime can’t escape blame for this situation. Still, it seems that hyperinflation is not to be explained purely in terms of fundamental economic imbalances—too many dollars and not enough goods. Sometimes it seems there is also a psychological component. When you believe that prices will double next week, you raise your own prices in anticipation. It’s a self-reinforcing process.

One sign of such a feedback loop in the inflationary spiral is that inflation sometimes stops even though the underlying economic situation hasn’t really changed. The Weimar hyperinflation ended with the introduction of the Rentenmark, which was set equal to 1012 old marks but really had no firmer backing than the earlier Papiermark. The change in currency did nothing to solve Germany’s problems of debt and unemployment, but the inflation ended anyway. Evidently, people chose to believe that the value of the Rentenmark would remain stable, and it did.

The central bank of Zimbabwe has just announced a similar effort at currency reform, devaluing the ZW$ by a factor of 1010. In other words, the ZW$100,000,000,000 note introduced a week ago is equal in value to a new ZW$10 bill. According to press reports, the main motive for the change was simply logistical convenience:

Gideon Gono, the Central Bank governor, … acted because the high rate of inflation was hampering the country’s computer systems. Computers, electronic calculators and automated teller machines at Zimbabwe’s banks cannot handle basic transactions in billions and trillions of dollars. (AP/Baltimore Sun)

But perhaps one can hope that the newly denominated currency will bring more than numerical benefits. Over the weekend, the official exchange rate has held at 6.569 new Zimbabwe dollars to the U.S. dollar. We’ll have to wait a few more days to see if the curve has really flattened out.

Update 2008-09-04: With another month of exchange-rate data, here’s what the situation looks like:

ZW-rates-904.png

ZW-log-and-fit-904.png

The blue line in the semilog graph is the same as the one in the corresponding earlier graph—that is to say, it is fitted to the first 80 days of data. It appears that the inflation rate has diminished slightly since the revaluation at the end of July. But that slightly lower rate is still formidable; in a little more than a month the value of the new Zimbabwe dollar has fallen from about 15 cents (U.S.) to about 2 cents.

Update 2008-10-02: After another month, what passes for good news is that the rate of exponential growth does not seem to be growing:

On the other hand, news reports suggest that the situation in Harare is bleaker than ever. Money is scarce as well as nearly worthless; people stand in line all night for the privilege of withdrawing the equivalent of a dollar or two from their own bank accounts. (Note that the equivalent of $1 U.S. is $ZW137 in the devalued currency issued in August. In pre-devaluation Zimbabwe dollars, it comes to $ZW1.37 trillion.)

Isn’t it curious that both here in the U.S. and in Zimbabwe, the financial pages are filled with such enormous numbers.

Update 2008-11-02: One more month of data:

Still no sign of “hyperinflation”—if that term is taken to mean doubly exponential growth—but that can’t be much solace to the Zimbabweans whose currency has yet again lost three-fourths of its value over the course of a month. Adjusting for the August devaluation, one U.S. dollar now buys 5.6 trillion Zimbabwean dollars.

Spam stats

Thursday, June 5th, 2008

Hormel Foods, the Minnesota meatpacker, reports a surge in sales of Spam. News accounts attribute the rising popularity of the pink meat-in-a-can to higher prices for other commodities. Or maybe it’s the Spam musubi fad.

Meanwhile, the other kind of spam seems to be surging as well. I’ve been keeping track of my personal spam consumption for the past five years. (I first wrote about this in 2003, with a follow-up in 2007.) Here’s a record of the total number of messages landing in my spam bin each month since the start of 2007:

spamvolume.png

The lull last spring gave me some hope that spam was finally in decline; the monthly intake even fell below 1,000 messages in March and April. But the respite didn’t last. There was steady growth through last summer and fall, and now another spike in volume has brought the rate to nearly 3,000 messages per month.

The message counts charted above lump together spam sent to several email addresses. Here’s a breakdown by address, covering the entire 17-month period:

mailboxes.png

The two addresses that attract the most unwanted traffic—namely, my address here at bit-player.org and another at amsci.org—are both published openly on the web, without any form of obfuscation. So are the addresses identified in the pie chart as “il-perms” and “il-prints”; they appear on my industrial-landscape.org web site. I’m certainly not surprised that spammers have discovered these addresses; they are fair game to anyone who knows how to scrape a web site. But there are still some puzzles in the data. I have several more email addresses that are equally vulnerable—they are published in the same places—but they receive nary a spam. Why not? And my earthlink.net and acm.org addresses are not published (or even much used), yet they get a healthy share of junk mail.

The content of the spam remains much the same—replica watches, blue pills, pirate software, phishing expeditions. Numbingly repetitious. In one week I got 25 messages with the same subject line: “eBay New Unpaid Item Message from snorelax67.” Then there were the 34 messages with subject lines such as “Viadzgra - $1.20,” “Viabqgra - $1.75,” “Viafmgra - $1.09″ and “Viategra - $1.38.” (Evidently someone has written a little program to insert random letter pairs in the middle of the word. My spam filter was not fooled. Nor did it fall for “Hihg - qualiyt repliacs of the ebst lcock of the wrold!!”) In “How Many Ways Can You Spell V1@gra?” I argued that most of the world’s spam is coming from a relatively small number of senders—tens or possibly hundreds, but not thousands—and I think the evidence continues to support that conjecture.

One interesting trend in my spam is that it seems to be growing more cosmopolitan. Back in 2003, about 18 percent of the spam I received was written in languages other than English; the figure now is 34 percent. The distribution of languages is curious. Here are the data for May 2008, when I received a total of 933 non-English spams:

spamlangs.png

Does everybody get gobs of spam in Russian, or is it just me? Is there something about my Internet activity that leads mailing-list compilers to believe I read Russian? Well, here’s the sad truth: My knowledge of Russian is so totally lacking that I’m not even sure all those messages are really Russian. They come with a Cyrillic character encoding, but for all I know some of them could be Bulgarian or Ukrainian. I’m equally in the dark about the 153 messages that appear to be written in various Asian languages (Chinese, Japanese, Korean). As for the German messages, they are something of a novelty. Until a few weeks ago, I almost never saw spam in German, and now there’s a sudden spate. It’s pretty clear that all of it comes from the same source. I’m seeing no French spam, nor Portuguese, nor Hindi, Urdu, Arabic, Hebrew.

Linguistic diversity is laudable, and in general I’m pleased to see challenges to Anglophone hegemony. I’m always flattered when someone addresses me in another language—even if I can’t respond in kind. But in this case I’m afraid there’s no reason to be congratulating myself. The spammers are not sending me these multilingual documents because they take me for an accomplished and urbane polyglot. They’re sending them to me (and to millions of others) because selectivity just isn’t worth the bother. Addressees like you and me are too cheap to count. Spam is becoming something like the cosmic microwave background radiation. It’s everywhere, it’s meaningless, it can be mistaken for birdshit.

Update 2008-07-01. More pink meat. I’ve tallied up the receipts for June, and my personal spam volume has set a new record: 3,354 messages, an increase of 20 percent over the previous high of 2,794 in May. The updated graph now covers 18 months:

spamvolume701.png

It’s worrisome to see the quantity growing so fast, but let me try to put the matter in perspective. Alongside the 3,354 spams I received in June, I also received 1,245 nonspam messages. Thus the proportion of spam is about 73 percent—well under the figure of 90 percent that’s often bandied about by companies that sell anti-spam products and services. Moreover, the spam causes me very little actual bother; almost all of it goes directly into the junk folder without need for human intervention. The nonspam messages, on the other hand, demand to be read and responded to. Perhaps I’d get more accomplished if more of my mail were spam.

I have not done a language analysis of the new batch, but I can tell at a glance that I’m still attracting a bizarre glut of Russian spam. A subject line that caught my eye reads:

programspam.png

I can sound out just enough Russian to guess the transliteration “programme spam.” Inside the message is an image of an advertisement (also in Russian) for various warez. But the decoy text that’s meant to get the message through the spam filters is a sports story written in German. Thus even individual messages are now becoming multilingual.

Update 2008-09-01: When I started this thread back in the spring, I thought I was taking note of a step function in the spam rate—a sudden jump from 2,000 a month to a new plateau at 2,500 a month. The trend looks different now: not a series of steps but sustained steady growth, with an increment of roughly 500 a month:

spamvolume901.png

Total spams received in my various inboxes came to 3,886 for July and 4,489 for August.

And I continue to be amazed and baffled by the quantity of Russian-language spam. The proportion of my spam written in a Cyrillic alphabet is now above 40 percent. The growth in Russian-language messages accounts for about two-thirds of the overall increase in the past few months. Should I read some geopolitical meaning into this trend?

The temblor forecast

Tuesday, April 15th, 2008

From the Associated Press, via the New York Times:

LOS ANGELES (AP) — California faces an almost certain risk of being rocked by a strong earthquake by 2037, scientists said in the first statewide temblor forecast.

New calculations reveal there is a 99.7 percent chance a magnitude 6.7 quake or larger will strike in the next 30 years. The odds of such an event are higher in Southern California than Northern California, 97 percent versus 93 percent.

caquake.jpg

I read this report with a certain sense of wonder. What impressed me was not the prediction itself; it’s not the first time I’ve heard that the Big One is coming. What took me by surprise was the level of mathematical sophistication that we can now take for granted in readers of the morning newspaper. No more do we have to worry that people will add up 97 percent and 93 percent to get 190 percent. Evidently, we’ve reached a state of universal numeracy, where everyone knows how to combine probabilities, and there’s no need to explain the calculation. We don’t even need to remind anyone that when we compute 1 – (1 – p)(1 – q), or p + qpq, we are assuming that p and q represent probabilities of statistically independent events; everybody knows that. And everybody understands that in this context “a chance of a quake” really means “a chance of at least one quake.”

I guess the only place where we might still stumble is in actually doing the arithmetic. My calculator tells me the number is 99.8 percent, not 99.7.

A further note: The original report on which the news item is based leaves me even more perplexed. The probability model adopted in the forecast is explained as follows:

The simplest assumption is that earthquakes occur randomly in time at a constant rate; i.e., they obey Poisson statistics. This model, which is used in constructing the national seismic hazard maps, is “time independent” in the sense that the probability of each earthquake rupture is completely independent of the timing of all others. Here we depart from the… conventions by considering “time-dependent” earthquake rupture forecasts that condition the event probabilities… on the date of the last major rupture. Such models… are motivated by the elastic rebound theory of the earthquake cycle…; they are based on stress-renewal models, in which probabilities drop immediately after a large earthquake releases tectonic stress on a fault and rise as the stress re-accumulates due to constant tectonic loading of the fault.

In other words, it doesn’t sound as though the assumption of independence is even approximately satisfied. I must be missing something. The 99.7 percent combined probability is mentioned in the executive summary of the report, but I found no explanation of how that number was calculated.

Perhaps I shouldn’t worry so much. I live thousands of kilometers away in a zone of seismic serenity.

Update, several hours later: After reading a little more carefully, I think the report does assume that all possible earthquake sites are independent. At each site the probability of an event is a function of time, but it is independent of probabilities at other sites. Thus calculating a joint probability for the northern and southern parts of the state does seem to be a valid operation. And the distinction between “exactly one” and “at least one” doesn’t really enter into the matter either. That’s because the model is only valid until the next major earthquake occurs; after that, all bets are off, since the time-dependent probabilities have to be recalculated.

If this interpretation of the model is correct, I think the way the result is expressed is somewhat misleading. To say there’s a 97 percent chance in Socal and a 93-percent chance in Nocal implies there’s a high probability (90.2 percent) of seeing both events in the course of the 30-year period. But the model is no longer valid after the first quake.

I wonder if there isn’t a better way to express the concept at the heart of this story. Qualitatively, it’s easy enough to grasp: In the next 30 years there will almost certainly be a major earthquake somewhere in California, and the event is more likely to happen in the southern part of the state than in the northern part. Putting this into numbers is somewhat tricky—or at least I’ve had a lot of trouble with it. Having finally surrendered to the computer and performed a Monte Carlo simulation, I come up with this statement: There’s a 99.8 percent chance that the next major California earthquake will happen by 2037. If indeed such a quake occurs, the odds are about 57 to 43 it will hit in Southern California.