Archive for the ‘statistics’ Category

Spam by the numbers

Saturday, October 4th, 2008

Reviewing this month’s batch of incoming junk mail, I stumbled upon the following message:

numberspam440.png

In case that image is too tiny to read, here is the first word in source-code form:

     28    47   34
     74    33
      85  42
      16  43    25    5048     08124   8813    2714
      34  02    25       66   50  31   855        05
       3404     65    88362   00  25   72      01651
       8008     36   42  77   27  81   06     04  40
        72      83   02  32   47  12   24     87  33
        78      03    87100    83844   18      21813
                                  08
                              73634

The basic technique is anything but novel. I can remember green-and-white-striped printouts that had my name emblazoned in the same kind of two-inch-high characters. But why are the characters here formed entirely out of numbers, rather than other ASCII glyphs? And do the numbers themselves mean anything?

I think I know the answer to the first question: The spammer thought a message composed of nothing but numerals might slip through the spam filters. (In my case, at least, it didn’t work. I fished this message out of the garbage pail.)

As for the second question, my immediate guess was that the digits are the output of some simple pseudo-random number generator. That would be an easy way to produce them, and it would also allow the spammer to make each individual message unique. On taking a closer look, however, I realized there was something quite nonrandom about the numbers in the message.

Here is the full list of digits. There are exactly 900 of them. Do you see what’s missing?

284734807433341016202332628542642574418481303116432550480812488
132714721846667434022566503185505580464271163634046588362002572
016511712427000046735580083642772781060440148383627872830232471
224873301464000807803871008384418218130077346262602008225346571
155727363470732323181618223162744253246331737038301533254837881
148802160371074555632302255640217448457046416116253484658726108
147181540061231788804563557807254177278106044014838362787283023
247122487330146400080780387100838462042135220046847482422143746
770236783058460185444521283134537306537546855305024142275437615
010235002438258320577785451436776143066166025853832747551576004
831136831376228235381112678466011047530048032816623514158481030
413446024450055236762111281250031205166204213522004684748242214
374677023678305846018544452128313453730653754685530502414227543
761501023500243825832057778545143677614306616602585383274755157
600483113683137622

There’s nary a 9 in the bunch. And in other respects too the digit distribution looks slightly off-kilter:

digitdist.png

When I tabulated all the correlations between successive digits, that too looked a little fishy, although the sample is too small for any reliable conclusions.

                   s e c o n d
           0  1  2  3  4  5  6  7  8  9
      0   23 12 20  9 17  9  7  7  8  0
      1   11 13 11 12 16  8 13  5 10  0
   f  2   11 11 13 15 14 15  6 14  9  0
   i  3   18 13 15  7 11  8 13 13 12  0
   r  4    8  9 12 13 12 10 22 11 18  0
   s  5   11  7  5 14 12 14  4 10 11  0
   t  6   12 10 15  6  8  7 10 10  6  0
      7    6 10  9 10 12  7  9 11 14  0
      8   12 14  8 24 13 10  0  7  6  0
      9    0  0  0  0  0  0  0  0  0  0
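
For anyone who wants to repeat the tally, here is a minimal Python sketch of how a table like the one above could be built. The function name `pair_counts` and the 20-digit `sample` are illustrative choices; the sample is just the first 20 digits of the message, not the full 900.

```python
from collections import Counter

def pair_counts(digits: str) -> list[list[int]]:
    """Return a 10x10 table where table[a][b] counts digit a immediately followed by digit b."""
    table = [[0] * 10 for _ in range(10)]
    for a, b in zip(digits, digits[1:]):
        table[int(a)][int(b)] += 1
    return table

# Illustrative sample only: the first 20 digits of the message.
sample = "28473480743334101620"

table = pair_counts(sample)      # rows = first digit, columns = second digit
digit_totals = Counter(sample)   # single-digit frequencies

for first, row in enumerate(table):
    print(first, row)
print("count of 9s:", digit_totals["9"])
```

Feeding in the full digit string in place of `sample` should yield 899 pairs in all, one fewer than the number of digits.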

So what’s going on here? I think the pseudo-random generator is still a leading candidate, though it would have to be a badly implemented RNG. The absence of 9s isn’t hard to explain: We only have to suppose that the spammer was working in C and wrote the plausible-looking expression random(9), thinking that would generate integers between 0 and 9.
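
The off-by-one conjecture is easy to illustrate, though only by analogy. The sketch below uses Python's `random.randrange(9)`, which returns integers from 0 through 8; it is not the spammer's code, and the seed is arbitrary, chosen only so the run is repeatable.

```python
import random

rng = random.Random(2008)  # arbitrary seed, for repeatability only

# randrange(9) has nine possible outcomes, 0 through 8; the bound 9 is excluded.
draws = [rng.randrange(9) for _ in range(900)]

largest = max(draws)
print("largest digit drawn:", largest)
print("number of 9s:", draws.count(9))
```

A generator called this way can never emit a 9, however many digits are drawn, which would match the empty row and column in the table.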

On the other hand, maybe it isn’t random. Maybe there’s a secret message-within-the-message. Anybody see a pattern?

While I’m talking spam, I’ll update my ongoing tally of my inbox contents. I can report that September was a good, strong month for spam, with further steady growth continuing the summer-long trend. The stock market is in retreat and credit is tight, but the purveyors of replica watches are undeterred. My receipts have crossed the 5,000-per-month threshold for the first time:

spamcounts.png

And another threshold has also been left behind: For the first time this month, more than half of my spam is written in Russian. (Based on character-set declarations, 2,858 messages out of 5,021 were in Cyrillic scripts, or about 57 percent.)

Update 2008-10-12: In response to a request in the comments, I’ve uploaded the full text (including headers) of the original email. The file is here. Incidentally, I’ve searched my spam archive for other messages like this one, without success. That in itself makes this a peculiar spam. Usually, if I get a spam once, I see dozens of copies or variants within a few days.

Life Curves

Sunday, August 24th, 2008

J. John Sepkoski, Jr., was a fossil-hunter who did most of his digging in the library, sifting through the literature of paleontology to build a detailed, quantitative timeline of life on earth. Focusing on marine animals, he recorded the earliest and the latest known appearances of thousands of ancient organisms. The final edition of his compendium, published in 2002 (three years after his death at age 50), lists dates for more than 36,000 genera.

A few years ago I had a chance to get closely acquainted with Sepkoski’s compendium, when I needed a machine-readable version of the timeline. The listings were published on CD-ROM (remember those?), but the files were merely unstructured plain text. I needed something I could compute with, and so I spent a week or two reformatting the records and importing them into a database. (Others have done the same thing. Shanan Peters of the University of Wisconsin–Madison maintains an online version.)

Here is the summary graph that was the goal of my data-conversion project; it shows the number of extant genera as a function of time, according to Sepkoski’s tally of comings and goings:

Spekoski.png

My brief hands-on experience with Sepkoski’s compilation gave me a sense of how much care went into its preparation. Getting any large data collection into a computer tends to be a fiddly process. Irregularities that a human reader would hardly notice are sand in the gears of automated text processing. Sepkoski’s data files caused less trouble than I expected. The problems I encountered were mainly trivial typographic anomalies—missing punctuation, erratic spacing—and even those were surprisingly rare. The only hints of potentially meaningful errors were a dozen pairs of duplicated entries, where the same genus appeared twice in the listings. It’s easy to see how that would happen in a project that went on for almost three decades; indeed, it’s amazing there weren’t more duplicates.

In any case, I came away from this project with great respect for Sepkoski’s accomplishment, but that doesn’t mean that the curve reproduced above represents the final word on the history of life. It’s not even clear that the main features of the curve and its overall shape give an accurate portrait of changes in global biodiversity.

In constructing any such historical time series, certain biases and distortions are hard to overcome. Of particular importance in this case, fossils from more recent intervals are more likely to survive and to be discovered than those from more ancient times. This “pull of the recent” effect raises questions about the steep upward trend that dominates the Sepkoski curve from the Cretaceous to the present. Has evolution really been going crazy with innovation throughout the past 150 million years, or is that hockey-stick curve an artifact of preservational and sampling bias?

A newly completed analysis of another big fossil database addresses this question (and others). The data source for the new analysis is the Paleobiology Database, a large collaborative project coordinated by John Alroy of the University of California–Santa Barbara. The Paleobiology Database might be called a metacompilation: It brings together statistical and descriptive information from thousands of more-specialized fossil collections (83,444 at the latest count). Initial work on the database began a decade ago (Sepkoski was an early contributor), but it has shown a recent growth spurt.

Of course the new database is vulnerable to the same kinds of systematic bias that Sepkoski had to confront. There’s no avoiding the fact that, on the whole, younger geological strata are more accessible and better studied, and younger fossils are better preserved. But by organizing the data differently and retaining more information about each taxonomic group, Alroy and his colleagues see an opportunity to correct or compensate for some of the biases. Of particular note, whereas Sepkoski recorded only the first and last known appearance of each genus, Alroy et al. attempt to keep track of every occurrence of an organism. This extra information allows sampling bias to be estimated and corrected.

Consider these hypothetical fossil records, where each dot represents a single occurrence of a fossil organism in one of nine labeled intervals:

Alroy.png

In both cases Sepkoski’s protocol would merely indicate that the taxonomic group originated in period 3 and became extinct in or after period 8. The new database records each time unit in which the fossil was found and, whenever possible, the number of occurrences per interval. This data might seem like superfluous detail. After all, if an organism was alive in periods 3 and 8, we can safely infer that it must have existed in periods 4, 5, 6 and 7 as well, whether or not fossil evidence has come to light. But it turns out that recording occurrences rather than just chronological ranges allows for some helpful statistical magic.

As I understand it, the scheme works something like this. Suppose we could gather together all the fossils ever collected by paleontologists, and sort them into bins according to age. Because of the various sampling and preservational biases, the bins for fairly recent periods (say 50 million years ago, in the Tertiary) would be much fuller than the bins for earlier times (say 400 million years ago, in the Devonian). Any bin with more specimens would be likely to exhibit more diversity as well, simply because rare organisms have a better chance of showing up at least once in a larger sample. But we can control for this bias through a simple subsampling procedure: Draw a fixed number of specimens from each bin, making each selection at random and with replacement. The counts of genera in the subsamples should reflect the true diversity of the biota in each bin.
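
Here is a toy Python sketch of that subsampling idea. Everything in it is hypothetical: the two bins, the genus names, the quota of 50 and the function name `subsampled_richness` are made up for illustration, and the real analysis is considerably more elaborate.

```python
import random

def subsampled_richness(bin_specimens, quota, trials=200, seed=0):
    """Mean number of distinct genera seen in `trials` random draws of `quota` specimens.

    Each draw is made with replacement, following the description above.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        draw = rng.choices(bin_specimens, k=quota)  # sampling with replacement
        total += len(set(draw))
    return total / trials

# Hypothetical bins; each list entry is the genus of one specimen.
common = [f"common_{i}" for i in range(10)]
# Large (recent) bin: 10 common genera x 90 specimens, 50 rare genera x 2 specimens.
large_bin = common * 90 + [f"rare_{i}" for i in range(50)] * 2
# Small (ancient) bin: the same common genera x 9, but only 10 rare genera, 1 specimen each.
small_bin = common * 9 + [f"rare_{i}" for i in range(10)]

raw_large = len(set(large_bin))   # raw genus count, inflated by the bigger sample
raw_small = len(set(small_bin))
sub_large = subsampled_richness(large_bin, quota=50)
sub_small = subsampled_richness(small_bin, quota=50)

print("raw counts:       ", raw_large, raw_small)
print("subsampled counts:", sub_large, sub_small)
```

In this toy setup the raw counts differ threefold (60 genera against 20), while the subsampled counts should come out much closer together, which is the point of drawing the same quota from every bin.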

In practice it gets more complicated than that, because we can’t actually sample the entire fossil record at the level of individual specimens; the best we can do is to randomly choose collections of fossils or the publications that describe them. And the publications vary greatly in how much quantitative data they include; some are just lists of species observed.

After many adjustments, refinements and calibrations, Alroy and 34 co-authors have published a diversity curve based on the subsampling technique:

Alroy.png

(Graph courtesy of John Alroy.)

Their article (subscription required) appeared last month in Science, along with 67 pages of supplementary material.

The Sepkoski and the Alroy graphs are twins separated at birth—widely separated. The overall upward trend still exists in the newer graph, but it is much less dramatic, especially in the past 100 million years. Some of the famous mass-extinction events, such as those at the end of the Permian (P) and at the end of the Cretaceous (K), are visible in the new graph but are altered in character; instead of a sudden crash after a sustained build-up, we see something more like a return to normal after a brief, sharp spike in diversity. (Alroy elaborates on the dynamics of mass extinctions in a second recent article, this one in PNAS.)

Looking at the two curves, I arrive at this question: How is the interested but nonexpert reader to evaluate these contrasting views of our planetary past? I want to emphasize that the question animating me is not “Who is right?” but “How can we know who is right?” Is there some way that the ordinary, scientifically literate outsider can form a reasoned judgment about such competing claims to truth?

It was questions like these that got me in trouble the last time I wandered into this area. In 2005 Richard A. Muller of the Lawrence Berkeley National Laboratory and Robert A. Rohde, a graduate student at UC Berkeley, published a report in Nature claiming to detect periodic cycles of rising and falling diversity in the Sepkoski data. Applying Fourier analysis to the time series, they reported finding a strong signal at a period of 62 million years and a weaker one at 140 million years. The claim was controversial from the start, and I decided to take a do-it-yourself approach to understanding the issue. I went back to the original data, reimplemented the analytic methods and tried to assess the robustness of the conclusion. I told the story in an American Scientist column.

The column pleased no one. It certainly didn’t please Muller and Rohde, who objected that I was out of my depth in my amateur attempt to replicate their work. It didn’t please the critics of the Muller-Rohde hypothesis, who thought my focus on certain narrow technical issues deflected attention from deeper conceptual flaws in the argument. And it didn’t please me, because I agreed with the criticisms from both sides.

I should also mention that my column had zero impact on the controversy, which not only continues to rage but has also been extended to the new database. Alroy writes in the PNAS article that some of the peaks and valleys forming the supposed cycles fail to materialize in the new data set. On the other hand, a preprint from Adrian L. Melott of the University of Kansas argues that cycles with periods of 62 and 150 million years emerge from the Paleobiology Database with higher statistical significance than they had in the Sepkoski collection.

All in all, I think I’ll sit this one out. I’ve been itching to get my hands on some records from the new database and implement the subsampling algorithm (which sounds both intriguing and readily accessible). It would be fun to play with these ideas. But I’ll let someone else have the fun this time.

Science builds its credibility on the bedrock idea that experiments and other kinds of results are subject to independent confirmation or refutation. And the advent of computational science has made this egalitarian ideal much more practical than it used to be. Although experiments in high-energy physics remain beyond the means of most amateurs, anything done with a computer rather than a particle accelerator is pretty much fair game these days. Still, there are bounds. If every reader set out to replicate every experiment, the world wouldn’t make much progress.

Big Money

Sunday, August 3rd, 2008

Zimbabwean bank notes, including a ZW$50,000,000,000 Special Agro-Check

(Photo courtesy ZeroOne.)

It’s a cruel irony: As the citizens of Zimbabwe sink into bitter poverty, they are becoming millionaires and billionaires. Inflation is eroding the value of the Zimbabwean dollar so rapidly that everyday transactions turn into lessons in the arithmetic of large numbers. When the photo above was made on July 17, the largest currency denomination in circulation was a note for ZW$50,000,000,000. Last week the nation’s central bank issued a ZW$100,000,000,000 bill. (I’ll spare you the trouble of counting zeroes: That’s 10^11, or 100 billion by American reckoning.)

The Zimbabwean inflation is the worst in the world at the moment, but it is not (yet) setting all-time records. Probably the most famous episode of extreme inflation was that of the German Weimar Republic (a story told vividly in Erich Maria Remarque’s novel The Black Obelisk). In 1921, German marks traded at about 60 to the U.S. dollar; two years later, in December of 1923, the exchange rate was 4.2×10^12 per dollar. The Hungarian inflation following World War II reached even greater numerical heights. In a single year the exchange rate for the Hungarian pengo went from 100 per U.S. dollar to 4×10^29. As Feynman said, astronomical numbers are dwarfed by economical ones.

Takayuki Mizuno, Misako Takayasu and Hideki Takayasu have analyzed the German and Hungarian episodes of “hyperinflation.” (Citation: Physica A 308 (2002) 411; there’s also an arXiv preprint.) Inflation at its worst, they find, proceeds at a doubly exponential rate. In other words, prices rise not just as an exponential function of time—exp(t)—but as an exponentiated exponential—exp(exp(t))—or:

doubleexpt.png

This growth law has a simple meaning in terms of everyday experience. With “ordinary,” single-exponential inflation, prices have a constant doubling time. If bus fare was 1 million last month and 2 million this month, it will be 4 million next month. Under double-exponential growth, the doubling time itself decreases exponentially. In the last months of the Hungarian inflation the doubling time fell from about 20 days to 15 hours.
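
The contrast can be made concrete with a short Python sketch. It assumes the forms p(t) = exp(a·t) for ordinary inflation and p(t) = exp(b·exp(c·t)) for the double-exponential case; the parameters a, b and c are hypothetical and are not fitted to the German or Hungarian data. The "doubling time" computed for the second case is the local value ln 2 divided by the instantaneous growth rate of log-price.

```python
import math

def doubling_time_single(a: float) -> float:
    """Doubling time for p(t) = exp(a * t); it does not depend on t."""
    return math.log(2) / a

def doubling_time_double(b: float, c: float, t: float) -> float:
    """Local doubling time for p(t) = exp(b * exp(c * t)).

    The growth rate of log-price is d(ln p)/dt = b * c * exp(c * t),
    so the local doubling time is ln 2 divided by that rate.
    """
    growth_rate = b * c * math.exp(c * t)
    return math.log(2) / growth_rate

# Hypothetical parameters, with time measured in days.
a, b, c = 0.05, 1.0, 0.05

for day in (0, 30, 60, 90):
    print(day,
          round(doubling_time_single(a), 2),
          round(doubling_time_double(b, c, day), 2))
```

The first column of doubling times stays fixed, while the second shrinks by the same factor, exp(-30c), over every 30-day step.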

On a logarithmic scale, a simple exponential function yields a straight-line graph. Here is the Mizuno-Takayasu evidence that the final phase of the Hungarian inflation was superexponential:

Mizunofg1.jpg

And here are the data for the final six months plotted as log(log(p(t))), showing a simple linear trend:

Mizunofg2.jpg

How does the Zimbabwean economy look when submitted to this kind of scrutiny? I don’t know of a reliable source of data on prices in Zimbabwe, but foreign exchange rates can serve as a rough proxy. Until three months ago, the official ZW$ rate was pegged at roughly 30,000 per US$, but on May 10 the currency was allowed to float free, and the rate immediately jumped to 190,000,000 ZW$ per US$. By July 31 the rate had reached 57,381,544,140. Thus the 50 billion ZW$ note in the photo above was worth a little less than 1 US$ by the end of last month. And that’s at the official rate of exchange; the street value is reportedly about a tenth of the official quote.

Here’s how the official exchange rate has varied in the 84 days between May 10 and August 1, as plotted on a linear scale:

ZW-rates.png

And here’s the same data after a logarithmic transformation:

ZW-log-and-fit.png

Although there’s more bumpiness here than in the Mizuno-Takayasu data, the trend looks reasonably linear to me. The fitted line has slope 0.03358, which yields a doubling time of about nine days. I see no hint of superexponential growth. I’d like to think this is an encouraging sign, a glimmer of hope that Zimbabwe will be spared an even more pernicious phase, when even inflation has inflation.
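
The step from slope to doubling time is a one-line calculation. The sketch below assumes the line was fitted to base-10 logarithms of the exchange rate with time in days; that assumption is what reproduces the nine-day figure, whereas a natural-log fit with the same slope would imply a doubling time of more than 20 days.

```python
import math

slope = 0.03358  # slope of the fitted line, in log units per day

# If log10(rate) rises by `slope` per day, the rate doubles every log10(2) / slope days.
doubling_days_log10 = math.log10(2) / slope

# For comparison only: the same slope read as a natural-log fit.
doubling_days_ln = math.log(2) / slope

print(round(doubling_days_log10, 2), round(doubling_days_ln, 2))
```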

Runaway inflation is usually blamed on the incompetence or malevolence of governments and the central banks that implement their policies. In the case of Zimbabwe, the government of Robert Mugabe certainly has a lot to answer for. The country was once the shining success story of southern Africa—I have friends who migrated across the continent to go to school there—but the nation is now a basket case, and inflation is only one of many urgent crises. (The unemployment rate is reported to be 80 percent.) The Mugabe regime can’t escape blame for this situation. Still, it seems that hyperinflation is not to be explained purely in terms of fundamental economic imbalances—too many dollars and not enough goods. Sometimes it seems there is also a psychological component. When you believe that prices will double next week, you raise your own prices in anticipation. It’s a self-reinforcing process.

One sign of such a feedback loop in the inflationary spiral is that inflation sometimes stops even though the underlying economic situation hasn’t really changed. The Weimar hyperinflation ended with the introduction of the Rentenmark, which was set equal to 10^12 old marks but really had no firmer backing than the earlier Papiermark. The change in currency did nothing to solve Germany’s problems of debt and unemployment, but the inflation ended anyway. Evidently, people chose to believe that the value of the Rentenmark would remain stable, and it did.

The central bank of Zimbabwe has just announced a similar effort at currency reform, devaluing the ZW$ by a factor of 10^10. In other words, the ZW$100,000,000,000 note introduced a week ago is equal in value to a new ZW$10 bill. According to press reports, the main motive for the change was simply logistical convenience:

Gideon Gono, the Central Bank governor, … acted because the high rate of inflation was hampering the country’s computer systems. Computers, electronic calculators and automated teller machines at Zimbabwe’s banks cannot handle basic transactions in billions and trillions of dollars. (AP/Baltimore Sun)

But perhaps one can hope that the newly denominated currency will bring more than numerical benefits. Over the weekend, the official exchange rate has held at 6.569 new Zimbabwe dollars to the U.S. dollar. We’ll have to wait a few more days to see if the curve has really flattened out.

Update 2008-09-04: With another month of exchange-rate data, here’s what the situation looks like:

ZW-rates-904.png

ZW-log-and-fit-904.png

The blue line in the semilog graph is the same as the one in the corresponding earlier graph—that is to say, it is fitted to the first 80 days of data. It appears that the inflation rate has diminished slightly since the revaluation at the end of July. But that slightly lower rate is still formidable; in a little more than a month the value of the new Zimbabwe dollar has fallen from about 15 cents (U.S.) to about 2 cents.

Update 2008-10-02: After another month, what passes for good news is that the rate of exponential growth does not seem to be growing:

On the other hand, news reports suggest that the situation in Harare is bleaker than ever. Money is scarce as well as nearly worthless; people stand in line all night for the privilege of withdrawing the equivalent of a dollar or two from their own bank accounts. (Note that the equivalent of $1 U.S. is $ZW137 in the devalued currency issued in August. In pre-devaluation Zimbabwe dollars, it comes to $ZW1.37 trillion.)

Isn’t it curious that both here in the U.S. and in Zimbabwe, the financial pages are filled with such enormous numbers?

Update 2008-11-02: One more month of data:

Still no sign of “hyperinflation”—if that term is taken to mean doubly exponential growth—but that can’t be much solace to the Zimbabweans whose currency has yet again lost three-fourths of its value over the course of a month. Adjusting for the August devaluation, one U.S. dollar now buys 5.6 trillion Zimbabwean dollars.

Spam stats

Thursday, June 5th, 2008

Hormel Foods, the Minnesota meatpacker, reports a surge in sales of Spam. News accounts attribute the rising popularity of the pink meat-in-a-can to higher prices for other commodities. Or maybe it’s the Spam musubi fad.

Meanwhile, the other kind of spam seems to be surging as well. I’ve been keeping track of my personal spam consumption for the past five years. (I first wrote about this in 2003, with a follow-up in 2007.) Here’s a record of the total number of messages landing in my spam bin each month since the start of 2007:

spamvolume.png

The lull last spring gave me some hope that spam was finally in decline; the monthly intake even fell below 1,000 messages in March and April. But the respite didn’t last. There was steady growth through last summer and fall, and now another spike in volume has brought the rate to nearly 3,000 messages per month.

The message counts charted above lump together spam sent to several email addresses. Here’s a breakdown by address, covering the entire 17-month period:

mailboxes.png

The two addresses that attract the most unwanted traffic—namely, my address here at bit-player.org and another at amsci.org—are both published openly on the web, without any form of obfuscation. So are the addresses identified in the pie chart as “il-perms” and “il-prints”; they appear on my industrial-landscape.org web site. I’m certainly not surprised that spammers have discovered these addresses; they are fair game to anyone who knows how to scrape a web site. But there are still some puzzles in the data. I have several more email addresses that are equally vulnerable—they are published in the same places—but they receive nary a spam. Why not? And my earthlink.net and acm.org addresses are not published (or even much used), yet they get a healthy share of junk mail.

The content of the spam remains much the same—replica watches, blue pills, pirate software, phishing expeditions. Numbingly repetitious. In one week I got 25 messages with the same subject line: “eBay New Unpaid Item Message from snorelax67.” Then there were the 34 messages with subject lines such as “Viadzgra - $1.20,” “Viabqgra - $1.75,” “Viafmgra - $1.09” and “Viategra - $1.38.” (Evidently someone has written a little program to insert random letter pairs in the middle of the word. My spam filter was not fooled. Nor did it fall for “Hihg - qualiyt repliacs of the ebst lcock of the wrold!!”) In “How Many Ways Can You Spell V1@gra?” I argued that most of the world’s spam is coming from a relatively small number of senders—tens or possibly hundreds, but not thousands—and I think the evidence continues to support that conjecture.

One interesting trend in my spam is that it seems to be growing more cosmopolitan. Back in 2003, about 18 percent of the spam I received was written in languages other than English; the figure now is 34 percent. The distribution of languages is curious. Here are the data for May 2008, when I received a total of 933 non-English spams:

spamlangs.png

Does everybody get gobs of spam in Russian, or is it just me? Is there something about my Internet activity that leads mailing-list compilers to believe I read Russian? Well, here’s the sad truth: My knowledge of Russian is so totally lacking that I’m not even sure all those messages are really Russian. They come with a Cyrillic character encoding, but for all I know some of them could be Bulgarian or Ukrainian. I’m equally in the dark about the 153 messages that appear to be written in various Asian languages (Chinese, Japanese, Korean). As for the German messages, they are something of a novelty. Until a few weeks ago, I almost never saw spam in German, and now there’s a sudden spate. It’s pretty clear that all of it comes from the same source. I’m seeing no French spam, nor Portuguese, nor Hindi, Urdu, Arabic, Hebrew.

Linguistic diversity is laudable, and in general I’m pleased to see challenges to Anglophone hegemony. I’m always flattered when someone addresses me in another language—even if I can’t respond in kind. But in this case I’m afraid there’s no reason to be congratulating myself. The spammers are not sending me these multilingual documents because they take me for an accomplished and urbane polyglot. They’re sending them to me (and to millions of others) because selectivity just isn’t worth the bother. Addressees like you and me are too cheap to count. Spam is becoming something like the cosmic microwave background radiation. It’s everywhere, it’s meaningless, it can be mistaken for birdshit.

Update 2008-07-01. More pink meat. I’ve tallied up the receipts for June, and my personal spam volume has set a new record: 3,354 messages, an increase of 20 percent over the previous high of 2,794 in May. The updated graph now covers 18 months:

spamvolume701.png

It’s worrisome to see the quantity growing so fast, but let me try to put the matter in perspective. Alongside the 3,354 spams I received in June, I also received 1,245 nonspam messages. Thus the proportion of spam is about 73 percent—well under the figure of 90 percent that’s often bandied about by companies that sell anti-spam products and services. Moreover, the spam causes me very little actual bother; almost all of it goes directly into the junk folder without need for human intervention. The nonspam messages, on the other hand, demand to be read and responded to. Perhaps I’d get more accomplished if more of my mail were spam.

I have not done a language analysis of the new batch, but I can tell at a glance that I’m still attracting a bizarre glut of Russian spam. A subject line that caught my eye reads:

programspam.png

I can sound out just enough Russian to guess the transliteration “programme spam.” Inside the message is an image of an advertisement (also in Russian) for various warez. But the decoy text that’s meant to get the message through the spam filters is a sports story written in German. Thus even individual messages are now becoming multilingual.

Update 2008-09-01: When I started this thread back in the spring, I thought I was taking note of a step function in the spam rate—a sudden jump from 2,000 a month to a new plateau at 2,500 a month. The trend looks different now: not a series of steps but sustained steady growth, with an increment of roughly 500 a month:

spamvolume901.png

Total spams received in my various inboxes came to 3,886 for July and 4,489 for August.

And I continue to be amazed and baffled by the quantity of Russian-language spam. The proportion of my spam written in a Cyrillic alphabet is now above 40 percent. The growth in Russian-language messages accounts for about two-thirds of the overall increase in the past few months. Should I read some geopolitical meaning into this trend?

The temblor forecast

Tuesday, April 15th, 2008

From the Associated Press, via the New York Times:

LOS ANGELES (AP) — California faces an almost certain risk of being rocked by a strong earthquake by 2037, scientists said in the first statewide temblor forecast.

New calculations reveal there is a 99.7 percent chance a magnitude 6.7 quake or larger will strike in the next 30 years. The odds of such an event are higher in Southern California than Northern California, 97 percent versus 93 percent.

caquake.jpg

I read this report with a certain sense of wonder. What impressed me was not the prediction itself; it’s not the first time I’ve heard that the Big One is coming. What took me by surprise was the level of mathematical sophistication that we can now take for granted in readers of the morning newspaper. No more do we have to worry that people will add up 97 percent and 93 percent to get 190 percent. Evidently, we’ve reached a state of universal numeracy, where everyone knows how to combine probabilities, and there’s no need to explain the calculation. We don’t even need to remind anyone that when we compute 1 – (1 – p)(1 – q), or p + q – pq, we are assuming that p and q represent probabilities of statistically independent events; everybody knows that. And everybody understands that in this context “a chance of a quake” really means “a chance of at least one quake.”

I guess the only place where we might still stumble is in actually doing the arithmetic. My calculator tells me the number is 99.8 percent, not 99.7.
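
For the record, here is the arithmetic as a few lines of Python, taking the two regional probabilities from the news story and assuming, as the formula does, that the two events are statistically independent.

```python
p_south = 0.97  # 30-year probability quoted for Southern California
p_north = 0.93  # 30-year probability quoted for Northern California

# Probability of at least one quake: the complement of "no quake in either region".
p_either = 1 - (1 - p_south) * (1 - p_north)

# The equivalent inclusion-exclusion form, p + q - pq.
p_either_alt = p_south + p_north - p_south * p_north

print(f"{100 * p_either:.1f} percent")
```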

A further note: The original report on which the news item is based leaves me even more perplexed. The probability model adopted in the forecast is explained as follows:

The simplest assumption is that earthquakes occur randomly in time at a constant rate; i.e., they obey Poisson statistics. This model, which is used in constructing the national seismic hazard maps, is “time independent” in the sense that the probability of each earthquake rupture is completely independent of the timing of all others. Here we depart from the… conventions by considering “time-dependent” earthquake rupture forecasts that condition the event probabilities… on the date of the last major rupture. Such models… are motivated by the elastic rebound theory of the earthquake cycle…; they are based on stress-renewal models, in which probabilities drop immediately after a large earthquake releases tectonic stress on a fault and rise as the stress re-accumulates due to constant tectonic loading of the fault.

In other words, it doesn’t sound as though the assumption of independence is even approximately satisfied. I must be missing something. The 99.7 percent combined probability is mentioned in the executive summary of the report, but I found no explanation of how that number was calculated.

Perhaps I shouldn’t worry so much. I live thousands of kilometers away in a zone of seismic serenity.

Update, several hours later: After reading a little more carefully, I think the report does assume that all possible earthquake sites are independent. At each site the probability of an event is a function of time, but it is independent of probabilities at other sites. Thus calculating a joint probability for the northern and southern parts of the state does seem to be a valid operation. And the distinction between “exactly one” and “at least one” doesn’t really enter into the matter either. That’s because the model is only valid until the next major earthquake occurs; after that, all bets are off, since the time-dependent probabilities have to be recalculated.

If this interpretation of the model is correct, I think the way the result is expressed is somewhat misleading. To say there’s a 97 percent chance in Socal and a 93 percent chance in Nocal implies there’s a high probability (90.2 percent) of seeing both events in the course of the 30-year period. But the model is no longer valid after the first quake.

I wonder if there isn’t a better way to express the concept at the heart of this story. Qualitatively, it’s easy enough to grasp: In the next 30 years there will almost certainly be a major earthquake somewhere in California, and the event is more likely to happen in the southern part of the state than in the northern part. Putting this into numbers is somewhat tricky—or at least I’ve had a lot of trouble with it. Having finally surrendered to the computer and performed a Monte Carlo simulation, I come up with this statement: There’s a 99.8 percent chance that the next major California earthquake will happen by 2037. If indeed such a quake occurs, the odds are about 57 to 43 it will hit in Southern California.
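
A simulation along these lines can be sketched in a few lines of Python. To be clear about what is assumed: the sketch below is not the report's time-dependent model. It gives each region a constant hazard rate (exponential waiting times), with the rates chosen so that the 30-year probabilities come out to 97 and 93 percent, and it treats the two regions as independent. The function name `simulate`, the trial count and the seed are illustrative choices.

```python
import math
import random

HORIZON = 30.0                  # years covered by the forecast
P_SOUTH, P_NORTH = 0.97, 0.93   # 30-year probabilities quoted in the news report

# Assumption: constant hazard rates, chosen so that 1 - exp(-rate * 30) = p.
RATE_SOUTH = -math.log(1 - P_SOUTH) / HORIZON
RATE_NORTH = -math.log(1 - P_NORTH) / HORIZON

def simulate(trials=200_000, seed=1):
    """Return (share of trials with a quake inside the horizon,
    share of those trials in which the first quake is in the south)."""
    rng = random.Random(seed)
    any_quake = 0
    south_first = 0
    for _ in range(trials):
        t_south = rng.expovariate(RATE_SOUTH)  # wait until the next southern quake
        t_north = rng.expovariate(RATE_NORTH)  # wait until the next northern quake
        if min(t_south, t_north) <= HORIZON:
            any_quake += 1
            if t_south < t_north:
                south_first += 1
    return any_quake / trials, south_first / any_quake

p_any, p_south_first = simulate()
print(f"major quake within {HORIZON:.0f} years: {p_any:.1%}")
print(f"first quake is in the south:   {p_south_first:.1%}")
```

Under these assumptions the exact answers can also be worked out by hand: the chance of a quake within the horizon is 1 – 0.03 × 0.07, or 99.79 percent, and the southern share is ln 0.03 / (ln 0.03 + ln 0.07), or about 56.9 percent. The simulated figures should land close to both.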

Measure twice, average once

Friday, December 7th, 2007

[Photograph: plywood panel with seven measurements in crayon or magic marker]

Whenever Norm Abram tells me to “measure twice, cut once,” I wonder what I’m supposed to do if the two measurements disagree. Perhaps I should measure a third time, in hope of settling the question by majority rule; but then I might well wind up with three discrepant values.

Strolling by a construction site the other day, I came upon the plywood panel shown above. There was no one around to help me interpret these curious scrawled measurements, but I could easily enough imagine the scene. A carpenter—Skilsaw at the ready—is surrounded by a group of statisticians and decision theorists eager to advise him on where to make the cut.

“Obviously,” says the first consultant, “we take the average—the arithmetic mean. Gauss proved 200 years ago that the sample mean is always the best estimator for a measurement subject to normally distributed random errors.”

“Actually, he proved just the opposite,” says another hardhatted and hardheaded savant. “He started by assuming that the mean is the most probable value, and then he invented the normal distribution as a way of ensuring that this rule will hold.”

“Whatever. But we’ve come a long ways since 1805. We know that the mean is an admissible estimator. Even without assuming a normal distribution, the sample mean is the estimator that minimizes the sum of the squared errors.”

“But who says the sum of the squared errors is the function we want to optimize? It’s just one of many possibilities. And it gives undue influence to the extremes of the distribution. In this case, the presence of that peculiar-looking eight-and-an-eighth value pulls the mean down to 55.875. Is that really where we should saw the board?”

“That 8.125 is obviously an outlier. Somebody was reading the wrong end of the tape measure. Excluding that bogus value, the mean is 63.833.”

“If you’re going to be picking and choosing which data points to trust, what about the one at the upper right? I’m not even sure I can read it: 64 and seven-eighths? And somebody seems to have crossed it out. Maybe we should drop that one, too.”

“And 64 is the only other item that isn’t circled. That must mean something.”

Another direction is suggested: “Instead of Gauss’s sum of the squared errors, we could adopt the criterion of Laplace, the sum of the absolute errors. With this choice, the favored estimator is the median rather than the mean. The median of our data is 63.625. And the median is much less sensitive to outliers and strangely shaped distributions. Whether we include or exclude the eight-and-an-eighth measurement makes only a minor difference.”

“What makes you all so sure we’re seeing several attempts to measure the same quantity? I think we actually have three distinct sets of measurements here, which just happen to be scribbled on the same piece of wood. The eight-and-an-eighth is clearly on its own. The two uncircled measurements form another set. And then we have four circled values all clustering around 63-and-something. If we want to simultaneously optimize the least-squares error for all three sets, we should be using a James-Stein estimator, which shrinks the average of each set toward the overall average.”

At this point a Bayesian is heard from. Others mention maximum likelihood, Pitman’s measure of closeness, minimum variance, the method of moments….

The conversation does not end here, but the rest is lost in the whine of the power saw. The carpenter has cut off the plank somewhere out beyond 64 inches and explains this choice as follows: Cutting long may mean cutting twice, but cutting short means buying twice.

*       *       *

One lesson you might draw from this little farce and fable is that if you have a hard decision to make, you should call a carpenter rather than a statistician. But that’s not the conclusion I intended.

You sometimes get the impression that statistics is a dry and lifeless discipline, where all the interesting questions were answered long ago, and all that remains now is to memorize some formulas and learn when to apply them. I think not!

Problems in statistics don’t get much simpler than this one. It concerns a small set of observations, with one variable in one dimension and one parameter to be estimated. It’s a problem that would have been perfectly familiar to Gauss and Laplace, Legendre and Adrain. And yet there’s still room for doubt and controversy about how best to approach such questions.
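
The consultants' arithmetic is easy to replay. The Python sketch below computes the mean with and without the suspect value, then the median, and checks by brute force that the mean minimizes the sum of squared errors while the median minimizes the sum of absolute errors. One loud caveat: the text above names only three of the seven readings (8 1/8, 64 and what looks like 64 7/8). The four circled values in the code are hypothetical stand-ins, picked only because they reproduce the summary figures quoted in the dialogue; they are not readings from the panel.

```python
from statistics import mean, median

# Named in the post: 8.125, 64 and (probably) 64.875.
# HYPOTHETICAL: the four circled readings are not listed in the text. These
# stand-ins were chosen only so that the quoted summary figures come out right.
circled = [63.25, 63.5, 63.625, 63.75]
readings = [8.125, 64.0, 64.875] + circled
trimmed = [x for x in readings if x != 8.125]  # drop the suspected outlier

def squared_loss(c, data):
    """Gauss's criterion: sum of squared errors about the cut point c."""
    return sum((x - c) ** 2 for x in data)

def absolute_loss(c, data):
    """Laplace's criterion: sum of absolute errors about the cut point c."""
    return sum(abs(x - c) for x in data)

# Candidate cut points from 0 to 70 inches in steps of an eighth of an inch.
candidates = [k / 8 for k in range(8 * 70 + 1)]
best_squared = min(candidates, key=lambda c: squared_loss(c, readings))
best_absolute = min(candidates, key=lambda c: absolute_loss(c, readings))

print(f"mean of all seven:       {mean(readings):.4f}")
print(f"mean without 8.125:      {mean(trimmed):.4f}")
print(f"median of all seven:     {median(readings):.4f}")
print(f"median without 8.125:    {median(trimmed):.4f}")
print(f"least-squares cut point: {best_squared:.4f}")
print(f"least-absolute cut point: {best_absolute:.4f}")
```

With these stand-in values the two means are 55.875 and 63.833 and the median is 63.625, as in the dialogue (by construction). The grid search picks out the mean under the squared-error criterion and the median under the absolute-error criterion, and dropping the 8 1/8 reading moves the median by only a sixteenth of an inch.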

I found the plywood puzzle challenging enough that I was led to do some reading. Most of it is well above my grade level, and so I can’t claim to have absorbed everything the authors have to say. But I’ll offer a few pointers in case anyone else wants to follow along:

  • Colin R. Blyth (1951) directly confronts the Norm Abram question: How do you decide when to stop measuring and start cutting? I gather that this paper was a major landmark in estimation theory. R. H. Farrell (1964) follows up on related themes. (A number of other papers could be mentioned in the same context; I draw attention to these two because they are freely available online through Cornell’s Project Euclid.)
  • There’s an “Introduction to Estimation Theory” by Don Johnson of Rice at the Connexions web site. The context is signal processing, but there’s plenty of use to carpenters.
  • For the history of statistics, Stephen Stigler is always the place to start. His article on “Gauss and the Invention of Least Squares” is chapter 17 in Statistics on the Table (Harvard University Press, 1999). The original 1981 version from Annals of Statistics is online through Project Euclid.
  • For a gentle introduction to the James-Stein estimator, I recommend a Scientific American article by Bradley Efron and Carl Morris, “Stein’s Paradox in Statistics” (Vol. 236 No. 5, May 1977, pp. 119–127). (Disclaimer: I was the editor of that article.)
  • Finally, at the moment I’m halfway through Pitman’s Measure of Closeness: A Comparison of Statistical Estimators, by Jerome P. Keating, Robert L. Mason and Pranab K. Sen (SIAM, 1993). I really don’t yet know what to make of this, but it has opened up a world I knew nothing about.