The temblor forecast

Tuesday, April 15th, 2008

From the Associated Press, via the New York Times:

LOS ANGELES (AP) — California faces an almost certain risk of being rocked by a strong earthquake by 2037, scientists said in the first statewide temblor forecast.

New calculations reveal there is a 99.7 percent chance a magnitude 6.7 quake or larger will strike in the next 30 years. The odds of such an event are higher in Southern California than Northern California, 97 percent versus 93 percent.

I read this report with a certain sense of wonder. What impressed me was not the prediction itself; it’s not the first time I’ve heard that the Big One is coming. What took me by surprise was the level of mathematical sophistication that we can now take for granted in readers of the morning newspaper. No more do we have to worry that people will add up 97 percent and 93 percent to get 190 percent. Evidently, we’ve reached a state of universal numeracy, where everyone knows how to combine probabilities, and there’s no need to explain the calculation. We don’t even need to remind anyone that when we compute 1 – (1 – p)(1 – q), or p + q – pq, we are assuming that p and q represent probabilities of statistically independent events; everybody knows that. And everybody understands that in this context “a chance of a quake” really means “a chance of at least one quake.”

I guess the only place where we might still stumble is in actually doing the arithmetic. My calculator tells me the number is 99.8 percent, not 99.7.
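The arithmetic can be checked in a few lines of Python. This is only a sketch: it assumes, as the formula does, that the northern and southern events are statistically independent, and the function name `at_least_one` is made up for the illustration.

```python
def at_least_one(p, q):
    """Probability that at least one of two independent events occurs."""
    return 1 - (1 - p) * (1 - q)

p_south = 0.97  # 30-year probability quoted for Southern California
p_north = 0.93  # 30-year probability quoted for Northern California

combined = at_least_one(p_south, p_north)
print(f"{combined:.1%}")  # 99.8%
```

The complement of "at least one" is "neither," which has probability 0.03 × 0.07 = 0.0021; subtracting from 1 gives 0.9979.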

A further note: The original report on which the news item is based leaves me even more perplexed. The probability model adopted in the forecast is explained as follows:

The simplest assumption is that earthquakes occur randomly in time at a constant rate; i.e., they obey Poisson statistics. This model, which is used in constructing the national seismic hazard maps, is “time independent” in the sense that the probability of each earthquake rupture is completely independent of the timing of all others. Here we depart from the… conventions by considering “time-dependent” earthquake rupture forecasts that condition the event probabilities… on the date of the last major rupture. Such models… are motivated by the elastic rebound theory of the earthquake cycle…; they are based on stress-renewal models, in which probabilities drop immediately after a large earthquake releases tectonic stress on a fault and rise as the stress re-accumulates due to constant tectonic loading of the fault.

In other words, it doesn’t sound as though the assumption of independence is even approximately satisfied. I must be missing something. The 99.7 percent combined probability is mentioned in the executive summary of the report, but I found no explanation of how that number was calculated.

Perhaps I shouldn’t worry so much. I live thousands of kilometers away in a zone of seismic serenity.

Update, several hours later: After reading a little more carefully, I think the report does assume that all possible earthquake sites are independent. At each site the probability of an event is a function of time, but it is independent of probabilities at other sites. Thus calculating a joint probability for the northern and southern parts of the state does seem to be a valid operation. And the distinction between “exactly one” and “at least one” doesn’t really enter into the matter either. That’s because the model is only valid until the next major earthquake occurs; after that, all bets are off, since the time-dependent probabilities have to be recalculated.

If this interpretation of the model is correct, I think the way the result is expressed is somewhat misleading. To say there’s a 97 percent chance in Socal and a 93 percent chance in Nocal implies there’s a high probability (90.2 percent) of seeing both events in the course of the 30-year period. But the model is no longer valid after the first quake.

I wonder if there isn’t a better way to express the concept at the heart of this story. Qualitatively, it’s easy enough to grasp: In the next 30 years there will almost certainly be a major earthquake somewhere in California, and the event is more likely to happen in the southern part of the state than in the northern part. Putting this into numbers is somewhat tricky—or at least I’ve had a lot of trouble with it. Having finally surrendered to the computer and performed a Monte Carlo simulation, I come up with this statement: There’s a 99.8 percent chance that the next major California earthquake will happen by 2037. If indeed such a quake occurs, the odds are about 57 to 43 it will hit in Southern California.
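One way such a simulation might be set up is sketched below. It is not the program used for the figures above, and it is much cruder than the report's time-dependent model: it assumes each half of the state has a constant hazard rate (an exponential waiting time) calibrated to the quoted 30-year probabilities, that the two halves are independent, and that only the first quake counts. All function and variable names are hypothetical.

```python
import math
import random

def simulate(p_south=0.97, p_north=0.93, horizon=30.0, trials=200_000, seed=1):
    """Estimate P(first quake within horizon) and P(it is in the south | it happens)."""
    rng = random.Random(seed)
    # Assumption: constant hazard rates, chosen so that the probability of a
    # quake within the horizon matches each quoted regional figure.
    rate_south = -math.log(1 - p_south) / horizon
    rate_north = -math.log(1 - p_north) / horizon
    hits = 0
    south_first = 0
    for _ in range(trials):
        t_south = rng.expovariate(rate_south)  # waiting time to the next southern quake
        t_north = rng.expovariate(rate_north)  # waiting time to the next northern quake
        if min(t_south, t_north) <= horizon:
            hits += 1
            if t_south < t_north:
                south_first += 1
    return hits / trials, south_first / hits

p_any, share_south = simulate()
print(f"quake within 30 years: {p_any:.3f}")
print(f"share in the south:    {share_south:.2f}")
```

Under these assumptions the exact values are 1 − 0.03 × 0.07 ≈ 0.998 for the first figure and ln 0.03 / (ln 0.03 + ln 0.07) ≈ 0.57 for the second, so the simulated estimates should land close to the 99.8 percent and 57-to-43 figures.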

Measure twice, average once

Friday, December 7th, 2007

[Image: plywood panel with seven measurements in crayon or magic marker]

Whenever Norm Abram tells me to “measure twice, cut once,” I wonder what I’m supposed to do if the two measurements disagree. Perhaps I should measure a third time, in hope of settling the question by majority rule; but then I might well wind up with three discrepant values.

Strolling by a construction site the other day, I came upon the plywood panel shown above. There was no one around to help me interpret these curious scrawled measurements, but I could easily enough imagine the scene. A carpenter—Skilsaw at the ready—is surrounded by a group of statisticians and decision theorists eager to advise him on where to make the cut.

“Obviously,” says the first consultant, “we take the average—the arithmetic mean. Gauss proved 200 years ago that the sample mean is always the best estimator for a measurement subject to normally distributed random errors.”

“Actually, he proved just the opposite,” says another hardhatted and hardheaded savant. “He started by assuming that the mean is the most probable value, and then he invented the normal distribution as a way of ensuring that this rule will hold.”

“Whatever. But we’ve come a long ways since 1805. We know that the mean is an admissible estimator. Even without assuming a normal distribution, the sample mean is the estimator that minimizes the sum of the squared errors.”

“But who says the sum of the squared errors is the function we want to optimize? It’s just one of many possibilities. And it gives undue influence to the extremes of the distribution. In this case, the presence of that peculiar-looking eight-and-an-eighth value pulls the mean down to 55.875. Is that really where we should saw the board?”

“That 8.125 is obviously an outlier. Somebody was reading the wrong end of the tape measure. Excluding that bogus value, the mean is 63.833.”

“If you’re going to be picking and choosing which data points to trust, what about the one at the upper right? I’m not even sure I can read it: 64 and seven-eighths? And somebody seems to have crossed it out. Maybe we should drop that one, too.”

“And 64 is the only other item that isn’t circled. That must mean something.”

Another direction is suggested: “Instead of Gauss’s sum of the squared errors, we could adopt the criterion of Laplace, the sum of the absolute errors. With this choice, the favored estimator is the median rather than the mean. The median of our data is 63.625. And the median is much less sensitive to outliers and strangely shaped distributions. Whether we include or exclude the eight-and-an-eighth measurement makes only a minor difference.”
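The contrast between the two criteria is easy to see numerically. In the sketch below, the readings 8.125, 64 and 64.875 come from the dialogue, but the four circled values are invented stand-ins, chosen only so that the summary figures quoted in the conversation come out the same; the actual readings are not given in the text. The helper names `sum_sq` and `sum_abs` are likewise made up for the illustration.

```python
from statistics import mean, median

# Hypothetical stand-ins for the four circled readings (not the real data).
circled = [63.25, 63.375, 63.625, 63.875]
uncircled = [64.0, 64.875]
outlier = 8.125

everything = circled + uncircled + [outlier]
trimmed = circled + uncircled  # same readings with the 8.125 dropped

print(mean(everything), median(everything))      # 55.875 63.625
print(round(mean(trimmed), 3), median(trimmed))  # 63.833 63.75

def sum_sq(c, xs):
    """Gauss's criterion: sum of squared errors about a candidate cut c."""
    return sum((x - c) ** 2 for x in xs)

def sum_abs(c, xs):
    """Laplace's criterion: sum of absolute errors about a candidate cut c."""
    return sum(abs(x - c) for x in xs)

m, med = mean(everything), median(everything)
print(sum_sq(m, everything) < sum_sq(med, everything))    # True: the mean wins on squared error
print(sum_abs(med, everything) < sum_abs(m, everything))  # True: the median wins on absolute error
```

Dropping the outlier moves the mean by about eight inches but the median by only an eighth of an inch.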

“What makes you all so sure we’re seeing several attempts to measure the same quantity? I think we actually have three distinct sets of measurements here, which just happen to be scribbled on the same piece of wood. The eight-and-an-eighth is clearly on its own. The two uncircled measurements form another set. And then we have four circled values all clustering around 63-and-something. If we want to simultaneously optimize the least-squares error for all three sets, we should be using a James-Stein estimator, which shrinks the average of each set toward the overall average.”

At this point a Bayesian is heard from. Others mention maximum likelihood, Pitman’s measure of closeness, minimum variance, the method of moments….

The conversation does not end here, but the rest is lost in the whine of the power saw. The carpenter has cut off the plank somewhere out beyond 64 inches and explains this choice as follows: Cutting long may mean cutting twice, but cutting short means buying twice.

*       *       *

One lesson you might draw from this little farce and fable is that if you have a hard decision to make, you should call a carpenter rather than a statistician. But that’s not the conclusion I intended.

You sometimes get the impression that statistics is a dry and lifeless discipline, where all the interesting questions were answered long ago, and all that remains now is to memorize some formulas and learn when to apply them. I think not!

Problems in statistics don’t get much simpler than this one. It concerns a small set of observations, with one variable in one dimension and one parameter to be estimated. It’s a problem that would have been perfectly familiar to Gauss and Laplace, Legendre and Adrain. And yet there’s still room for doubt and controversy about how best to approach such questions.

I found the plywood puzzle challenging enough that I was led to do some reading. Most of it is well above my grade level, and so I can’t claim to have absorbed everything the authors have to say. But I’ll offer a few pointers in case anyone else wants to follow along:

  • Colin R. Blyth (1951) directly confronts the Norm Abram question: How do you decide when to stop measuring and start cutting? I gather that this paper was a major landmark in estimation theory. R. H. Farrell (1964) follows up on related themes. (A number of other papers could be mentioned in the same context; I draw attention to these two because they are freely available online through Cornell’s Project Euclid.)
  • There’s an “Introduction to Estimation Theory” by Don Johnson of Rice at the Connexions web site. The context is signal processing, but there’s plenty in it of use to carpenters.
  • For the history of statistics, Stephen Stigler is always the place to start. His article on “Gauss and the Invention of Least Squares” is chapter 17 in Statistics on the Table (Harvard University Press, 1999). The original 1981 version from Annals of Statistics is online through Project Euclid.
  • For a gentle introduction to the James-Stein estimator, I recommend a Scientific American article by Bradley Efron and Carl Morris, “Stein’s Paradox in Statistics” (Vol. 236 No. 5, May 1977, pp. 119–127). (Disclaimer: I was the editor of that article.)
  • Finally, at the moment I’m halfway through Pitman’s Measure of Closeness: A Comparison of Statistical Estimators, by Jerome P. Keating, Robert L. Mason and Pranab K. Sen (SIAM, 1993). I really don’t yet know what to make of this, but it has opened up a world I knew nothing about.