Archive for the ‘books’ Category

Pub date

Tuesday, April 14th, 2009

For those of you who have been waiting patiently for the paperback edition of Group Theory in the Bedroom, and Other Mathematical Diversions: It’s finally here. Today is the official release date.

GTiB-pb-cover.jpg

For those of you waiting for the movie version, well, so am I.

Long division

Monday, February 9th, 2009

I’ve been dividing the continent again. Back in the summer of 2000, on a coast-to-coast car trip, I began wondering about the Great Divide, the line that traces the spine of North America, separating the Atlantic and Pacific watersheds. How would you identify that line out on the landscape, or on a topographic map? Eventually I wrote some programs to explore the problem, and then an American Scientist column. The column became a chapter in my book Group Theory in the Bedroom.

Noticescover.jpgWhat brings me back to the problem now is a wonderfully generous review of that book by David Austin, published in the February issue of the Notices of the American Mathematical Society (PDF). Bill Casselman, the Graphics Editor of the Notices, suggested putting a montage of my continental-divide maps on the cover of the issue. When I pulled out the old images to get them ready for republication, I also started reading through the code that generated them, and I checked back with the National Geophysical Data Center to make sure the base map was still available.

The topographic map I worked with in 2000 is called ETOPO5; it samples the elevation of the earth’s surface with a grid spacing of 5 minutes of arc. Roughly speaking, that works out to square pixels 10 kilometers on a side. ETOPO5 is still there on the NGDC site, but it’s listed as “deprecated.” The Center now offers finer-grained data sets: ETOPO2 has 2-arc-minute resolution and ETOPO1 has a pixel every 1 arc minute. There is also something called the GLOBE Database, which gets down to 30 arc seconds, corresponding to pixels about 1 kilometer on a side. I decided it might be fun to try to reproduce my old result with the higher-density data. Would the algorithm put the divide along the same path?

Here’s the answer to that question:

flood-2008-12-01-11.jpg

na30sec-rg-fat-450.jpg

The upper map is the original version from 2000; the lower one uses the 30-arc-second elevation data. The red line is the computed location of the Great Divide, separating the blue Atlantic from the green Pacific. Over most of the length of the divide, the two maps coincide pretty closely, but there’s a notable deviation in western Wyoming, just above the middle latitude of the region shown. On the low-res map, the Atlantic bulges westward from the Wind River Range all the way to the Utah border. In the high-res map, the Pacific has reclaimed that territory. Which map is correct? I’ll come back to that question, but first I want to point out that creating the higher-resolution image was not just a matter of dusting off the old program and running it on new data. What fun would that be, anyway?

When the project began in 2000, I was working on a brand new Macintosh G4 running MacOS 9.0.4, and my favorite programming environment was MacGambit, an implementation of the Scheme programming language created by Marc Feeley. By now, my old G4 is probably leaching heavy metals into some Asian landfill, and the software of that era is equally inaccessible. Nevertheless, the divide-drawing program was not beyond salvaging. After an hour of fiddling and tweaking, I got it running on new hardware with a new operating system and a new Scheme implementation. I was able to regenerate the same maps I produced nine years ago, using the same data as input. But working with the higher-resolution terrain models was another matter. A first try with the 30-arc-second data ran for 12 hours without producing a result.

The ETOPO5 map, with its 100-square-kilometer pixels, dices my slice of North America into a 720-by-300 array, for a total of 216,000 pixels. When I downloaded a swath of the GLOBE Database covering the same territory, I had a 7,200-by-3,000 array, or 21.6 megapixels. If we assume that the program’s running time is proportional to the pixel count—that it’s an O(n) program—then we should expect a 100-fold increase. That would be bad enough, but on reviewing the code I got worse news: I discovered that one step in the algorithm was likely to be quadratic in the number of pixels—O(n2) rather than O(n). That meant a 10,000-fold increase in running time.

Here’s the basic idea behind the divide-finding algorithm. You start by putting a drop of blue dye somewhere in the Atlantic and a drop of green dye in the Pacific. Then you gradually raise the water level, letting each ocean flood the adjacent shoreline pixel by pixel. (I named the main procedure global-warming.) Eventually, the blue and green waters have to meet; where they do, erect a red wall to keep them from mixing. As the waters continue to rise, extend the wall as needed. The path of the red wall is the continental divide. A simple invariant captures the essence of this process: A pixel p is on the divide if and only if, at some point in the process of raising the sea level, p is still unflooded but has neighboring pixels of different colors.

A crucial requirement of the algorithm is that the two oceans rise synchronously; otherwise, one ocean would reach the site of the divide first and would spill over into the other basin. In a physical model (i.e., real water) you could count on the self-leveling property of liquids to enforce the rule of simultaneity; but computational fluids are not so cooperative. To avoid spillage, I raised the level of each ocean in small increments, equal to the difference in elevation between the current water level and the height of the next highest pixel anywhere in the terrain. To implement this idea, I set up a baroquely complicated structure of stacks and queues, which kept track of all pixels currently at the land-water interface. That’s where the quadratic process intruded. I was adding each neighbor of each coastline pixel to a linked list, and checking in each case that the neighbor wasn’t already present in the list. Checking for uniqueness in an unordered list of n items takes n steps, and performing the check for each of the n items therefore takes n2 steps.

(Actually, the n in this formula is not the total number of pixels in the map. It’s the number of pixels on the waterline. How large could a list of waterline pixels grow? The length of the list is given by the fractal dimension of the coastline! That dimension is likely to be somewhere near 1.5. For the 7,200-by-3,000-pixel map, we can expect roughly 300,000 pixels in the list. Thus the situation isn’t quite as bleak as 21,600,0002, but the square of 300,000 is still a large number; I don’t want to have to do anything 300,0002 times.)

There are lots of remedies for this problem—a sorted list, a priority queue, a hash table. I wound up taking a really easy way out: I just ignored the problem. It’s much quicker to store a pixel twice and then discard the duplicate than it is to check in advance for uniqueness. No pixel can appear in the queue more than four times (once for each neighbor), and so the cost in wasted space is tolerable. The data structure I finally settled on is an array of stacks, with one stack for each elevation in the map. (There are roughly 4,000 distinct elevations, measured in integer units of one meter.) Pushing a pixel onto a stack is an O(1) operation. Popping a pixel off the array of stacks is usually O(1) as well, although there’s an additional small cost when we take the last pixel at a given elevation and have to search for the next occupied level.

When Bill Casselman proposed the Notices cover, I set out merely to revise and tidy up my old program, but inevitably I succumbed to the urge to scrap it all and start fresh. The new program is written in Common Lisp rather than Scheme—not for any deep reason, just because that’s the flavor of the week. The new version sifts through the 21-megapixel map in a matter of seconds, although writing out Postscript images at various stages in the process (125 megabytes each) takes somewhat longer.

So what about that Wyoming bulge? When I first noticed the discrepancy between the older and the newer maps, I thought I knew exactly what was going on. Some diagrams of the continental divide show a “bubble” in Wyoming, where the divide itself divides, and then the two branches rejoin, enclosing a basin that doesn’t drain into either ocean. I immediately guessed that my program was following one edge of the bubble with the low-resolution data and taking the other path with the high-res data. But that’s not the answer. The bulge is too large and in the wrong place. Here are details extracted from the two maps (one has been enlarged and the other reduced):

bulge-detail.jpg

The bubble, if it exists, is roughly in the position of the yellow dot, southeast of the prominent white “eyebrow” formed by the Wind River Range. In the low-res map, the divide takes an excursion 150 kilometers farther west, circling another major basin. I’ve seen no published sources that support this geography; all maps show the divide following the Wind River Range, as in the high-res image at right above. River drainage patterns on standard maps also make it clear: Streams on the western slopes of the Wind River Range flow into the Green River and then into the Colorado. Thus if you make a rest stop at Marbleton, Wyoming, near the center of these images, what you leave behind will end up in the Pacific, not the Atlantic. And so I have to conclude that my 2000 map of the divide, which has now been published three times (in American Scientist, in Group Theory in the Bedroom and on the cover of the Notices), takes us down the wrong path in this region.

The cause of the error could be a bug in my program—it wouldn’t be the first time—but I don’t think that’s what’s going on here. I believe that both maps correctly describe drainage patterns on the landscapes they represent, but those landscapes are not the real North American continent; they are checkerboard arrays of absolutely level, absolutely square tiles. Smaller tiles capture more detail than large ones. In particular, the 100-square-kilometer tiles of the ETOPO5 map flatten out small topographic features that might form a barrier to the flow of water. With closer scrutiny, perhaps I’ll be able to identify the specific site of the leak.

Interestingly, the “bubble” explanation that I hoped would account for differences between the 5-minute and the 30-second maps does appear to work in the case of the 1-minute and the 30-second maps. The image below superimposes the divides calculated by the algorithm for these two data sets, with the 1-minute map in blue and the 30-second map in red:

compare-1min-30sec.jpg

The paths diverge just southeast of the Wind River Range, forming a bubble very much like the one on standard topo maps. There’s another smaller bubble to the northwest, in the Yellowstone caldera. Everywhere else, the two trajectories agree to within a pixel or two.

If the Wind River bubble is a real feature of the North American landscape, shouldn’t a divide-finding program detect it? In the global-warming algorithm, if you start out by dyeing one pixel blue and another green, you are guaranteed to get a two-coloring of the entire map. The expansion of one color cannot stop until it butts up against the other color; there’s no way to leave an uncolored void. The series of diagrams below shows what happens.

bubble-diagram.png

At the far left is the initial state: a ridge with a bubble in the middle. In the second panel, the blue and green waters are lapping at the rim of the central basin. The barrier is perfectly symmetrical, with all pixels on the rim at the same height. Nevertheless, the symmetry is necessarily broken when the sea level is raised still further. The basin is filled either with blue or with green, and the divide runs only on one side; the result looks like either the third or the fourth panel. The choice is essentially random, determined by the sequence in which the pixels are flooded. The only way to create a bubble in the divide is to deliberately implant a seed for a third color in the interior of the basin, as shown at the far right.

The 2000 version of the program was hard-wired for two colors only. It knew about the Atlantic and the Pacific but had no provisions for other bodies of water. In the rewrite I allowed for additional basins. Below is a detail of a high-resolution map seeded to create independent watersheds in the Wyoming bubble and also in two other basins that don’t currently have significant natural drainage to an ocean.

western-basins.jpg

Here’s the computed outline of the bubble at 100 percent magnification:

bubble-closeup.jpg

It looks plausible enough, but I need to add a caveat: Just as the two-color version of the program is certain to produce a two-colored map, starting with three colors ensures that the map will show three watersheds, whether or not they exist on the landscape. For that bubble to be a real feature, under the strictest definition of a divide, the two branches of the divide must have the same minimum elevation. If two hikers walked along those paths, each carrying an instrument that records their minimum altitude, the readings would match when they reunited. The fact that the flooding algorithm shows a bubble in the divide does not guarantee that this condition is satisfied. Seeding a third color anywhere on the map will produce a basin of that color.

Quite apart from the whole question of continental divides and mathematical analysis of terrain, I have become enchanted by the ghostly view of landforms that emerges from a simple grayscale rendering of the high-resolution elevation maps.

s-appalaichia.jpg

Above is a region of southern Appalachia, where the feathery dendritic traces of drainage channels seem to be inscribed as positive images in some areas and negatives elsewhere.

volcanoes.jpg

On the west coast (above), we see more dramatic drainage patterns (the big river near the top is the Columbia) and also a glorious string of volcanoes lit up like a constellation in the night sky, from Mount Shasta at the bottom to the diamond configuration of Mount Hood, Mount St. Helens, Mount Adams and Mount Rainier at the top.

new-england.jpg

Finally there’s this disturbing image. I don’t think I would be able to identify the location if it weren’t for the telltale forms of Long Island and Cape Cod toward the bottom. The deep north-to-south fissure at the left is the Hudson River Valley; the even bigger diagonal gash at the top is the St. Lawrence. Together they cut off the New England states and the Atlantic provinces of Canada, which seem not to belong to the same continent as the rest of us. What we’re seeing here looks for all the world like a great mass of bedrock scraped clean by glacial action, but remember that this is not a photographic image or anything like one; it’s an elevation map. Most of what looks like polished bare rock is actually forested terrain.

Also intriguing is a line of white dots—they look like puffs of smoke—in the St. Lawrence valley. I thought at first they must be some artifact in the data, but a glance at satellite photos shows they’re quite real. They are the Monteregian Hills, isolated outcroppings of igneous (but not volcanic) rock exposed as softer surrounding material has washed away. One of the dots is Mont-Royal, the hill from which Montreal takes its name.

All these features are best appreciated in the full, high-resolution images. Frustratingly, I can’t show those here. I’m sure I’ve already transgressed on the bounds of blog politeness with this long and bulky post; adding a 125-megabyte image file would get me beaten up by the bandwidth bullies. The best I can offer is a pair of images that are full size in terms of pixel count (7,200 by 3,000) but have been JPEG compressed to a more manageable size. Click to download the unadorned 30-arc-second base map (1.2 MB) or the final continental-divide diagram, with bubbles and extra basins (1.8 MB). Almost forgot: There’s also an animated GIF that I did back in 2000. And if anyone wants it, I can post the Lisp source code.

There’s still more that might be done on the theme of continental divides. One project I have in mind is to add the rest of North America. I hear there’s a whole nother country up there beyond the 49th parallel, with drainage into the Arctic Ocean.

Update 2009-02-11: Thanks to all those who asked about the source code. I’m grateful not only for the interest but also for the nudge to properly document the program, which I would not have done if left to my own devices. Of course I found another bug. (The very last one, surely!) Because of an off-by-one error, the program was ignoring the very highest pixels in the landscape.

Anyway, the source code is now online here.

Quick responses to some of the comments:

Thanks for the tip on the Shuttle Radar Topography Mission data. It looks like a fabulous resource for some project focused on a smaller area, but I’m not quite ready to bump up the data volume by another factor of 900.

Graph cuts for image segmentation: New to me. Evidently I need to do some reading. Thanks for the pointer. (I am aware of connections to other work on image segmentation; this was discussed at the end of my 2000 article.)

The idea of inundating the landscape and then doing the analysis as the water drains away is intriguing, but I can’t see quite how to make it work. At the instant when the first peak of land emerges, the waters extend continuously in every direction. How do you deduce which area drains to which basin? Or am I missing something obvious?

On the problem of demarcating boundaries between oceans: The traditional definition of a topologist is someone who can’t tell a donut from a coffee cup. Now we have a new definition: Someone who goes swimming in Asbury Park and expects to come out of the water in Malibu. Yes, mathematically there’s no clear boundary between the Atlantic and the Pacific; all the waters communicate, and there’s really just one world ocean. But some of us can still tell the difference between New Jersey and California.

Norway: The NGDC data covers the globe, so you should be able to do a Scandinavian map. As for the lake that drains to both coasts, the weirdest geographic fact I know is something I learned from a reader of my American Scientist column: I’m told that you can take a boat up the Orinoco River in Venezuela and without ever leaving the water cross over into the Amazon basin, continuing downstream back to the sea in Brazil. The continental divide is underwater!

To Brooks Moses on the idea of automatically identifying basins: There are some corner cases where you need to be cautious. For example, what happens when the Atlantic and the Pacific both reach the rim of a basin at the same time? Nevertheless, I think you’re right, and a careful implementation of your idea would work as you describe. But is the output useful? In the low-resolution (5 minute) map of the U.S. there are at least 10,000 independent watersheds that your algorithm would identify and color. There would be even more in the high-res data. The spiderweb of divides becomes so dense that you learn nothing about overall drainage patterns. For other applications—especially in image analysis—finding segments without the need for manual seeding is what it’s all about, but the situation is not so clear for the peculiar problem of geographic watersheds.

Marketplace of Ideas interview

Friday, September 5th, 2008

A couple of weeks ago I had the pleasure of chatting with Colin Marshall, whose radio show “The Marketplace of Ideas” is broadcast and webcast from KCSB in Santa Barbara. The main subject of our talk was Group Theory in the Bedroom, but the conversation also wandered into topics like the nature of publishing in the Internet age. The interview is available in streaming audio or as an MP3 file or as a podcast distributed via the iTunes store.

And if you grow weary of my voice, you might browse some of the other interviews—maybe Denis Dutton of “Arts and Letters Daily,” Michael Shermer of The Skeptic, or Steve Wozniak.

Promoting my promotions

Saturday, May 24th, 2008

I’ll be on the radio Monday (May 26th) at 6 p.m., chatting with Dorian Devins about my recent book Group Theory in the Bedroom, and Other Mathematical Diversions. If you’re within shouting range of Jersey City, N.J., you can listen in on WFMU; otherwise, stream it live over the net; or wait for the podcast. (Incidentally, even if you have no interest in listening to my self-promotional rants, I recommend the WFMU web site for excellent advice and instructions on doing Internet radio.)

In other GTiBaOMD news, I received a Google Alert the other day, pointing me to a new review of the book. When I followed the link, I was momentarily perplexed. Before I could see the review, I had to press a button bearing the legend: “I am over 18 and agree to the viewing of sexually explicit material.” As it happens, the page that awaited me beyond this warning had no sexually explicit material at all. Indeed, that lack was the essence of the reviewer’s complaint:

Sex and Math. You would think I would be in heaven at the mere thought that somebody had written a book combining these two things. And this book would be heaven if it had combined those two things. Instead, sadly, this is a case where the title is both a little too literal and yet not quite accurate.

Point taken. It’s true that there’s little titillation in my tale of mattress-flipping. Considering that disappointment, the reviewer’s assessment is remarkably even-handed. (”Would I recommend this book? Sure, especially if you have a fancy for computing, math and solving problems.”) Thanks, Naughty Bookworm.

Hard covers

Thursday, April 3rd, 2008

My new book has come out this week: Group Theory in the Bedroom, and Other Mathematical Diversions, Hill and Wang, xi+269 pages, $25. ISBN-10: 0-8090-5219-9, ISBN-13: 978-0-8090-5219-9, Library of Congress Call Number: T185.H39 2008.

GTiBcover200.jpgThis is a collection of essays on themes that will be familiar to many readers of bit-player.org. Indeed, the essays themselves may be familiar! Eleven of the twelve essays appeared earlier as “Computing Science” columns in American Scientist; the twelfth was published in The Sciences. But there is some new material: Appended to each chapter is an “Afterthoughts” section where I confess to errors, castigate critics, bring outmoded notions up to date, and generally try to offer readers some excuse for selling them a $25 book, when they could find most of the content free on the web.

The book has its own handsome web site (I was lucky to snap up the domain name “grouptheoryinthebedroom.com” before some speculator squatted on it). I cordially invite all of you to go over there and have a look around. Meanwhile, back here at bit-player HQ, I’m going to throw myself a little party. See you in a day or two.

’Tis pleasant, sure, to see one’s name in print; A book’s a book, although there’s nothing in’t. —Lord Byron

Amazon poker

Thursday, May 10th, 2007

Investors are constantly checking the stock ticker, gamblers check the point spread, and everybody is forever checking their e-mail. For a writerly type like me, however, the unshakeable obsession is checking my Amazon sales rank. Amazon.com calculates a sales rank for every book listed on its Web site, and updates the ranking hourly. Here’s a graph of the hourly fluctuations in the ranking of my book Infrastructure: A Field Guide to the Industrial Landscape over the course of a single day last week:

Rankforest graph of Infrastructure sales rank 2007-05-03

At any given moment the current rank is listed in the “Product Details” section of the Amazon page for both the hardcover and the paperback editions. The graph above comes from a service called Rankforest, which also tracks both the hardcover and the paperback versions.

From an author’s point of view, there’s a lot of mystery in these numbers. How are the hourly rankings calculated, and what (if anything) do they mean? How do the rankings correlate with actual sales of the book? Amazon is not telling, and so authors and other interested parties have been left to speculate and experiment. The obvious experiment is to order a copy of the book and observe the effect of this purchase on the ranking. I welcome such experimentation with my book. (I would prefer that you do it with the hardcover edition, for which I get a more generous royalty.)

Morris Rosenthal of Foner Books seems to be the leading scryer of signs in this field. The interpretation of the rankings has also been discussed by Chris Anderson, the editor of Wired, in an article, a blog, and a book (current Amazon rank = 463, way above mine, dammit). It’s clear that the hourly ranking cannot simply reflect the number of copies sold within the past hour. After all, in any given hour the vast majority of books sell zero copies, and so there would be a gigantic tie for last place. The rankings also can’t be based in any simple way on total sales since publication, because the standings are much too volatile. The received wisdom is that each sale produces an uptick, followed by an exponential decay until the next sale.

I find all these speculations fascinating, but they are not what I want to write about today. My topic is even more trivial. I’ve been tracking my Amazon ranking for more than a year and a half now, ever since the hardcover edition was first listed in September 2005. (The paperback came out about a year later.) Here’s what the trend looks like:

Amazon rankings of Infrastructure since publication

The graph records the highest (i.e., numerically smallest) rank noted on each day, with rare gaps when I happened to be offline all day. Without question it would be fairer to use the daily average rather than the daily peak. But an author’s ego is a fragile thing, and so I have gone out of my way to make the outlook as rosy as I could.

Here are some of the actual numbers I recorded, for the month of April 2007:

   date          HB      PB
2007-04-01     138263   47878
2007-04-02     192152   22728
2007-04-03      29146   41862
2007-04-04      73628   57155
2007-04-05      37172   16948
2007-04-06     127858   12779
2007-04-07     171363   25212
2007-04-08      55256   20770
2007-04-09
2007-04-10      36333    7121
2007-04-11     112423   19015
2007-04-12     183063   40015
2007-04-13     225457   46781
2007-04-14     239879   29259
2007-04-15     142252   16030
2007-04-16      24803   39200
2007-04-17      93485   18939
2007-04-18     173691   26434
2007-04-19     217440   19360
2007-04-20      44426   17276
2007-04-21     213765   26652
2007-04-22      39014   21699
2007-04-23     150012   18598
2007-04-24      61301   46268
2007-04-25      33800   22474
2007-04-26      10603   39335
2007-04-27      19746   15984
2007-04-28      14027   19324
2007-04-29      16527   25999
2007-04-30       5844   63439

Notice anything out of the ordinary? What I’m looking at is not the overall pattern of rising and falling magnitudes but rather the inner patterns of digits within the numbers. As I’ve been writing down these rankings over the past 20 months, I have had the persistent impression of a peculiar overabundance of repeated digits. Just in this small sample we have 47878, 22728, 192152, 57155, 25212, 36333, 44426, 33800, 39335, 25999, and lots more. Is there something going on here? Is Jeff Bezos broadcasting secret signals hidden in the digit patterns of the Amazon sales rankings?

I consider myself a reasonably sophisticated probabilist. I know I’m not supposed to be shocked when I bring together 23 people and find that two of them share a birthday. And I know that when people try to generate a random sequence of digits by plucking numbers out of their imagination, the result is almost always too homogeneous, with a deficiency of repetitions and other patterns of the kind I’m calling attention to. Thus I was prepared to believe that the patterns I perceived were conjured up out of nothing—phantom regularities in purely random data. Still, day after day, I would note numbers like 55256 and 20770 (which in fact appeared on the same day, one for the hardcover, one for the paperback), and wonder about the odds of such coincidences. Maybe I wasn’t just letting my imagination run away with me.

Finally, this past weekend, I could stand it no longer; I had to find out.

I’m going to have to give away the punchline right now, lest it come as a disappointment later: There is nothing unusual about those numbers. The distribution of digits—the number of pairs and triples and what-not—is well within the range expected for randomly generated numbers of the same size. In other words, I have confirmed the null hypothesis. Often, such a negative result is considered unpublishable, but this one cost me a fair amount of effort, and so I’m determined to get a blog item out of it for better or worse. If the conclusion itself is not very interesting, perhaps I can find something to say about the techniques and technologies that led to it.

How do you decide whether a number like 36333 is too unusual to be a product of random processes? I decided to analyze the numbers in my sample as if they were poker hands (with the rules of poker adapted to a deck of cards with ten ranks and no suits). I confined my attention to the five-digit numbers, which are the most numerous in my sample; of the 850 rankings I had recorded, 458 were in the range between 10,000 and 99,999. Then I wrote a five-line program that takes any such number and classifies it as one of seven types of poker hands:

  • five of a kind (e.g., 77777)
  • four of a kind (36333, 11119)
  • full house (44555, 28288)
  • three of a kind (57155, 20900)
  • two pairs (97097, 28002)
  • one pair (36739, 14912)
  • bust (53208, 16897)

I decided to ignore straights; the poker hand 56789 is simply a “bust” according to this scheme of classification. The concept of a flush—all cards of the same suit—doesn’t arise, since there are no suits.

When I ran my little program over the 458 five-digit Amazon sales rankings, here’s what I got:

   hand            number       frequency

five of a kind        0          0.0000
four of a kind        3          0.0066
full house            7          0.0153
three of a kind      34          0.0742
two pairs            51          0.1114
one pair            233          0.5087
bust                130          0.2838

   TOTAL            458          1.0000

The question now, of course, is what I should expect to see in such a data set, on the hypothesis that the digit patterns are random rather than contrived for some secret, nefarious purpose. Should I find it remarkable that half of the hands have a single pair, or that more than 70 percent have a pair or better? What about those seven full-house hands—should that abundance arouse suspicion? To begin answering questions like these, we need to calculate the probabilities of the various hands.

Having just boasted of my sophistication as a probabilist, I must now confess that the main thing I’ve learned about calculating probabilities is that getting the right answer is highly improbable. I understand the principle of the thing. It’s just a matter of counting. You count the “success” cases and divide by the total number of cases. But counting—even though it tends to come early in the mathematical curriculum—is not always easy.

In the Amazon poker problem, the first trap for the unwary is in counting the total number of cases. You might think that with five decimal digits, there would be 105 possible arrangements, but in fact 104 of those arrangements are excluded from the sample, because numbers in this context cannot have a leading digit of zero. Thus the denominator in the probability calculation will be 90,000 rather than 100,000.

I think I can trust myself to calculate the probability of a five-of-a-kind hand. The first card dealt must not be a zero but can be any of the other nine digits; thereafter, each subsequent digit must be identical to the first one. Thus the number of ways of forming a five-of-a-kind hand is 9×1×1×1×1, and the probability of this outcome is 9/90,000, or 0.0001. You can expect five of a kind in one hand out of every 10,000, if the deal is fair.

I believe this answer is correct, and I am proud of having obtained it; on the other hand, the five-of-a-kind calculation is by far the easiest case. Allow me to try to work out a harder problem: the odds of a full house. I invite you to listen in on my so-called thought process:

Well, a full house is a hand that matches the pattern aaabb. The first digit can be anything but zero, and so there are nine possibilities, but then the second and third digits have to match the first. The fourth digit must differ from the first three, and so there are eight candidates left…. No, wait…. This time zero is allowed, and so there are nine possibilities again. Then the fifth digit has to be identical to the fourth. That gives us the product 9×1×1×9×1, or 81 successes out of 90,000 total cases, for a probability of 0.0009…. Did I get that right?… Of course not. What I’ve calculated is the probability of seeing the pattern aaabb in that precise sequence; I am counting occurrences of 11100, 11122, 11133, …, 99988, but I am not including sequences such as 23233 or 45454. Okay. So one approach, now that I have the number of aaabb sequences, is to multiply by the number of ways of permuting that sequence. We can just take the aaa and intercalate the two b’s in all possible positions: bbaaa, babaa, baaba,…. No, wait…. The b could be a zero, and so it’s not always allowed to appear in the first position. Hmmm. This is not going well. For the time being, let’s forget about the prohibition of leading zeros, and we’ll correct for it later. That way we can consider all possible permutations of aaabb. As a first step, take aaa, where a can be any of the ten decimal digits, and place a b is all possible positions, allowing b to assume any value that differs from a, so that there are nine choices. There are four places to put the b: baaa, abaa, aaba and aaab. Now, for each of these four sequences, the second b can be placed in any of five positions, and so there are 4×5 = 20 permutations overall. Which means that the total number of full houses is…. Hold on. No, no, no, no, no. Not all those 20 permutations are distinguishable. If we start with baaa and insert a b in either the first or the second slot, we wind up with bbaaa in either case; this sequence should not be counted twice. So how many permutations are there, really? Offhand, I don’t see any way to count them that’s easier than direct enumeration: bbaaa, babaa, baaba, baaab, abbaa, ababa, abaab, aabba, aabab, aaabb. That’s ten cases. Let’s sum up. We know that a can take any of ten values and b has nine possible values, and there are ten ways of arranging three a’s and two b’s. Thus the total number of combinations is 10×9×10 = 900. But, don’t forget, now we have to subtract away all those sequences that start with a zero. For this purpose it doesn’t matter whether the first symbol is an a or a b; exactly 10 percent of the sequences will have a leading zero. Thus the number of full houses is 900–90=810. This gives a probability of 810/90,000 = 0.009.

I happen to know, from an independent calculation, that this answer is correct, and so it’s time to stop. If I didn’t know, however, I might well go on to ask whether we need to interchange the a’s and b’s—that is, consider the case of sequences aaabb where a has only nine allowed values and b has ten. (Why don’t we have to take that into account?)

I am mildly embarrassed to put this lurching, stumbling, caricature of a probability calculation on public exhibition, and yet it is a fair description of how I often struggle with a problem of this kind. Am I the only one who suffers so? Not everyone does. I know people who could carry out the same computation quite deftly; they command the intuition, the spürkraft, to zero in immediately on the right approach, like a chess player who doesn’t waste time considering fruitless moves. I admire that kind of finesse, but I don’t possess it.

On the other hand, I know something else. I know another way to solve the problem, with less fuss. As I mentioned above, I have already cooked up a little program that can take any five-digit Amazon rank and classify it into one of the seven categories of poker hands. In milliseconds I can run that program on all possible five-digit numbers; after all, there are only 90,000 of them, and it’s quite easy to generate all of them in sequence. Then I can just count the number of hands in each category, and all the probabilities come tumbling out in a neatly formatted table:

   hand            number       frequency

five of a kind        9          0.0001
four of a kind      405          0.0045
full house          810          0.0090
three of a kind    6480          0.0720
two pairs          9720          0.1080
one pair          45360          0.5040
bust              27216          0.3024

   TOTAL          90000          1.0000

If Fermat or Pascal or the Bernoullis were asked their opinion on this approach to probability, something tells me they would find it distasteful. I’m ambivalent myself. Returning to the chess analogy, this is the equivalent of the machine that beats a grandmaster by brute force, scanning millions of positions but knowing nothing of strategy. You can win that way, but you don’t get any style points. More important, the method is frustratingly opaque. I get the answers, and I have reasonable confidence that they’re correct, but the computation gives me no understanding of why they’re correct. Still another objection is that the method does not scale well. If my Amazon rankings had ten digits instead of five (perish the thought!), I’d have a hard time classifying the nine billion possible hands. (But then again I’m not sure the more analytic method scales all that well either.)

Setting aside these qualms, we can now take the predictions of theory and the observations of the Amazon sample and compare them side by side:

   hand            prediction       observation

five of a kind        0.0001          0.0000
four of a kind        0.0045          0.0066
full house            0.0090          0.0153
three of a kind       0.0720          0.0742
two pairs             0.1080          0.1114
one pair              0.5040          0.5087
bust                  0.3024          0.2838

   TOTAL              1.0000          1.0000

Some of these frequencies match quite closely (three of a kind, one pair); others are a bit off the mark (full house, bust). In a finite sample, of course, you would never expect an exact match to the theoretical frequencies—but are the discrepancies we’re seeing significant or not? My personal instinct says that there’s nothing amiss here, that the observed frequencies are consistent with the null hypothesis. In other words, the rankings could just as well be random numbers. The old rule of thumb that the variation should be less than the square root of the observation leads to the same conclusion. We could quantify these intuitions by calculating variances or standard deviations, doing a Χ2 test, and so on. But if I can barely calculate a simple probability, can I be trusted to navigate all those treacherous subtleties such as choosing the correct number of degrees of freedom?

Again there’s another way to go about it, relying on lots of ignorant computation to replace a little smart mathematics. The question we want to answer is this: Given a set of 458 randomly generated five-digit numbers, what is the probability that the random set will differ from the predicted frequencies by at least as much as the observed Amazon set? Suppose we measure distance from the theoretical prediction in terms of the sum of the squared differences:

S^2=\\sum_{i=1}^7(y_i-\\bar{y}_i)^2

where the index i ranges over the seven types of poker hand, and the expression in parentheses is the difference between the observed and predicted frequency for each type of hand. (Technical note: The seven numbers entering into this statistic are not independent. For example, if full-house hands are in surfeit, there has to be a compensating deficiency somewhere else. How much should I worry about this?) For the actual Amazon results, S2 works out to 4.267×10–4. Now we can generate lots of batches of random Amazon poker hands, each batch consisting of 458 five-digit numbers, and calculate S2 for each batch. What proportion of them will have an S2 value exceeding 4.267×10–4? The answer, based on half a million batches, is 80 percent. Thus all the anomalies I thought I was seeing in those numbers are pure delusion.

Needless to say, I was rooting for another outcome. I would have enjoyed finding something spooky and inexplicable in the Amazon rankings. Instead, all I’ve proved is that my intuition about what random numbers look like is not to be trusted. This is a disappointment, but maybe I can salvage something from the ruins. Perhaps I can claim the discovery that the Amazon rankings are a fairly good source of random numbers.

Large-scale differences and movements in the rankings are surely not random. It’s not purely a matter of chance that my book’s current rank is 43,888 (what an interesting number!), while some preposterous tale about a pubescent wizard occupies the top of the list—even though that book hasn’t actually been published yet. (Do I sound bitter?) The rankings are nonrandom in another way as well: The first digits are not uniformly distributed but have a Benford or Zipf distribution, with an excess of ones and a shortage of nines. It’s interesting that even though the randomly generated numbers do not share this property—the first digits have a uniform distribution over the range 1 through 9—the poker-hand analysis shows that the frequency of pairs, triples, and other patterns is identical in the two data sets. Thus the digits to the right of the first digit do seem to have the statistical properties of random numbers. The source of the randomness is presumably the hourly reshuffling of the rankings by the actions of thousands of Amazon shoppers—actions that are all too predictable in the aggregate (curse you, Harry Potter) but quite random in detail.

Perhaps it’s worth noting that when the RAND Corporation prepared their famous book A Million Random Digits with 100,000 Normal Deviates in 1955, they also employed the poker test as a measure of randomness. I’m relieved to find that their calculation of the theoretical frequency of the seven types of hands agrees with mine. (They don’t say how they performed the calculation.) The current Amazon sales rank for the book is 2,668,928 (hardcover) and 281,270 (paperback).

Softer infrastructure

Thursday, September 14th, 2006

For those who’ve been waiting patiently… My book Infrastructure: A Field Guide to the Industrial Landscape is just out in paperback. Except for the cover and the price, it’s identical to the hard-cover edition (errors and all!). Statistics:

  • characters: 1,302,389
  • words: 219,748
  • sentences: 10,940
  • pages: 541
  • photographs: 735
  • weight: 4.6 pounds
  • dimensions: 10.3 x 10.3 x 1.2 inches
  • list price: $49.95 (US) hardbound, $35 paperbound
  • ISBN: 0-393-05997-9 hardbound, 0-393-32959-3 paperbound

Links for the paperback edition:

More info at http://industrial-landscape.com/index.html

End of sales pitch.

Beach reading

Thursday, August 3rd, 2006

A few years ago Michael Berry revealed that his first choice in reading matter, if he were stranded on a desert island, would be Abramowitz and Stegun, also known as Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, published by the National Bureau of Standards in 1964. For those of you seeking lighter fare for the beach this summer, I can recommend an article about Abramowitz and Stegun, written by David Alan Grier and appearing in the new issue of The American Mathematical Monthly. The article runs to 12 pages, making it considerably easier to fit in your tote bag than the xiv+1046 pages of the original.

Although I’ve had occasion to consult Abramowitz and Stegun from time to time, I have to admit that it never occurred to me to ask who Abramowitz and Stegun are or were, or how they came to compile their collection of formulas, graphs and tables. Yet it’s a fascinating story—a real page-turner! I don’t want to give it all away, so I’ll offer just a few teasers:

  • Grier dispatches one of the two protagonists in the very first paragraph. Milton Abramowitz helped frame the project, but: “His role ended on a hot summer’s day in 1958, when he unwisely decided to mow the lawn of his home in suburban Washington. Succumbing to the heat, he collapsed and died, leaving Stegun as the sole editor.”
  • Stegun is Irene Stegun (still living), who was “desperate for permanent employment” in 1943 and took a job with the Mathematical Tables Project, which was eventually absorbed into the National Bureau of Standards.
  • The Mathematical Tables Project was launched in 1937 under the Works Progress Administration, the New Deal unemployment-relief agency. It created jobs for 450 “computers” in New York City. “All of the project’s computers were drawn from the city’s welfare rolls and were desperately poor. Most had been unemployed for at least a year. Only a few had attended high school.”
  • Abramowitz and Stegun were not among the 450 computers but were mathematicians on the project’s Planning Committee. Yet even the members of this elite, according to Grier, “came from the margins of the mathematical community, from the ranks of people who lacked doctorates, or who were unemployed, or who were not fully considered scientists.”

Here are the bibliographic details for the Grier article: “Irene Stegun, the Handbook of Mathematical Functions, and the Lingering Influence of the New Deal,” by David Alan Grier, The American Mathematical Monthly, Vol. 113, No. 7, August-September 2006, pp. 585-597. It’s not available online. On the other hand, if your beach has WiFi, you can read Abramowitz and Stegun here and here and doubtless elsewhere.

Incidentally, Michael Berry’s second choice for his desert-island library is Gradshteyn and Ryzhik, or Table of Integrals, Series and Products, the Russian equivalent of Abramowitz and Stegun. This is also a book with lore and legend behind it. Specifically, the legend is that one of the authors was shot when an error in a table caused an aircraft or a missile to crash. That’s surely untrue, but I hope someday we’ll learn the real story. Maybe it will be next summer’s beach reading.

Playfair’s Powerpoint Presentation

Saturday, February 18th, 2006

“To those interested in the effective visual communication of quantitative phenomena, William Playfair’s Atlas is like the Bible: an ancient and revered book that is often cited but rarely read.”

—Howard Wainer and Ian Spence

The Commercial and Political Atlas is rarely read simply because it’s rare. Playfair published three versions of the book between 1786 and 1801, but apparently it has never been reprinted since then except in a microfilm edition. Looking at the microfilm a few years ago, I thought that someone really ought to put together a handsome facsimile edition, equipped with a learned introduction that would explain the work’s place in the history of scientific illustration. Now someone has done exactly that. Actually sometwo: Howard Wainer and Ian Spence. Wainer is with the National Board of Medical Examiners and the University of Pennsylvania and also has a recent book of his own on graphics and visualization, Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Spence is professor of psychology at the University of Toronto.

The new edition of Playfair is published by Cambridge University Press (ISBN 9780521855549, U.S. price $39.99). It presents the third and final edition of the Atlas, photographically reproduced from a copy held by the University of Pennsylvania. As a bonus, the volume also includes a later Playfair pamphlet, The Statistical Breviary.

Playfair’s work is treated with Biblical reverence because it was the first publication to include graphs and charts as a way of presenting quantitative information. The Atlas is full of line graphs or area graphs showing the balance of trade between Britain and various other countries through the 1700s. A bar chart made an appearance in one of the early editions of the Atlas but was later dropped. The Breviary has pie charts.

In their introduction, Wainer and Spence address the obvious question, which is not “How did Playfair come up with these novelties?” but rather “Why didn’t anyone do it before?” Although the costs and difficulties of reproducing graphics were surely a factor, Wainer and Spence argue that the major impediment was distrust of such graphic devices. Scholars and scientists of the time might have drawn graphs for their own private use, but in publications they wanted to see raw data, in big tables of numbers. Graphs and charts had the same doubtful reputation then as Powerpoint presentations do today.

And perhaps Playfair’s colleagues were justified in their skepticism. Wainer and Spence point out that Playfair’s interpolation often looks either careless or fanciful, and some of the circular charts in the Breviary suffer from perceptual problems that still plague graphic artists today. All the same, it’s great to have the book in one’s hands.