Archive for the ‘computing’ Category

Bits from its

Wednesday, April 15th, 2009

Sometime in the early 1980s my friend Greg Chaitin decided to go digital. He wanted to have all of his own writings, along with some other documents he particularly valued, ready at hand in machine-readable form. This was long before Postscript or PDF; scanners and OCR were exotic technology. So Greg hired a typist. I know about all this because he sent me a printout of his digitized oeuvre—an impressively thick sheaf of fanfold paper, with sprocket strips still attached. (The irony of this mode of distribution was not lost on either of us.)

I’m finally trying to catch up with Greg. Instead of hiring a typist, though, I’ve bought a cute little document scanner, which has been busily transforming heaps of moldering cellulose into shiny new bits. The scanner produces PDFs, which then run through an OCR program to build an index, making the page images searchable. (Bring on the Googlebot!) The result falls short of ideal—files are obese, graphics are low-res, there’s no opportunity to edit or reformat—but still it’s better than rummaging through cobwebby filing cabinets in my basement. And I don’t have to FedEx cartons of fanfold to my friends.

I’ll be adding links to my publications list as I upload the files. Here are a few teasers:

The HP-41C: A Literate Calculator? Byte, January 1981, pages 118-138. PDF (3.3 MB)

40,000 Points of Light. Pixel, Vol. 1, No. 2, May-June 1990, pages 36-41. [On a graphics language in which an image is defined as a locus of points.] PDF (2.4 MB)

Rank-and-File Thinking. Lotus, June 1985, pages 73-77. PDF (1.8 MB)

Theory & Practice: On the bathtub algorithm for dot-matrix holograms. Computer Language, October 1986, pages 21-32. [On patterns generated by severely oversampled signals.] PDF (4.6 MB)

The Information Age: The Numbering Crisis in World Zone 1. The Sciences, Vol. 32, No. 6, November-December 1992, pages 12-15. [On the impending shortage of telephone numbers.] PDF (1.8 MB)

I note for the record that every one of the publications mentioned in the list above is now defunct. I wonder if there’s any significance in that?

Computing in the classroom

Wednesday, April 8th, 2009

As the students file into the classroom, each of them reaches into a basket by the door and selects a ticket, on which a number is printed. For convenience, let’s assume the numbers are integers, they’re of reasonable size, and no two tickets bear the same number. Once everyone has settled down, the instructor assigns the class this exercise: Consult among yourselves and show me the ticket with the largest number.

How will the students solve this problem? Perhaps they pass all the tickets to one nimble kid in the first row, who thumbs through them looking for the maximum. This is a sequential algorithm, where the running time ought to be roughly proportional to N, the number of tickets. An obvious parallel solution is based on a tournament tree: Pairs of students compare their numbers, then the winners of each of these matches get together in pairs, and so on, until only one student remains. This routine yields an answer in about log2 N rounds of comparisons.

Here’s another idea. The students all stand up and shout out their numbers repeatedly, like commodity dealers in the trading pit. If a student hears a number larger than her own, she sits down and remains quiet thereafter. The last student standing has the largest number.

The efficiency of this shouting-match algorithm is hard to quantify. If we assume that everyone can yell and listen at the same time without confusion, then the problem of identifying the maximum number is solved instantaneously. For large N, however, that assumption is surely unrealistic.

A variant of the outcry algorithm is easier to analyze: A single student chosen at random calls out his number, and all those students with smaller numbers sit down. Then another student is selected from among those still standing, and the procedure is repeated until only one student remains. This algorithm also seems to have an expected running time of log2 N, since each round of eliminations should exclude (on average) half the students still in contention.

Suppose the students are asked to deliver the sum of the numbers, rather than the maximum. The tournament-tree computation adapts readily to this task, again requiring depth log2 N. On the other hand, I don’t know any reliable way to do addition by shouting.

What about calculating the average, or arithmetic mean, of the numbers? Of course this could be done by summing the numbers and then dividing by N—or by dividing first and then summing—but either of these strategies requires the students to count themselves before they can compute the average. Can’t they just use the tournament-tree algorithm to compute the average directly, as they did for the maximum and the sum? Suppose they replace the ‘max’ or ‘+’ operator with a new binary average operator, ‘•’, defined as (a • b) = (a + b)/2. Plugging this operator into the tree algorithm works just fine if N happens to be an exact power of 2, but otherwise it fails. In the case of N=5, for example, the computational tree might look like this:

averagetree.png

The four values on the left are combined in a regular, symmetrical tree to yield the correct average, but then the fifth value has to be inserted in an extra, ad hoc operation. Applying the binary average operator at the top node of the tree would give a result of 3.75, whereas of course the true average is 3. The underlying problem here is that the averaging operator is not associative: (a • (b • c)) ≠ ((a • b) • c). [Forgive me if all this is thunderingly obvious to everyone but me. It actually took me a while to figure it out.]

One could fix this problem by passing pairs of numbers up the tree; one member of each pair is the running average, and the other member is a weight equal to the number of numbers that went into calculating that average. But this is just the sum-and-divide procedure in disguise. Let me note that there is a very simple algorithm for approximating the average. Have the students form random pairs repeatedly; in each pair, both students replace the value on their ticket with the average of the two values (in other words, they apply the ‘•’ operator). In this procedure the students never actually calculate either N or the sum, but both of those quantities are conserved as invariants of the computation. As the random process continues, all of the tickets eventually converge on the same value, namely the overall average. Experiments suggest that about log2 N rounds of random coupling yield a reasonably good approximation.

In judging the merits of these algorithms, my criteria have more to do with style than with computational complexity. My dislike for the sum-and-divide method probably has no firmer basis than stubborn prejudice. All the same, a computer in which the processors are people does seem to call for a particular style of program. Ideally (in my view) an algorithm for the classroom computer would be fully parallel (every student gets an equal chance to play), homogeneous (all the students get the same instructions) and local (students interact in small groups, without needing to know much about the global state of the system—such as the value of N). An algorithm gets extra points if it exploits the mobility and flexible communication patterns of human computers.

So here’s another challenge for the class. As before, the students take numbered tickets, but now the instructor asks to see the ticket with the median number—the value that is larger than half the other numbers and smaller than half. (Let’s make life easy and assume that N is odd.) Here is a median-finding algorithm—a variation on a method sometimes called Quickselect—that I would like to believe a group of students might invent. First, all the students raise their right hand. Then they repeat the following series of steps until the median is identified.

  • Choose a student at random from among those who have a hand raised. This student’s ticket number is designated the pivot.
  • All students whose number is smaller than the pivot line up on the left and those with larger numbers line up on the right.
  • If the two lines are of equal length, then the pivot is the median. Stop.
  • Otherwise, all students in the shorter line lower their hand (if it’s not down already); the pivot student also lowers her hand.
  • Repeat.

The expected number of times the loop will be executed again appears to be approximately log2 N. Note that in this hand-waving algorithm the students never have to sort themselves. The method also requires no counting—not even for the comparison of line lengths. But we do assume that all the students can simultaneously compare their ticket numbers with the pivot.

The diagram below shows an example calculation. The green dots at the top of each configuration are the successive pivots; students holding lesser and greater numbers are queued up to the left and right of the pivot; blue dots represent numbers still eligible for consideration as pivots (i.e., students with raised hands); red dots are numbers that have been excluded because at some stage of the computation they were in the shorter queue or were pivots that turned out not to be the median.

mediandiagram.png

One could dream up plenty of other exercises to keep the students busy. They could search for subsets of numbers with special properties, such as Pythagorean triples. They could be asked to identify the longest sequence of consecutive integers in the set of numbers they hold. And then there’s the balanced subset-sum problem: The students divide themselves in two groups in such a way that the sums of the numbers in the groups are equal, or as close as possible to equal. (This is an NP-complete problem, but many instances yield easily to the right heuristics.)

The various marching-band patterns and parliamentary shouting matches described above are my own fantasies about how groups of people might “enact” algorithms. I’m curious to know how real students, in a real classroom setting, would actually go about solving such problems. Would they adopt the kinds of local and homogeneous algorithms I find aesthetically pleasing? Or would they favor a simpler and more pragmatic approach? In some cases I suspect that the ruse of passing all the tickets to the class whiz might well be fastest and most reliable way to get an answer. Moreover, for any other approach the time needed for the students to choose an algorithm and organize themselves may well exceed the execution time. Indeed, the very process of coming to agreement on how to solve a problem calls for significant group dynamics; this is not something we see in other kinds of parallel computers.

Another interesting point about people as processors is that their capabilities are not very sharply defined. For example, most algorithms assume that comparisons are strictly pairwise; a predicate such as ‘>’ or ‘=’ is applied to exactly two numbers, and the efficiency of the algorithm is measured by counting those operations. But it seems likely that three or four students could huddle together and find the maximum of their numbers about as quickly as two students could. Thus the complexity of some algorithms might be reduced to log3 N or log4 N.

On the other hand, I’ve assumed the existence of some primitive operations that may be difficult to realize. In particular, the axiom of random choice seems a little dubious in this context. If a group of students is asked to choose one their members at random, how do they go about it? Can they always do it in unit time?

There’s also the question of how such algorithms scale as N increases. A scheme that works well in a classroom with 30 students might not be ideal in a lecture hall with 300. And surely the process would go very differently in a stadium holding 30,000.

I assume that many experiments of this kind have been tried over the years, although I haven’t found much literature on the subject. (Perhaps the closest match is a 1994 article by Bachelis, James, Maxim and Stout; they also note the scarcity of published material.) If anyone has reports from the field, I’d be delighted to learn about them.

Update 2009-04-14: Thanks to the commenters for some particularly helpful and lucid responses. I would like to explore one of these themes a little further.

Mario Klingemann’s “small optimization” to the outcry algorithm is more than just a minor improvement; it corrects an outright bug in the procedure I described. Klingemann notes that when a student calls out a number and all the students with smaller numbers sit down, the calling student must also sit unless no one else remains standing. This is important! Consider the difference it makes when N=2.

So which version of the algorithm—my buggy one or Klingemann’s correct one—matches up with JeffE’s analytic result showing that the expected number of rounds is the Nth harmonic number? In the graph below the harmonic numbers H(N) are represented by the red line; the green and blue lines are empirical results averaged over 100,000 repetitions of programs implementing the two algorithms.

outcry.png

For a given N, the Klingemann result is H(N) – 1/2; the buggy algorithm yields H(N) + 1 (asymptotically, for large N).

Wrong number

Wednesday, February 25th, 2009

Is it just me, or does it happen to everyone? When I try to correct someone else’s arithmetic, I can count on making an error of my own. For instance: I was reviewing William Goldbloom Bloch’s clever and provocative book The Unimaginable Mathematics of Borges’ Library of Babel. I noted in passing a numerical error, trivial in significance even though it’s enormous in magnitude. Bloch had occasion to introduce the number:

blochnumber.png

To give a sense of this number’s size, he mentioned that it has 33 million digits. I pointed out that the true digit count is exponentially larger: 1033,013,740. Now I’ve received a correction to my correction, in a note from Michael Rosenthal:

The author [of the book] stated 33 million digits as the length, but Hayes getting closer states that the length is (10 to the 33,013,730). Isn’t the actual number of digits ((10 to the 33,013,730) + 1)?

Of course Rosenthal is right. There are 10n n-digit decimal numbers, but 10n is not one of them. On the other hand, Rosenthal also gives a double-barrelled demonstration of my point about the perils of correcting other people’s arithmetic: In the course of setting me straight, he makes a typographical slip of his own, twice mistranscribing 33,013,740.

My correspondence with Rosenthal provoked me to take a closer look at where that enormous number came from in the first place. This has led me into some diverting adventures with factorials and logarithms. But it has also put me in the position of proposing another correction to a published number, which means I am in grave danger of making still more mistakes.

Bloch’s book is a mathematical companion to “The Library of Babel,” a brief ficción by the Argentinian fabulist Jorge Luis Borges. In the Library of Babel all books are written in an alphabet of just 25 symbols. Each book has 410 pages, and each page has 40 lines of 80 characters, for a total of 1,312,000 characters per volume. We are told that the library possesses one copy of every possible volume composed in this way. Thus the number of volumes on the library’s shelves is 251,312,000. For convenience of reference let us call this number B (for Book, or Babel, or Borges, or Bloch).

The much larger number mentioned in my review enters Bloch’s story when he asks how many ways we might arrange the books on the library shelves. The answer, of course, is the factorial of B. If we imagine that the shelves have discrete slots that hold one book each, then we have B choices for the first slot, B–1 choices for the second slot, B–2 for the third, and so on, until we come to the final slot where there’s just one book left to choose. The product of all these numbers is B!.

So what sort of a number is B!? Can we have a look at it? In this age of computational wonders, surely we can just fire up Sage or Mathematica, or write a four-liner in Lisp, and have the digits of that number served to us on a silver platter, no?

No. We can’t.

After three decades of playing with computers, I am still childishly agog at the ease with which I can tap a few keys and calculate a gargantuan quantity such as 52!, the number of permutations of a standard deck of playing cards:

8065817517094387857166063685640376
6975289505440883277824000000000000

I can even calculate 10,000!, unleashing a Niagara of digits that spill out onto my screen in a long gray torrent—some 35,660 of them by the time the flood abates. (Digression: How many of those digits are trailing zeros? This is a question Fred Gruenberger would always ask when he talked about factorials in his old Popular Computing newsletter, back in the days when computing that many digits took a bit of doing.)

Computing has come a long way, but even with so much numerical horsepower at our fingertips, the direct calculation of B! is still totally unthinkable. If Bloch had been right about the 33 million digits, the computation might be feasible, if tedious, but computing a number with 1033,013,740 digits is just ridiculous, with or without the extra digit I forgot.

If we can’t compute B! outright, maybe we can at least get its logarithm. For a rough estimate of the magnitude—in other words, for counting the digits—the realm of logarithms is the right place to be. Since the factorial of n is a product,

factorial.png

the logarithm of the factorial has to be a sum:

logfactorial.png

I don’t know a quick and easy way to perform this summation, but I do know how to evaluate a slightly different sum, one where every term on the right hand side of the equation is log n. Since there are n of these terms, the sum is obviously just n log n. What we have calculated in this way is the logarithm of nn. Thus nn can be taken as a crude approximation—a very sloppy upper bound—to the value of n!. For our particular case we have log BB = B log B. When I plug in the numerical value of B, and use base-10 logarithms, I come up with

logbtob.png

Exponentiating both sides gives this answer:

btob.png

Hmm. There’s a problem here. This is supposed to be a loose upper bound, but it’s smaller than the value we were expecting for B!. Clearly, BB can’t be less than B!.

Bloch reports that he came to his estimate of B! via Stirling’s approximation. If you look up this bit of mathematical technology, you’ll find a formula something like this:

stirlingformula.png

It would be easy to translate the formula directly into computer code, but it’s pointless to do so if the aim is to calculate B!. Again the only practical way to proceed is to take logarithms of both sides. Then the product in the formula becomes a sum of three terms. The first term, log nn, we already know: It’s n log n. The second term, log e–n, is simply –n if we choose to work with natural logarithms; if we want to stick with base-10 logs, it’s –n/log 10. And the third term we don’t need to worry about; at the level of precision we can hope to achieve in these calculations, ½ log 2πn is too small to notice. So, keeping just the first two terms, we have a rough-and-ready approximation to Stirling’s approximation, which ought to work well enough for very large n. B certainly seems to qualify as a large n; when I plug in its value, here’s what I get:

blogbminusb.png

I have plodded my way slowly and methodically through the steps of this computation in an effort to persuade myself that I might finally have it right. At this point I’m fairly sure that log B! is closer to 101,834,103 than it is to 1033,013,740, but I’m chary of making a stronger claim. With all my crossing back and forth between the safe world of logarithms and the uninhabitable wasteland of exponentials, I worry that I’ve misplaced a factor of e or, worse, mistaken the logarithm of a number for the number itself. If I’ve goofed again somewhere in this calculation, I trust that readers will let me know.

I want to emphasize that my aim in telling this story is certainly not to embarrass Bloch, whose work I admire. His book does a splendid job of explaining ideas in combinatorics, number theory and even topology, in a literary context where such themes don’t get much exposure. His little numerical error—if it is an error—has no impact on the meaning of the passage where it appears.

Indeed, it’s hard to imagine a circumstance where the difference between the two proposed values of B! could have any earthly consequence. Both numbers are always going to be ghostly presences in our universe—actors whose voice we hear from offstage but who never show their face under the lights. A certain breed of radical constructivist would argue that the numbers we’re talking about don’t exist at all. I don’t find that doctrine helpful or persuasive, but I do think there’s a distinction worth making between numbers we can actually hold in our hands and these strange, outsized objects that come equipped with their own exclamation points. Computing with numbers this large is like working with toxic or radioactive materials. You never get to touch anything directly; all manipulations are performed by remote control from behind a blast shield. The experience can be exciting, but afterwards you don’t come away with a sense that you really know what you’re doing.

Fred Gruenberger would have wanted to know how many trailing zeros there are in B!. Can we answer?

Math, fonts, and HTML

Monday, February 16th, 2009

In the latest issue of American Scientist I write about the awkward and ugly business of trying to present mathematical typography on the web. All the while I was writing the column, I had constantly in mind the uncomfortable knowledge that the column itself—with all of its examples of equations and arcane symbols—would have to be prepared for presentation in HTML. The conversion went more smoothly than I expected, thanks to the energy and expertise of Anna Lena Phillips. Nevertheless, I recommend reading the PDF version.

One issue that arises in this story goes beyond the particular problems of making mathematical notation intelligible. Why is it that a web page can’t include the fonts that are needed to display the text properly? It’s routine for web documents to incorporate images and audio and video and various kinds of code that affect the display of information—CSS, JavaScript, etc. When it comes to fonts, however, the web server is not empowered to supply what’s needed; instead, it has to play a silly guessing game with the client computer. “Do you have ‘Times New Roman’?” the server asks. “No, well how about ‘Times’? ‘Georgia’? Okay, give me anything with serifs, please.” It would be easier for both parties if the server could just package up a font and send it along—or so it seems to me.

This is not a new idea. Most of the recent discussion of embedable or linkable fonts focuses on legal impediments, which seem quite manageable to me (though of course I’m not a lawyer, nor am I an owner of typeface copyrights). Are there other reasons to back away from such a proposal? The bandwidth required seems moderate by present-day standards, and there are obvious schemes for avoiding waste by sending only glyphs that are actually needed in a document and not already present on the client computer. Publishers and advertisers would doubtless abuse the facility with garish attempts to attract attention, but we survived the <marquee> and <blink> tags back in the nineties, and an attack of circus-poster typefaces seems no more obnoxious. Am I missing something obvious?

Distant shores

Thursday, February 12th, 2009

Before I put my toys away, here are two more views of continental divides. First, a broader look at North America, seeing it as more than a bicoastal continent. The southernmost incursion of the Arctic watershed is in the Red River valley, along the Minnesota-Dakota border.

na2min-rgy-fat.jpg

The map is based on NGDC data with 2-arc-minute resolution. Here’s a detail of the triple junction point:

na2min-rgy-triple.jpg

The calculated location of the ‘Y’ seems to correspond quite closely to Triple Divide Peak, in Glacier National Park in northwestern Montana. (The three streams flowing away from this peak are Atlantic Creek, Pacific Creek and Hudson Bay Creek.)


View Larger Map

Second, I’ve done Europe, shown below in views before and after the flood. (For these maps I’ve done some terraforming, plugging up the Straits of Gibraltar and the Bosporus and erecting a barricade between the Atlantic and the North Sea. The base map is at 1-arc-second resolution.)

eur1min-rgyc-1.jpg

eur1min-rgyc-fat.jpg

Finally, below, here’s a detail of the Alpine region where waters diverge toward all four points of the compass.

eur1min-rgyc-detail.jpg

If I had a hammer

Thursday, February 12th, 2009

Since I’m a bit-player rather than a bit-worker, I generally stick to toy-sized problems. However, in recent weeks I’ve been fooling around with multimegabyte elevation maps, and I’ve had trouble scaling up. What I’ve found most challenging is not writing programs to digest lots of megabytes; instead, it’s the trivial-seeming data-preparation tasks that gave me fits.

At one point I had five plain text files of about 30 megabytes each that I needed to edit in minor ways (e.g., removing headers) and then concatenate. What is the right tool for that chore? I tried opening the files in various text editors, but the programs weren’t up to the job. Some quit when I attempted to load a big file. Others opened the file but then quit when I tried to scroll through the text. Some programs didn’t quit, but they became so lethargic and unresponsive that eventually I quit.

Note that I’m talking about big files but not huge ones. At most they run to a few hundred megabytes, a volume that ought to fit in memory on a machine with a few gigabytes of RAM.

Surely this is a common problem. Is there some obvious answer that everyone but I has always known?

Eventually I did manage to finish what needed doing. I discovered that a very modest Macintosh hex editor called 0xED—meant for editing binary files more than text files—would do the trick instantly. 0xED opened the largest files, let me scroll through and make changes, and saved the new version back to disk—all without fuss. But I still have the feeling that I’m pounding nails with a monkey wrench, and I’d like to know if there’s a hammer designed just for this task.

Long division

Monday, February 9th, 2009

I’ve been dividing the continent again. Back in the summer of 2000, on a coast-to-coast car trip, I began wondering about the Great Divide, the line that traces the spine of North America, separating the Atlantic and Pacific watersheds. How would you identify that line out on the landscape, or on a topographic map? Eventually I wrote some programs to explore the problem, and then an American Scientist column. The column became a chapter in my book Group Theory in the Bedroom.

Noticescover.jpgWhat brings me back to the problem now is a wonderfully generous review of that book by David Austin, published in the February issue of the Notices of the American Mathematical Society (PDF). Bill Casselman, the Graphics Editor of the Notices, suggested putting a montage of my continental-divide maps on the cover of the issue. When I pulled out the old images to get them ready for republication, I also started reading through the code that generated them, and I checked back with the National Geophysical Data Center to make sure the base map was still available.

The topographic map I worked with in 2000 is called ETOPO5; it samples the elevation of the earth’s surface with a grid spacing of 5 minutes of arc. Roughly speaking, that works out to square pixels 10 kilometers on a side. ETOPO5 is still there on the NGDC site, but it’s listed as “deprecated.” The Center now offers finer-grained data sets: ETOPO2 has 2-arc-minute resolution and ETOPO1 has a pixel every 1 arc minute. There is also something called the GLOBE Database, which gets down to 30 arc seconds, corresponding to pixels about 1 kilometer on a side. I decided it might be fun to try to reproduce my old result with the higher-density data. Would the algorithm put the divide along the same path?

Here’s the answer to that question:

flood-2008-12-01-11.jpg

na30sec-rg-fat-450.jpg

The upper map is the original version from 2000; the lower one uses the 30-arc-second elevation data. The red line is the computed location of the Great Divide, separating the blue Atlantic from the green Pacific. Over most of the length of the divide, the two maps coincide pretty closely, but there’s a notable deviation in western Wyoming, just above the middle latitude of the region shown. On the low-res map, the Atlantic bulges westward from the Wind River Range all the way to the Utah border. In the high-res map, the Pacific has reclaimed that territory. Which map is correct? I’ll come back to that question, but first I want to point out that creating the higher-resolution image was not just a matter of dusting off the old program and running it on new data. What fun would that be, anyway?

When the project began in 2000, I was working on a brand new Macintosh G4 running MacOS 9.0.4, and my favorite programming environment was MacGambit, an implementation of the Scheme programming language created by Marc Feeley. By now, my old G4 is probably leaching heavy metals into some Asian landfill, and the software of that era is equally inaccessible. Nevertheless, the divide-drawing program was not beyond salvaging. After an hour of fiddling and tweaking, I got it running on new hardware with a new operating system and a new Scheme implementation. I was able to regenerate the same maps I produced nine years ago, using the same data as input. But working with the higher-resolution terrain models was another matter. A first try with the 30-arc-second data ran for 12 hours without producing a result.

The ETOPO5 map, with its 100-square-kilometer pixels, dices my slice of North America into a 720-by-300 array, for a total of 216,000 pixels. When I downloaded a swath of the GLOBE Database covering the same territory, I had a 7,200-by-3,000 array, or 21.6 megapixels. If we assume that the program’s running time is proportional to the pixel count—that it’s an O(n) program—then we should expect a 100-fold increase. That would be bad enough, but on reviewing the code I got worse news: I discovered that one step in the algorithm was likely to be quadratic in the number of pixels—O(n2) rather than O(n). That meant a 10,000-fold increase in running time.

Here’s the basic idea behind the divide-finding algorithm. You start by putting a drop of blue dye somewhere in the Atlantic and a drop of green dye in the Pacific. Then you gradually raise the water level, letting each ocean flood the adjacent shoreline pixel by pixel. (I named the main procedure global-warming.) Eventually, the blue and green waters have to meet; where they do, erect a red wall to keep them from mixing. As the waters continue to rise, extend the wall as needed. The path of the red wall is the continental divide. A simple invariant captures the essence of this process: A pixel p is on the divide if and only if, at some point in the process of raising the sea level, p is still unflooded but has neighboring pixels of different colors.

A crucial requirement of the algorithm is that the two oceans rise synchronously; otherwise, one ocean would reach the site of the divide first and would spill over into the other basin. In a physical model (i.e., real water) you could count on the self-leveling property of liquids to enforce the rule of simultaneity; but computational fluids are not so cooperative. To avoid spillage, I raised the level of each ocean in small increments, equal to the difference in elevation between the current water level and the height of the next highest pixel anywhere in the terrain. To implement this idea, I set up a baroquely complicated structure of stacks and queues, which kept track of all pixels currently at the land-water interface. That’s where the quadratic process intruded. I was adding each neighbor of each coastline pixel to a linked list, and checking in each case that the neighbor wasn’t already present in the list. Checking for uniqueness in an unordered list of n items takes n steps, and performing the check for each of the n items therefore takes n2 steps.

(Actually, the n in this formula is not the total number of pixels in the map. It’s the number of pixels on the waterline. How large could a list of waterline pixels grow? The length of the list is given by the fractal dimension of the coastline! That dimension is likely to be somewhere near 1.5. For the 7,200-by-3,000-pixel map, we can expect roughly 300,000 pixels in the list. Thus the situation isn’t quite as bleak as 21,600,0002, but the square of 300,000 is still a large number; I don’t want to have to do anything 300,0002 times.)

There are lots of remedies for this problem—a sorted list, a priority queue, a hash table. I wound up taking a really easy way out: I just ignored the problem. It’s much quicker to store a pixel twice and then discard the duplicate than it is to check in advance for uniqueness. No pixel can appear in the queue more than four times (once for each neighbor), and so the cost in wasted space is tolerable. The data structure I finally settled on is an array of stacks, with one stack for each elevation in the map. (There are roughly 4,000 distinct elevations, measured in integer units of one meter.) Pushing a pixel onto a stack is an O(1) operation. Popping a pixel off the array of stacks is usually O(1) as well, although there’s an additional small cost when we take the last pixel at a given elevation and have to search for the next occupied level.

When Bill Casselman proposed the Notices cover, I set out merely to revise and tidy up my old program, but inevitably I succumbed to the urge to scrap it all and start fresh. The new program is written in Common Lisp rather than Scheme—not for any deep reason, just because that’s the flavor of the week. The new version sifts through the 21-megapixel map in a matter of seconds, although writing out Postscript images at various stages in the process (125 megabytes each) takes somewhat longer.

So what about that Wyoming bulge? When I first noticed the discrepancy between the older and the newer maps, I thought I knew exactly what was going on. Some diagrams of the continental divide show a “bubble” in Wyoming, where the divide itself divides, and then the two branches rejoin, enclosing a basin that doesn’t drain into either ocean. I immediately guessed that my program was following one edge of the bubble with the low-resolution data and taking the other path with the high-res data. But that’s not the answer. The bulge is too large and in the wrong place. Here are details extracted from the two maps (one has been enlarged and the other reduced):

bulge-detail.jpg

The bubble, if it exists, is roughly in the position of the yellow dot, southeast of the prominent white “eyebrow” formed by the Wind River Range. In the low-res map, the divide takes an excursion 150 kilometers farther west, circling another major basin. I’ve seen no published sources that support this geography; all maps show the divide following the Wind River Range, as in the high-res image at right above. River drainage patterns on standard maps also make it clear: Streams on the western slopes of the Wind River Range flow into the Green River and then into the Colorado. Thus if you make a rest stop at Marbleton, Wyoming, near the center of these images, what you leave behind will end up in the Pacific, not the Atlantic. And so I have to conclude that my 2000 map of the divide, which has now been published three times (in American Scientist, in Group Theory in the Bedroom and on the cover of the Notices), takes us down the wrong path in this region.

The cause of the error could be a bug in my program—it wouldn’t be the first time—but I don’t think that’s what’s going on here. I believe that both maps correctly describe drainage patterns on the landscapes they represent, but those landscapes are not the real North American continent; they are checkerboard arrays of absolutely level, absolutely square tiles. Smaller tiles capture more detail than large ones. In particular, the 100-square-kilometer tiles of the ETOPO5 map flatten out small topographic features that might form a barrier to the flow of water. With closer scrutiny, perhaps I’ll be able to identify the specific site of the leak.

Interestingly, the “bubble” explanation that I hoped would account for differences between the 5-minute and the 30-second maps does appear to work in the case of the 1-minute and the 30-second maps. The image below superimposes the divides calculated by the algorithm for these two data sets, with the 1-minute map in blue and the 30-second map in red:

compare-1min-30sec.jpg

The paths diverge just southeast of the Wind River Range, forming a bubble very much like the one on standard topo maps. There’s another smaller bubble to the northwest, in the Yellowstone caldera. Everywhere else, the two trajectories agree to within a pixel or two.

If the Wind River bubble is a real feature of the North American landscape, shouldn’t a divide-finding program detect it? In the global-warming algorithm, if you start out by dyeing one pixel blue and another green, you are guaranteed to get a two-coloring of the entire map. The expansion of one color cannot stop until it butts up against the other color; there’s no way to leave an uncolored void. The series of diagrams below shows what happens.

bubble-diagram.png

At the far left is the initial state: a ridge with a bubble in the middle. In the second panel, the blue and green waters are lapping at the rim of the central basin. The barrier is perfectly symmetrical, with all pixels on the rim at the same height. Nevertheless, the symmetry is necessarily broken when the sea level is raised still further. The basin is filled either with blue or with green, and the divide runs only on one side; the result looks like either the third or the fourth panel. The choice is essentially random, determined by the sequence in which the pixels are flooded. The only way to create a bubble in the divide is to deliberately implant a seed for a third color in the interior of the basin, as shown at the far right.

The 2000 version of the program was hard-wired for two colors only. It knew about the Atlantic and the Pacific but had no provisions for other bodies of water. In the rewrite I allowed for additional basins. Below is a detail of a high-resolution map seeded to create independent watersheds in the Wyoming bubble and also in two other basins that don’t currently have significant natural drainage to an ocean.

western-basins.jpg

Here’s the computed outline of the bubble at 100 percent magnification:

bubble-closeup.jpg

It looks plausible enough, but I need to add a caveat: Just as the two-color version of the program is certain to produce a two-colored map, starting with three colors ensures that the map will show three watersheds, whether or not they exist on the landscape. For that bubble to be a real feature, under the strictest definition of a divide, the two branches of the divide must have the same minimum elevation. If two hikers walked along those paths, each carrying an instrument that records their minimum altitude, the readings would match when they reunited. The fact that the flooding algorithm shows a bubble in the divide does not guarantee that this condition is satisfied. Seeding a third color anywhere on the map will produce a basin of that color.

Quite apart from the whole question of continental divides and mathematical analysis of terrain, I have become enchanted by the ghostly view of landforms that emerges from a simple grayscale rendering of the high-resolution elevation maps.

s-appalaichia.jpg

Above is a region of southern Appalachia, where the feathery dendritic traces of drainage channels seem to be inscribed as positive images in some areas and negatives elsewhere.

volcanoes.jpg

On the west coast (above), we see more dramatic drainage patterns (the big river near the top is the Columbia) and also a glorious string of volcanoes lit up like a constellation in the night sky, from Mount Shasta at the bottom to the diamond configuration of Mount Hood, Mount St. Helens, Mount Adams and Mount Rainier at the top.

new-england.jpg

Finally there’s this disturbing image. I don’t think I would be able to identify the location if it weren’t for the telltale forms of Long Island and Cape Cod toward the bottom. The deep north-to-south fissure at the left is the Hudson River Valley; the even bigger diagonal gash at the top is the St. Lawrence. Together they cut off the New England states and the Atlantic provinces of Canada, which seem not to belong to the same continent as the rest of us. What we’re seeing here looks for all the world like a great mass of bedrock scraped clean by glacial action, but remember that this is not a photographic image or anything like one; it’s an elevation map. Most of what looks like polished bare rock is actually forested terrain.

Also intriguing is a line of white dots—they look like puffs of smoke—in the St. Lawrence valley. I thought at first they must be some artifact in the data, but a glance at satellite photos shows they’re quite real. They are the Monteregian Hills, isolated outcroppings of igneous (but not volcanic) rock exposed as softer surrounding material has washed away. One of the dots is Mont-Royal, the hill from which Montreal takes its name.

All these features are best appreciated in the full, high-resolution images. Frustratingly, I can’t show those here. I’m sure I’ve already transgressed on the bounds of blog politeness with this long and bulky post; adding a 125-megabyte image file would get me beaten up by the bandwidth bullies. The best I can offer is a pair of images that are full size in terms of pixel count (7,200 by 3,000) but have been JPEG compressed to a more manageable size. Click to download the unadorned 30-arc-second base map (1.2 MB) or the final continental-divide diagram, with bubbles and extra basins (1.8 MB). Almost forgot: There’s also an animated GIF that I did back in 2000. And if anyone wants it, I can post the Lisp source code.

There’s still more that might be done on the theme of continental divides. One project I have in mind is to add the rest of North America. I hear there’s a whole nother country up there beyond the 49th parallel, with drainage into the Arctic Ocean.

Update 2009-02-11: Thanks to all those who asked about the source code. I’m grateful not only for the interest but also for the nudge to properly document the program, which I would not have done if left to my own devices. Of course I found another bug. (The very last one, surely!) Because of an off-by-one error, the program was ignoring the very highest pixels in the landscape.

Anyway, the source code is now online here.

Quick responses to some of the comments:

Thanks for the tip on the Shuttle Radar Topography Mission data. It looks like a fabulous resource for some project focused on a smaller area, but I’m not quite ready to bump up the data volume by another factor of 900.

Graph cuts for image segmentation: New to me. Evidently I need to do some reading. Thanks for the pointer. (I am aware of connections to other work on image segmentation; this was discussed at the end of my 2000 article.)

The idea of inundating the landscape and then doing the analysis as the water drains away is intriguing, but I can’t see quite how to make it work. At the instant when the first peak of land emerges, the waters extend continuously in every direction. How do you deduce which area drains to which basin? Or am I missing something obvious?

On the problem of demarcating boundaries between oceans: The traditional definition of a topologist is someone who can’t tell a donut from a coffee cup. Now we have a new definition: Someone who goes swimming in Asbury Park and expects to come out of the water in Malibu. Yes, mathematically there’s no clear boundary between the Atlantic and the Pacific; all the waters communicate, and there’s really just one world ocean. But some of us can still tell the difference between New Jersey and California.

Norway: The NGDC data covers the globe, so you should be able to do a Scandinavian map. As for the lake that drains to both coasts, the weirdest geographic fact I know is something I learned from a reader of my American Scientist column: I’m told that you can take a boat up the Orinoco River in Venezuela and without ever leaving the water cross over into the Amazon basin, continuing downstream back to the sea in Brazil. The continental divide is underwater!

To Brooks Moses on the idea of automatically identifying basins: There are some corner cases where you need to be cautious. For example, what happens when the Atlantic and the Pacific both reach the rim of a basin at the same time? Nevertheless, I think you’re right, and a careful implementation of your idea would work as you describe. But is the output useful? In the low-resolution (5 minute) map of the U.S. there are at least 10,000 independent watersheds that your algorithm would identify and color. There would be even more in the high-res data. The spiderweb of divides becomes so dense that you learn nothing about overall drainage patterns. For other applications—especially in image analysis—finding segments without the need for manual seeding is what it’s all about, but the situation is not so clear for the peculiar problem of geographic watersheds.

Disappearing nablas, disappearing web sites

Friday, January 9th, 2009

Earlier today Barry Cipra sent the following note as a comment to the post jsMath:

Website weirdness: The nabla symbol has suddenly disappeared from the equation on my browser (Safari, on a MacBook). It was there the last time I checked (a day or two ago), but now… gone!

With the generous help of Davide Cervone, author of jsMath, I’ve learned that the disappearing-nabla trick is not magic, and it doesn’t really have anything to do with jsMath. Barry is right that the character was present last week and that it was gone this morning. What happened in between is that bit-player.org (and my other web sites, industrial-landscape.com and grouptheoryinthebedroom.com) were transferred to new server hardware by the company that hosts my little Internet empire. It appears that several hundred files were left behind. Thanks, Bluehost.

Among the missing files, it seems, were some holding the PHP code that runs this blog. As a result, I couldn’t log in or even post an explanation of what had gone wrong. The only remedy was to restore a backup from before the migration. This has now been done, and I can confirm that the nabla is back. I need to check to see what other damage might remain. I’d be grateful for reports of any further weirdness: brian@bit-player.org

The Zune Bug

Saturday, January 3rd, 2009

The last day of 2008 was the day the world stood still for Microsoft Zune music players. First-generation Zunes—those with 30-gigabyte disk drives—went silent everywhere on December 31. The cause was soon traced to calendrical code in the device’s firmware. The bug is an interesting one, if only because all the details, including the source code, immediately came to light. Explanations were posted in the Zuneboards forum, at Ars Technica, at ProgramPhases.com and elsewhere. A copy of the C source module (which bears copyright notices from both Microsoft and Freescale Semiconductor) was posted here.

When a high-profile bug like this one comes along, it’s tempting to heap abuse and scorn on the hapless programmers who released the faulty code into the wild. And I do think this is an error that should have been caught in testing or during a code review. On the other hand, I’ve made so many mistakes myself over the years that I’m reluctant to give others a hard time. And I think the main lesson from this incident is not the usual refrain that Microsoft is lame. Rather it’s the even more commonplace observation that software is hard. Subtle traps lie in wait even in what seems a very simple algorithm.

Here’s the code that caused all the trouble:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

The purpose of this routine is to calculate the current year. On entry, the variable days is equal to the number of days since January 1, 1980, which is taken as the beginning of time. On exit, the variable year is supposed to hold the correct year number. (The global constant ORIGINYEAR is set to 1980. The IsLeapYear predicate does just what you would guess.)

The intent of the code is clear enough: We’ll repeatedly decrement the days count—by 365 in common years and by 366 in leap years—and each time we do that, we’ll increment the year number by 1. At the end of the process, the year count should have reached the correct current value. But of course I wouldn’t be writing this if the program always gave the right answer. Can you spot the problem?

Consider what happens when the initial value of days is 366. Since this number is greater than 365, we pass the test at the head of the while loop and enter that block. Since the initial year is 1980, the IsLeapYear predicate returns true, and we enter the consequent clause. Now we try the test (days > 366), which is false, and so we do not decrement days or increment year. In fact, we do nothing at all, because there’s no else clause attached to this if statement. We simply return to the head of the while loop, where we find that days is still greater than 365, and so we go through the same motions again, and again, ad infinitum. This is just what happened on December 31, 2008, which was day 10,593 in Zune time. The bug would be triggered on the last day of any leap year; it wasn’t observed before now simply because the Zune didn’t yet exist the last time February had 29 days.

A reader at Zuneboards.com quickly suggested a fix: Replace the test (days > 366) with (days >= 366). As another reader pointed out, this is incorrect. The change eliminates the infinite loop, but the altered routine returns the wrong year on December 31 of every leap year. The reader who spotted this error suggested a different solution, adding an else clause with a break command:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else
            {
                break;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

Still another reader offered this option:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else if (days == 366)
                days -= 366;
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

As far as I can tell from running a few test cases, both of these versions yield correct results, but their logic is anything but perspicuous. Trying to verify correctness by ure reason—as opposed to trial-and-error testing—looks almost hopeless. And then there’s the matter of aesthetics: In my opinion, all three programs are just plain ugly. A task this simple shouldn’t need so many ifs and elses.

Or so I thought. After spending some hours futzing about with the problem, and finding awful bugs in several of my own attempts, I’m not so sure there’s a simple and satisfying solution. Here’s my best effort so far:

    year = ORIGINYEAR - 1;
    while (days > 0)
    {
        year += 1;
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
    }

I suppose I consider this an improvement over the alternatives given above, but I don’t really like it. What annoys me most is the trick of initially setting year to ORIGINYEAR - 1. This violates an implicit assumption that year will always be equal to or greater than 1980. In some programming environments, we could even make that assumption into an enforceable assertion, in which case this code would break.

To avoid introducing the phantom year 1979, we could do it this way:

    year = ORIGINYEAR;
    while (days > 0)
    {
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
        if (days > 0)
            year += 1;
    }

But, to my taste, repeating the (days > 0) test is even more offensive than starting the program before the beginning of time. And both of these routines can leave the variable days with a negative value, which is also nonsensical.

So I put the question: Is there a better way? Can anyone come up with the One True Algorithm for calculating years from elapsed days?

Three further notes:

1. In many counting problems there’s doubt about how to begin. Is the first day of January in 1980 to be interpreted as day 0 or as day 1? Nothing I could find in the original code resolves this ambiguity. When I first looked into the matter, I guessed that counting began with day 0 because the Zune routine correctly returns year 1980 for day values from 0 through 365. But if the zero-based assumption were correct, the Zunes would have died on January 1, not December 31. Evidently, we’re counting from 1.

2. For the record, I haven’t actually tested any of the C code above. I’ve been translated to and from Lisp. Here’s the Lisp version of the 1979 routine:

    (defun my-zune-year (day-number)
      (let ((year 1979) (day day-number))
        (loop while (> day 0)
              do (incf year)
                 (if (leap-year-p year)
                   (decf day 366)
                   (decf day 365)))
        year))

3. Elsewhere in the Zune firmware, there’s another routine that raises questions about programming style and practice. Here’s the implementation of the IsLeapYear function:

    static int IsLeapYear(int Year)
    {
        int Leap;
        Leap = 0;
        if ((Year % 4) == 0) {
            Leap = 1;
            if ((Year % 100) == 0) {
                Leap = (Year%400) ? 0 : 1;
            }
        }
        return (Leap);
    }

The code implements the rules of the Gregorian calendar: A year is a leap year if it is divisible by 4, unless it is divisible by 100 but not by 400. So far so good. Elsewhere in the program, however, we find these declarations:

#define ORIGINYEAR 1980               // the begin year
#define MAXYEAR    (ORIGINYEAR + 100) // the maxium year

Thus it turns out the Zune has a fixed lifespan of 101 years, from January 1, 1980, through December 31, 2080. Throughout that interval, leap years can be detected by the simple test ((Year % 4) == 0), without worrying about centuries. (Many systems have been designed to work from 1901 through 2099 precisely in order to take advantage of this shortcut.) Given the MAXYEAR declaration, should the Zune have adopted a simpler leap year algorithm? Or have the authors done the right thing by putting the full logic into the code, just in case someone later decided to extend the MAXYEAR deadline past 2100?

Zimaths

Tuesday, December 16th, 2008

A friend pointed me to Zimaths, a magazine for high school mathematics students, published in Zimbabwe in happier times. (The last issue available online is from 2005. The next-to-the-last issue has a helpful tutorial on inflation, which was then running at about 600 percent per year in Zimbabwe. The rate has gone up considerably since then, and lately the country has had even worse troubles.)

Looking back to the first issue of Zimaths, from October of 1996, I happened upon a list of problems from the Zimbabwe Maths Olympiad of that year. Problem 5 looked like something I might be able to do in my head:

The number a is randomly selected from the set {0, 1, 2, 3, …, 98, 99}. The number b is selected from the same set. What is the probabilty that the number 3a + 7b has a units digit equal to 8?

Clearly, all positive integer powers of 3 and 7 are odd numbers not divisible by 5, and so they must end in 1, 3, 7 or 9. Running through the first few of those powers (1, 3, 9, 27, …; 1, 7, 49, 343, …) suggested that all possible terminal digits appear with equal probability. The sum of two such odd numbers must be an even number. It seemed like a good bet that all even terminal digits would be equally likely—that is, each of 0, 2, 4, 6 and 8 would appear with probability 1/5. I checked this hypothesis by listing all possible pairs of the four allowed odd digits (with addition done modulo 10):

1 + 1 = 2
1 + 3 = 4
1 + 7 = 8
1 + 9 = 0
3 + 3 = 6
3 + 7 = 0
3 + 9 = 2
7 + 7 = 4
7 + 9 = 6
9 + 9 = 8

Sure enough, in this list of 10 combinations each even digit appears twice, confirming that the probability of seeing an 8 should be 0.2. If I had been competing in the Olympiad, I would have written down that answer and gone on to the next problem without a further thought.

Later in the day, though, I was looking for something I could use to test a new toy—the latest version of Mathematica. Here’s what I came up with:

f[] := Mod[3^RandomInteger[{0, 99}]
         + 7^RandomInteger[{0, 99}], 10]
BarChart[Part[#,2]&
    /@ Sort[Tally[Table[f[], {100000}]]]]

This generates 100,000 values of 3a + 7b mod 10 and tallies them up:

Zisums.png

So I’ve discovered a bug in Mathematica, right? I know there’s nothing wrong with my program, and as for a mistake in my thinking—well, that’s just unthinkable.

Update: Thank you, commenters.

Svat’s question — “How do you think that error could have been avoided?” — goes right to the heart of the matter, but I need to start with a prior question: “How do I know which of my answers is the erroneous one?” When a computer program and my own reasoning yield different results, how should I resolve the conflict? Where should I look first for a mistake?

I don’t expect to find a universal and eternal answer to this question. On the contrary, everything depends on one’s personal equation. Some of us are better at analytical thinking and some of us are bit-players. Nevertheless, in this case I think it’s pretty clear that the computer calculation relies on fewer and simpler assumptions. It asks us to trust the random number generator and certain arithmetic operations, but if we grant that the implementation of those functions is correct, the program can hardly go wrong. Its structure follows directly from the problem statement. (Of course there’s also statistical uncertainty, but if we’re patient enough, we can reduce that to any level we please.)

The probability analysis depends crucially — as I discovered to my chagrin — on properly counting all the cases. My mistake was a very elementary one; I suspect it is also a very common one. It was when I saw the empirical results, with probability values clustered near 3/16, that I realized I should be looking at 16 cases rather than 10.

The point I want to make here isn’t about mathematical truths but about the art of problem solving — even when, as in this case, there’s no artistry involved. Barry’s variation on the problem (see the comments) offers me another sterling opportunity to make a fool of myself. So I present below, in first-person present-tense stream-of-consciousness diary style, a record of my wary engagement with that problem.

  • First, why the shift from {0..99} to {1..100}? Because including zeroth powers would contaminate each term in the sum with a 1. For example, 2n mod 10 is {2, 4, 6, 8} for n={1..100}, but for n={0..99} there’s a 1 percent chance of turning up a 1. That would make the sum much messier.
  • 1n and 5n are easy. After a further moment’s thought: So is 6n. So every sum will include a term equal to (1 + 5 + 6) mod 10, or in other words 2.
  • 4n is {4, 6} and 9n is {1, 9}. We can combine them — but carefully, now, let’s not make the same mistake again — to produce a single term of {3, 5, 5, 7}.
  • We already know that 3n and 7n both yield {1, 3, 7, 9}. And 2n and 8n generate {2, 4, 6, 8}. Therefore the whole sum is
    {2}1 + {3, 5, 5, 7}1 + {1, 3, 7, 9}2 + {2, 4, 6, 8}2,

    where the superscripts indicate the number of elements we are to choose (independently at random with replacement) from each of these sets. (Not sets, really. Bags.)

  • Parity: There are an odd number of odd terms, so the sum will be odd.
  • Is there some cute number-theoretical trick that’s going to make this easy? I don’t see it.
  • Stuck. Back to the computer to do it the unclever way:

    g[]:=Mod[Sum[i^RandomInteger[{1,100}],{i,1,9}],10]

    Tallying 100,000 repetitions of this function produces the following result:

    barrysums.png

    Repeating the computation a few times more suggests that the slight excess of 7s is not statistical noise. So now I think I know the answer, but I don’t know where it came from.

  • Back to the probability calculation. The 4 × 4 sum matrix for the original problem was easy (once I knew what to do with it). This one is 1 × 4 × 4 × 4 × 4 × 4. There must be a shortcut; that’s too much to do with pencil and paper.
  • [After a break.] Aha: The sum matrix for {2, 4, 6, 8}2 is exactly the same as the matrix for {1, 3, 7, 9}2 (after permuting some columns and rows). Thus the sum can be expressed as:

    {2}1 + {3, 5, 5, 7}1 + {1, 3, 7, 9}4.

    However, this still looks like a 4 × 256 matrix, which exceeds my patience.

  • Back to the computer again, this time to do all those little sums. What I want is something along the lines of:

    Outer[PlusMod, {1, 3, 7, 9}, {1, 3, 7, 9}],

    where Outer is like a cross product; but instead of multiplying it applies the PlusMod procedure, which I define to do addition modulo 10. The result returned by the expression above is the answer to the original Zimbabwe Olympiad problem:

    {{2, 4, 8, 0},
     {4, 6, 0, 2},
     {8, 0, 4, 6},
     {0, 2, 6, 8}}
    

    For Barry’s more-elaborate variation the full program (without shortcuts) is:

    Sort[Tally[Flatten[
      Outer[PlusMod, {1}, {2, 4, 6, 8}, {1, 3, 7, 9},
         {4, 6}, {5}, {6}, {1, 3, 7, 9},
         {2, 4, 6, 8}, {1, 9}]]]]
    

    The Outer[] expression yields 1,024 values, which are then tallied up. The returned value is:

    {{1, 204}, {3, 204}, {5, 205}, {7, 206}, {9, 205}}

    This appears to be our answer. Barry asked about the frequency of the digit 5: It is 205/1024. The 7 digit is most common by a slight edge, at 206/1024.

  • Going back to the simulation program and running 1,024 × 100,000 iterations to get better statistics yields this result (about an hour later):
    {{1, 20400852},
     {3, 20396224},
     {5, 20501004},
     {7, 20601096},
     {9, 20500824}}
  • In the end I still wonder if there’s an easier way. Barry?