Pilgrim’s Progress

8 May 2010

In the past few weeks I’ve had little time for bit-playing; I’ve been playing with atoms instead. I’ve been sorting and packing and toting atoms, then hauling them, rearranging them, offloading them, storing them. Lots and lots of atoms: maybe 1029. Bits are so much easier to handle. My bitly possessions—some hundreds of gigabytes—fit comfortably in a shirt pocket. My atomic chattels are a bulkier burden.

I dreamed, and behold I saw a Man clothed with Raggs standing in a certain place, with his face from his own House, a Book in his hand, and a great burden upon his Back.

All this atom-pushing was done in the course of moving my household from Durham, North Carolina, to Cambridge, Massachusetts. I’ve moved before, but this time the experience was unusually physical. In vacating my Durham home, I carried all my belongings out of the house and onto a truck—and I did it without helpers and without the use of carts or dollies or other wheeled implements. In other words, everything I own (except a car) I have now lifted up, cradled in my arms, carried 50 feet or more, and set down again. It took a day and a half.

What a goofy thing to do, eh? I even turned away offers of help. I guess I’m just a do-it-yourself kind of guy. In a strange way the labor was worth it: The process gave me a vivid and visceral sense of how stuff accumulates over a lifetime—how my possessions possess me. Late in the afternoon of the second day, as I loaded up the last few items and shut the door of the truck, there was a bright glow of satisfaction and accomplishment.

Yet I never want to do it again. When I arrived in Cambridge, I did not refuse the generous help of a younger, stronger friend. (Thanks, Mici!)

The total weight of my load was roughly two and a half metric tons. At least two tons of that mass consisted of “information goods”—books, periodicals, manuscripts and proofs, file drawers full of paper documents, photographs, musical recordings in various formats, art works. And this was the residue remaining after a two-year effort to lighten my load, mainly by transforming atoms into bits. In particular, I had scanned 22 drawers full of files, converting paper into PDFs and then recycling all the cellulose.

Before I lug my belongings out of this dwelling, I vow to jettison another ton or more. If only I could figure out how to digitize my clothes or my pots and pans.

As for my new home, everyone knows that Cambridge is the intellectual capital of North America. But I didn’t quite realize how high the standard had become. The sign in the photo below is on the garden gate next door.

Dogs: Keep Gate Closed

The literate local canines seem to comply with this order, since the gate is always closed.

Which Steve invented the iPad?

30 March 2010

UIUC-tablet-winners.png

Maybe Wolfram? Or Omohundro and Skiena?

Back in 1988, a contest sponsored by a major computer company asked for visions of a future personal computer. The winning team came from the University of Illinois at Urbana-Champaign. The faculty advisers were Stephen Wolfram and Stephen Omohundro (at far left in the photo above); the student members (from left to right) were Arch Robison, Steven Skiena, Bartlett Mel, Luke Young and Kurt Thearling. Their proposal of a device called the TABLET had eerie anticipations of the contraption that goes on sale this weekend.

Some quotes from the description of the TABLET:

This rectangular slab will weigh but a few pounds, and have no buttons or knobs to play with. The front surface will be a touch-sensitive display screen and will blink to life upon touching two corners.

The front surface of our computer is a high-resolution touchscreen that yields slightly to the touch. With this single input device, we can get a tremendous range of flexibility and options. We can use it to create an entirely soft interface.

A high-resolution color display can do more than just imitate a notebook page. It will be fast enough to support video…. It takes only a little more courage to predict a Global Positioning System (GPS) receiver on our machine, either as a clip-on or a built-in component.

If we can take our computer anywhere, we need to be able to use it anywhere. This brings us to communications capabilities. Through our national telephone network, we can access any person or machine within seconds.

It seems the UIUC prophets even saw Facebook coming over the horizon:

In addition to communicating with peripherals… we can also talk to other computers. Each machine can continually broadcast personal facts that users may wish the world to know: perhaps their name, image, interests, and marital status for openers.

And YouTube:

One interesting problem is who will appreciate all this new art? Some form of “shareware video” might arise.

Of course they didn’t get everything right. There’s the matter of the stylus and handwriting recognition (suggesting a foreknowledge of the Newton rather than the iPad). And it was all supposed to happen in the year 2000, so we’re a decade behind schedule.

Which computer company sponsored this contest? It was Apple, of course. The judges were Ray Bradbury, Alan Kay, Diane Ravitch, Alvin Toffler and Stephen Wozniak. (Steve Jobs was in exile at the time.) All of the prizes were paid in Macintosh merchandise.

tablet-teardown.pngThe story of the 1988 predecessor of the iPad is told in Communications of the ACM, Vol. 31, No. 6, pp. 639–646. Included there is the first teardown photograph of an iPad (right). Note the wafer-scale integration. For those who can’t get past the ACM pay-wall, a slightly different telling of the story appears on Kurt Thearling’s web site. There’s also a brief account from the UIUC computer science department.

[Just to be clear: I am aware that there's a considerable difference between imagining an innovative device and actually building it.]

Fake fits

30 March 2010

A few weeks ago I committed an act of deception. I needed a graph showing a few sets of data points, with curves fitted to them. This was strictly for show; neither the points nor the curves would mean anything; it just had to look good. So, out of laziness, I did it backwards: I drew a few arbitrary curves, chose some points on the curves, and then perturbed those points by small random displacements. The result is Figure 2 in Home Baked Graphics.

Apparently I have gotten away with this fakery, since no one has complained. But it nags at my conscience. The curves shown are not fitted to the given data points, but the other way around. And barring an extraordinary coincidence, we can’t expect the curves to give the best possible trend through the points.

I began to wonder: What if you iterate this process? In other words, you start out by generating a curve, say a simple quadratic:

wandering-lines-stages-1.png

You select a few points on the curve, in this case the five points with x coordinates equal to {1, 2, 3, 4, 5}. You perturb the y coordinates of these points by choosing random variates from a normal distribution with unit standard deviation:

wandering-lines-stages-2.png

Now you do a least-squares fit to the five selected points, generating a new quadratic:

wandering-lines-stages-3.png

At this stage you can start over, selecting five points on the new curve and randomly displacing them:

wandering-lines-stages-4.png

And a least-squares fit to those five purple points yields yet another quadratic curve:

wandering-lines-stages-5.png

So what happens if we continue in this way, repeatedly jiggling a few points on a curve and then fitting a new curve to the jiggled points? There’s a seeming hint of convergence in the three curves we’ve seen so far. They look like they might be approaching some fixed trajectory. Could that possibly be true?

I’ll post an update in a few days to say what more I’ve been able to learn about this question. If you just can’t wait, click here for a sneak peek.

The promised update (2010-04-02): You can’t fool anybody in this crowd, not even circa April 1. Of course there is no convergence to a limiting curve.

badhairlarge.png

The butterfly above shows 100 quadratic curves, selected from 100,000 iterations of the least-squares fitting process, with that process applied at points whose x values are in the set {–5, –4, –3, –2, –1, 0, 1, 2, 3, 4, 5}. Clearly enough, the coefficients of the quadratic curves are wandering widely; in particular note that the x2 term can be either positive or negative, yielding curves that are concave upward or downward.

On the other hand, these are not just any old quadratic curves. They all have their apex (the point where f′(x) = 0) fairly close to the origin. Here’s a closer view:

badhairdetail.png

(When Ros glanced at this graph on the screen, her comment was: “Bad hair day.”) And here’s a tracing of the wandering of the apex (taken from a different, longer, run of the program):

wandering-apex.png

The path has the look of a classic random walk, but note that it’s confined to a narrow range of x values. What attractive force holds the apex in this region? It’s the decision to fit the curves to points in the interval –5 ≤ x ≤ 5. Here’s what happens if we instead fit to points in the range 100 ≤ x ≤ 110:

skewedcurves.png

And note that without the symmetry-breaking x1 term, the curve would always be a parabola with its apex on the y axis; only the curve’s shape (or scale) and the height of the y intercept could vary.

This whole process of repeatedly fitting curves to data and then deriving new data from the fitted curves is weird and useless enough that it seems unlikely to appear anywhere in the universe outside my own idle mind. And yet…. Think of the Dow-Jones index, which is calculated as a weighted average of the prices of 30 selected stocks. They are not always the same 30 stocks. Indeed, recent years have seen quite a lot of turnover in the components of the Dow: AIG was replaced by Kraft Foods; General Motors was dropped in favor of Cisco Systems; Honeywell lost out to Chevron. Would it be too utterly cynical of me to suggest that what’s going on here resembles a process of fitting a curve to randomly chosen points, then choosing more points that lie near the curve?

Statistical error

22 March 2010

Tom Siegfried, the editor of Science News, has published a blistering indictment of statistical methods in science and medicine. I am moved to speak for the defense.

Siegfried discusses a number of specific cases, mainly drawn from the biomedical literature, where faulty statistical reasoning has led to unreliable or erroneous conclusions. I don’t want to quibble over the particulars of those cases; I’ll concede that science provides plentiful examples of statistical analyses gone wrong. Indeed, I could add to Siegfried’s list. But I see these events mainly as failures to use the tools of statistics properly; Siegfried suggests that the problem goes deeper. If I understand him correctly, he believes that the tools themselves are defective and that science would be better off without statistics. Here is how he begins his essay:

For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation.

This argument strikes me as so totally wrong-headed that I have a hard time believing Siegfried is serious about it.

To begin with, the snide remarks about Las Vegas and crapshoots are off-target. The branch of mathematics with roots in the study of gambling is not statistics but the theory of probability. The two fields are closely allied, but they’re not identical. Early statistical ideas came out of astronomy and geodesy, with later developments in the social sciences, genetics and agriculture. If you really must find some vaguely disreputable locale for statistics, the apt choice is not the casino but the brewery (a notable Student of statistics worked for Guinness).

More disturbing than this minor historical flub is Siegfried’s vision of a lost golden age of “rigorous mathematical methods,” debased by the seductive wiles of statistics. I don’t believe there was any such fall from grace. Siegfried doesn’t tell us much about the nature of his prestatistical mathematical paradise, but since he mentions Galileo and Newton, I suppose he may be thinking of classical mechanics as an exemplar of lost innocence. It’s true that the study of planetary orbits and ballistic trajectories does offer up some pithy mathematical laws that purport to be exact descriptions of nature:

mechanics-eqns.png

We don’t usually attach error bars to these expressions, or hedge our bets by saying “Force is equal to mass times acceleration within one standard deviation.” But where do such “exact” laws come from? When Galileo performed his experiments with balls rolling down an inclined plane, the measured data did not exactly conform to a parabolic trajectory. Likewise with Newton’s inverse-square law: No real-world observations precisely follow the form 1/r2–not unless the experiment has been fudged. Making the leap from experimental data to mathematical law requires a process of statistical inference, where we extract some plausible model from the data and attribute any departures from the model to measurement error.

In the time of Galileo and Newton, tools for statistical inference were crude; by the time of Gauss, they were much sharper. In 1801 the newly discovered planetoid Ceres was observed for 41 days before it was lost in the glare of the sun. Astronomers hustled to predict where and when it would reappear in the sky. Among all the attempts, the clear winner was the prediction of Gauss, whose advantage in this competition was not so much superior astronomy as superior statistics. His secret was the method of least squares, which he later backed up with a comprehensive theory of measurement error, introducing the idea of the normal distribution.

Later still, statistics had a role in showing that the “exact” mathematics of Newtonian celestial mechanics is not exact after all. It took careful observations–and careful statistical analysis of those observations–to quantify a tiny anomalous precession in the perihelion of Mercury, explained by general relativity but not by classical gravitation.

Statistics is no “mutant form of math”; it’s the way that science answers the fundamental and inescapable question, “How do we know what is true?” I really can’t imagine how science could survive without statistics. What would replace it? Divination?

Siegfried complains that statistical tools offer no certainty–that when a result is reported as statistically significant at the 1-sigma 2-sigma level (or in other words with a P value of 0.05), there’s still a 1-in-20 chance that it’s a meaningless fluke. Quite so; that’s essentially the definition of a 2-sigma P value. But the uncertainty is not some methodological malfunction. It reflects the true limits of our knowledge. The strength of statistical reasoning is that it makes those limits explicit.

Again, I’ll readily agree that standards of statistical practice should be strengthened, and that weak or faulty conclusions are too common in some areas of the published literature. But the claim that “any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical” is, in essence, illogical. Yes, we have lies, damn lies and statistics. But we also have lies and damn lies about statistics.

A double flip

17 March 2010

All my closest friends know about my strange obsession with the mathematics of mattress flipping. A few thousand other people also know my secret, since I have written about it in an American Scientist column (HTML, PDF), which later became a chapter in a book.

To recapitulate: Fussy housekeepers rotate their mattress twice a year to ensure even wear. But a mattress has three axes of twofold symmetry (roll, pitch and yaw). Rotating around the same axis over and over will not cycle through all four possible states of the mattress, so you need to vary the procedure. Perhaps you do a roll one time and a yaw the next. But in the spring you may have forgotten which way you turned last fall. Thus the quest for a “golden rule” of mattress flipping:

A golden rule of mattress flipping would be some set of geometric maneuvers that you could perform in the same way every time in order to cycle through all the configurations of the mattress. Following this algorithm might entail extra physical labor on each occasion–perhaps making multiple flips or turns–but at least it would eliminate the mental effort of remembering.

The rules of this peculiar game forbid putting any marks on the mattress (which would break the symmetry). If we abide by this constraint, the multiplication table for the group of flip operations looks like this:

2005-09-F2-Klein-table.png

R, P and Y indicate half-turns around the roll, pitch and yaw axes; I is the identity or do-nothing operation. This table is bad news for golden rules. We already know that no single operation will cycle through all four states of the system, and the table shows that every combination of two operations is equivalent to some single operation. Hence there is no golden rule.

There the matter stood until a few days ago, when I received a letter from Tim Knoll:

I just completed reading your collection of columns entitled “Group Theory in the Bedroom” and I wanted to offer up an alternative solution to the mattress flipping problem. I believe if you add one more operation to the mattresses you can achieve your “golden rule.” What you need to add is a second bed to allow a shift operation in addition to the rotations. If you have the second bed oriented in the opposite manner as the first bed (with the first having the headboard facing north, the second facing south) you can simply have an operation of a non-rotating shift from the first (north-facing) bed to the second (south-facing) and then do a shift in addition to a rotate about the pitch axis from the south-facing bed to the north-facing bed. This mattress juggling should also achieve the desired results using the roll axis in the second step. I unfortunately can’t implement this method in my own apartment since it is equipped with one queen-sized bed and one full-sized bed, but my tests using an index card seemed accurate.

Ingenious, no? But is it truly a golden rule? I’m going to leave that question for readers to decide.

Knoll also points out that even if the new “shift” operation doesn’t yield a golden rule, it does succeed in getting two mattresses flipped with the same mental effort that would otherwise be needed for one mattress. This raises a further question: Why stop at two beds? What about mattress flipping in a barracks, where many beds are lined up in a row? And then there’s the job of flipping mattresses in the Hilbert Hotel, which has infinitely many beds. Does the mental effort per mattress go to zero in this limit?

Home-baked graphics

9 March 2010

A couple of commenters have asked what software package I use to create the graphs that appear in bit-player posts–illustrations like the one below, which is a slightly improved version of something I posted last week. Let’s call it Figure 1.

rms-graph2-revised.png

Prompted by these inquiries, I immodestly ask myself: Why do my graphs look so darn good? I immodestly answer: It’s not because of any packaged software! I don’t need a cake mix, or even a recipe. These are home-baked graphs, made from scratch out of locally grown organic pixels.

I have strong opinions about the aesthetics of scientific illustrations, and I could certainly spout off about the design elements of Figure 1, such as that putty-colored background, just dark enough to allow drop-out white grid lines, yet neutral enough to avoid competing with the data curves, which also have a distinctive color scheme on which I could discourse at length. Yes, I can talk the Tufte talk. But I think the commenters were really asking how I create the graphs rather than why they’re so elegant, and so I’m going to focus here on the practical programming problem.

Most of my experience in drawing pictures with a computer comes from the world of print publishing, where the final product is ink on paper rather than pixels on a screen. Compared with the online environment, print has some advantages, notably higher resolution (up to 1,000 dots per centimeter) and precise control over typography and color. But print also has obvious limitations: On a magazine page, there are no mouseovers or clickable buttons, and you can’t make a square knot twirl in 3D.

Thirty years ago, the big challenge for computer-generated illustrations was not how to draw the picture but how to get it out of the computer and onto the printing press. You couldn’t just export a PDF and place it in a Quark or InDesign document; none of those things existed. The only practical option was to print out the artwork, photograph it, and “strip” the negative into the page-size film that would be used to make the press plate. Because of this emphasis on printouts, most of the effort went into programming the printer rather than the computer.

The figure below is the first published computer-generated illustration I had a hand in creating. It appeared in Scientific American in 1983.

epson-freq-table.png

The array of 282 tiny bar graphs was produced with an Epson MX-80 dot-matrix printer, using escape codes to fire combinations of the eight pins in the printhead. Of course the MX-80 was a black-and-white device. The two-color illustration was created from two separate printouts. Also, the Epson letterforms were replaced with typeset characters.

The world of computer-generated illustrations changed dramatically with the arrival of PostScript, the “page description language” created by John Warnock and his colleagues at Adobe Systems (based in part on earlier work at Evans and Sutherland and Xerox PARC). PostScript was designed as a complete programming language rather than just a file format or a set of drawing commands. And something else set it apart as well: attention to details of graphic design. With most earlier software (such as programs based on the Apple Quickdraw library), trying to create publishable figures was an exercise in frustration. For example, the apparent weight of a line would vary depending on its orientation: lighter when vertical or horizontal, heavier when diagonal. PostScript allows very precise control over such niceties of presentation. To take another example, where lines meet the edge of a graph, you don’t want to have to choose between falling short and overshooting; PostScript provides the tools needed to make it look right.

edge-effects.png

(The version in the rightmost panel is created by allowing the colored lines to extend outside the background box, and then applying a clipping mask that cuts off all objects at the boundary of the box.)

Obsessing over minute details like these may seem comically fussy, but I believe that neatness counts in these matters. To some extent, illustration is an art of illusion. Graphs and diagrams work best when you can look through them rather than at them. The viewer should be seeing the underlying information or abstraction–the array of correlation coefficients, the function y = f(x), or whatever–rather than noticing the mechanics of how the drawing was constructed. A ragged edge is the kind of distraction that destroys the illusion.

Although PostScript was a giant step forward from the MX-80 command set, in the early years it was still just another printer language, not a computer language. The only way I could execute a PostScript program was to send it to a laser printer and wait to see what came out. Sometimes it was a long wait. I had no way of running a PostScript program on the computer itself. (Ghostscript came later.)

ChernoffFaces.pngMy first PostScript illustrations were created as hand-written PostScript programs; the same language was used both for doing the computations and for presenting the results. The faces at right were created in this way. (They were inspired by the work of Herman Chernoff and drawn to illustrate an American Scientist article by Robert Levine in 1990.) The dual role of the language caused me a moment of disorientation just now when I went looking for my records of this project. I found an EPS (encapsulated PostScript) file, which I knew was the finished illustration, but where was the source code? And then I remembered: It’s the same file! Open it up in Ghostscript or Adobe Illustrator and you see those silly faces smiling or scowling at you; open the same file in a text editor, and you see procedures for drawing elements of the faces:

   /draweyes
     { newpath
       dx dy eyewidth eyeheight 0 360 ellipse stroke
       ex ey eyewidth eyeheight 0 360 ellipse stroke
     } bind def
   /drawpupils
     { fx fy pupilsize pupilsize 0 360 ellipse fill
       gx gy pupilsize pupilsize 0 360 ellipse fill
     } bind def

Bill Casselman, the graphics editor of the Notices of the American Mathematical Society, still favors this direct-to-PostScript methodology. He has written an excellent guidebook, taking you from the basics of PostScript through an elaborate library for rendering three-dimensional objects.

But here I part company from Casselman; I’d rather not do all my computing in PostScript. It’s not that I have anything against the language itself, but the development environment is not to my taste. I therefore adopted the modus operandi of writing a program in my language of choice (usually some flavor of Lisp) and having that program write a PostScript program as its output. After doing this on an ad hoc basis a few times, it became clear that I should abstract out all the graphics-generating routines into a separate module. The result was a program I named lips (for Lisp-to-PostScript).

Most of what lips does is trivial syntactic translation, converting the parenthesized prefix notation of Lisp to the bracketless postfix of PostScript. Thus when I write (lineto x y) in Lisp, it comes out x y lineto in PostScript. The lips routines also take care of chores such as opening and closing files and writing the header and trailer lines required of a well-formed PostScript program.

But the lips interface is low-level, confined to drawing individual dots, line segments, rectangles and the like. Assembling a complete graph out of these primitives is tedious. For example, the grid of white lines in Figure 1 would have to be drawn one line at a time, with each line specified by a sequence of commands such as

    (newpath)
    (moveto u v)
    (lineto x y)
    (stroke)

Before you can issue those commands, you have to calculate u, v, x and y. Clearly, a higher-level front end is needed; like everyone else, I call mine plot.

At the core of any plotting program is a simple operation: mapping points from an abstract user space to coordinates in a rectangular pane, the page space. In Figure 1, the y axis runs from 0 to 5000; values in this range have to be scaled to the dimensions of the graph, which is about 300 PostScript points, or 11 centimeters. Mathematically, the transformation is straightforward. Indeed, if I wished I could leave all the arithmetic to the PostScript interpreter, simply passing in the appropriate matrix elements for scaling and translation. This is an attractive option; it would allow plot to work entirely in user space. But a few niggling details get in the way. Consider the tick marks along the y axis in Figure 1. Their vertical positions are conveniently expressed in user coordinates: one tick every 500 units. But what about the length of the ticks–their horizontal extent? This dimension is purely concerned with the appearance of the graph and has nothing to do with the content; it ought to be expressed in unscaled units of points or pixels.

Here’s a possible solution: Let everything inside the rectangular frame of the graph–the area with the putty-colored background in Figure 1–go through the scaling engine, but define everything outside the frame, including the tick marks and the axis labels, directly in page coordinates. If you think this is the final answer, take a look at Figure 2:

figure2.png

In this nonsensical graph (constructed just for this occasion), data points are indicated by stars, crosses and diamonds. The positions of those glyphs ought to be defined in user space, but the drawing commands that create the shapes are properly defined in page coordinates. If we tried to draw the glyphs in user space, their size and shape would vary with position in the graph.

What’s the best way to deal with this messy situation? Is there some tidy solution that will reconcile the two coordinate systems and allow all dimensions to be treated uniformly? I don’t believe so; it’s just in the nature of graphs to mix up elements from these two disparate realms. We look through a window into a world of data or mathematical abstractions, but we also draw our own little doodles on the window itself.

Of course there are solutions; they’re just not as pretty as I would like. My own strategy for coping is to attach extra information to each geometric point, indicating whether or not the x and y coordinates are to go through the scaling transformation. This is less troublesome than it might seem; from the user’s point of view, it’s almost always invisible.

In writing the lips and plot programs, I walk a path that is already worn smooth by many earlier footsteps. I don’t know who wrote the first computer program for plotting data, but it probably came soon after the first program for producing data. Today we have hundreds of clever, comprehensive, well-designed and well-maintained programs for plotting and graphing. Gnuplot is very capable; Grace is one I’ve never used but I’ve heard good things about it; Mathematica, Sage, R, MATLAB, Octave and the like all have elaborate graphics facilities built in; the Python world, as usual, has an overabundance of options; there are a few libraries for my beloved Lisp; you can even do dataviz online.

All of which raises the question of why I bother to roll my own. I’ll never keep up–or even catch up–with the efforts of major software companies or the huge community of open-source developers. In my own program, if I want something new–treemaps? vector fields? the third dimension?–nobody is going to code it for me. And, conversely, anything useful I might come up with will never benefit anyone but me.

The trouble is, every time I try working with an external graphics package, I run into a terrible impedance mismatch that gives me a headache. Getting what I want out of other people’s code turns out to be more work than writing my own. No doubt this reveals a character flaw: Does not play well with others.

In any case, the time for change is coming. My way of working is woefully out of date and out of fashion. PostScript is a technology that even Adobe seems to regard as outmoded. And making ultraprecise PostScript graphs is quite silly when their destination is the web; before I can put them online, I have to convert them to low-res PNG images. Furthermore, a PostScript-based workflow loses out on all the interactive richness of the web. These are deathly still images. How can I expect to earn any web cred when my work is not even clickable, much less multitouch-enabled?

If I continue in my stubborn, do-it-yourself mode, I could replace the PostScript back end with one that generates SVG. This wouldn’t be a major undertaking. But is SVG the right answer? It’s been around for more than a decade and you still don’t see much of it in the wild. And there are horrid browser incompatibilities. I suspect that Javascript (and JQuery) has a brighter future. And if I can get over my abreaction to libraries, there are plenty of options. Advice anyone?

Update 2010-03-13: Many thanks for all the thoughtful comments. Herewith a few comments on comments:

Ron Renaud’s graphs using Javascript in a canvas element are really very pretty, and they give me renewed hope that web graphics can measure up to a print standard. But is the world quite ready for the canvas? This is a blog, after all. Lots of people get to it with an RSS reader, not a web browser.

Zvika requests a link to a higher-resolution PNG. I don’t know how to do that. I can make a larger PNG, but the resolution–the dots per centimeter–is really determined by the screen you’re looking at. Which is not to say that larger illustrations wouldn’t be a good idea. When I redesign these pages, I want to allow more room for bigger pictures.

Gary Reuben suggests Python Matplotlib and the ggplot2 package for R. The latter is new to me, and very impressive. I want to go read more about it.

Several other readers favor SVG. I’m okay with that. It looks easy to change my current software to generate SVG output instead of (or in addition to) PostScript. The question remaining for me is whether SVG on the web is something that browsers (and, again, RSS readers) can swallow without choking.

John Haugland mentions PDF as the successor to PostScript. I couldn’t possibly survive without PDF these days, but I don’t see it as an ideal medium for illustrations embedded in web pages. Reading PDFs within a browser requires a plugin, which some people refuse to install. (I’m one of those people.) Furthermore, because PDF is a binary format based on a directory of offsets to tables, it’s more trouble to write PDF files than either PostScript of SVG.

Nate mentions Processing, the graphics and animation language created by Casey Reas and Ben Fry. I’ve written about this before at bit-player and elsewhere. I’m a big fan, but Processing is essentially a front end to Java, and I have reservations about embedding Java applets in web pages. John Ressig’s reimplementation of the language in Javascript overcomes that problem, and one of these days I’ll get around to doing something serious with it.

Finally, Marc asks for a look at my Lisp code. I’m always shy about sharing such unfinished things, but here it is.

Update 2010-03-28: When commenter Zvika asked for higher-resolution graphics, my reply was unhelpful; let me now try to say something more useful.

Even though I can’t do anything to improve screen resolution, I can, as Zvika notes, provide a larger version of each image. The Wikipedia way of doing this, which Zvika mentions, is to link to a separate HTML page, which replaces everything on the page you’re currently reading. As a test, I’ve done this with Figure 1 above; click on that figure and you’ll be whisked away to a supersized version of the graph. To get back here, you’ll have to use your browser’s “back” button.

Another approach, adopted by, for example, the New York Times, is a pop-up window. I’ve done this with Figure 2 above; click on it and a new window with a bigger version will open in front of this page.

Personally, I dislike both of these strategies. I want to see the words and the pictures at the same time in the same context. I suppose I could implement (or swipe) some glitzy Ajaxian solution that expands the artwork box within the window. But I still feel the right solution is to redesign the pages so that there’s room for larger illustrations in the first place. (Admittedly I’ve been meaning to do that for more than a year.)

There’s also a question about how best to make use of a more spacious picture box. In the supersized Figure 1, the transformation is much like an optical enlargement, projecting the same information at larger scale. The overall dimensions are doubled, and so is the width of each line, the size of the type, and so on. This is probably the answer for those troubled by visual deficiencies (including my own presbymyopic eyes).

In Figure 2 I have enlarged the illustration by 150 percent but kept all the individual graphic elements (line weights, labels, the glyphs that mark data points) the same size. This kind of rescaling allows information to be read from the graph with greater precision.

Perhaps a middle way is sensible here. Indeed, maybe type should scale as the square root of graph size?

Herbert R. J. Grosch, 1918-2010

26 February 2010

Today I learned of the death of Herb Grosch, proud provocateur and mischief-maker of the computing industry. Anybody who ever knew Herb, however slightly or briefly, has a story to tell, so here’s mine.

Grosch held senior positions at major household-name companies (IBM, General Electric) as well as in the U.S. government; he also spent time at MIT and Columbia and was editor of Computerworld. But through it all he cultivated the role of the outsider, or even the outcast; he saw himself as a lone wolf; at IBM they labeled him a “wild duck.” He boasted that he was the second scientist ever hired at IBM (after Wallace Eckert) and, more important, was the first employee with facial hair. Even when he was an insider he was an outsider. Grosch was elected president of the Association for Computing Machinery–but as a dissident candidate, opposing the slate annointed by the nominating committee.

The free-spirited maverick is a celebrated figure in American culture, but not always a well-rewarded one. Wild ducks don’t get tenure, or a pension.

My brief encounter with Herb came ten years ago, when I was working on an article (PDF) about the uses of ternary notation in computing. I read somewhere that Grosch had proposed a ternary architecture for the Whirlwind, Jay Forrester’s enormous early electronic computer. The idea of a base-three machine probably seems weirder today than it did in the early 1950s, but all the same the proposal was not adopted, and most histories of the Whirlwind project say little or nothing about the ternary option. I had no luck finding a copy of Grosch’s memo on the subject, so I tried getting in touch with him directly. With the help of friends I tracked him down, though not in the first place I looked. He was in Riga, Latvia.

Grosch couldn’t supply a copy of the memo; most of his papers, he said, were buried in a landfill in Switzerland. The best he could do was point me to a passage in his autobiography, summarizing a conversation with Forrester:

I said what might be genuinely gainful would be to store a ternary digit in each core, and calculate in base-three rather than binary fashion. There were materials–some kinds of permalloy, as I remember–that had north, south and neutral stable magnetic states. I told him I had taught my Poughkeepsie evening classes at IBM about a special kind of base-three arithmetic I called “signed ternary,” in which zero was in the middle of the number range. In this curious system there was no need for algebraic signs, no problem about the sign of zero, and you rounded perfectly by dropping digits.

Jay being a stiff type, I refrained from calling the ternary digits “tits,” a name which had been the source of much boyish amusement in the Poughkeepsie classes.

By the way, I’m still looking for a copy of that memo. Here’s the reference: Grosch, H. J. R. 1952. Signed ternary arithmetic. Memorandum M-1496, Digital Computer Laboratory, MIT.

Herb Grosch in Barcelona, circa 2001

Of course I had to ask Herb how and why he’d found his way to Latvia. It was a long story, involving a small NSF grant, archives of Datamation magazine, a collection of Soviet-era computer hardware in an attic at Latvia University, and a romance that Herb hoped would develop into his fifth marriage. But the grant ran out, the marriage plans faltered, and Herb was heading back to U.S. in dire need of employment. His hopes, he said, were “not much above fast-foodery: editing, or tech writing, or even clerking at Barnes and Noble.” He was 82 at the time.

In the end, he did not have to stoop quite so far as editing or tech writing, or Barnes and Noble; he became Adjunct Distinguished Professor of Computer Science at the University of Nevada Las Vegas. But that posting didn’t last long. A few years later he turned up in Toronto, teaching the history of computing, but by then I’d lost touch with him.

In a “Dear Everybody” message some years ago, Grosch wrote:

I begin this letter on the fourth of six recent Binary Days, 01.11.01…. The next Binary Day will be nine years from now, 01.01.10, and I will write again then if I survive. Shorter recipient list, I assume!

Herb did make it until 01.01.10. He died January 25th. For more on his life and works, see the ACM obituary and a web site at Columbia maintained by Frank da Cruz.

The teetotaler’s walk

22 February 2010

Writing about Pólya’s recurrence theorem led me to pick up Gerry Alexanderson’s book The Random Walks of George Pólya. He tells a sweet anecdote about the origin of the theorem.

The random walk is sometimes called the drunkard’s walk, but Pólya’s musings on the subject were not inspired by a disorienting pub crawl. On the contrary, the idea came to him while he was living at the Kurhaus Zürichberg, which Alexanderson notes was founded as a temperance hotel. (It seems to be a somewhat different establishment now, and probably beyond the means of a Privatdozent.)

Alexanderson describes the incident that led Pólya to the recurrence theorem:

There was a particular wooded area near the hotel where he enjoyed taking extended walks while thinking about mathematics. He would carry pencil and paper so he could jot down ideas as they came to him. Some students also lived at the Kurhaus and Pólya got to know some of them. One day while out on his walk he encountered one of these students strolling with his fiancée. Somewhat later their paths crossed again and even later he encountered them once again. He was embarrassed and worried that the couple would conclude that he was somehow arranging these encounters. This caused him to wonder how likely it was that walking randomly through paths in the woods, one would encounter others similarly engaged. This led to one of his most famous discoveries, his 1921 paper on random walk, a phrase used for the first time by Pólya.

It’s interesting that the question raised by those embarrassing woodland encounters differs from the usual statement of the recurrence theorem, which talks about a single walker’s return to the point of origin. The answer is the same, however. In an appendix to the Alexanderson book, K. L. Chung mentions an easy proof of this equivalence: Suppose two walkers begin together at A and run into each other again at B. Nothing about the statistics of a random walk changes if you reverse the direction of all the steps. Thus the probability of the rencontre at B is the same as that of one walker going from A to B and then back to A again.

The illustration below shows the paths of two random walkers that meet three times (green circles) after leaving the origin, supporting the plausibility of Pólya’s claim that he was not stalking those young lovers in the woods.

rencontres-1.png

(What’s not so plausible is that any non-inebriated walker would follow a truly random path.)

Gruenberger’s prime path

16 February 2010

Fred Gruenberger may well have been the first blogger on computational topics. When he was writing, back in the 1970s, there was no RSS, and so he distributed his musings in a monthly newsletter called Popular Computing. A typical issue was 16 or 20 typewritten pages–stapled, folded, stamped and delivered by mail. It was always worth reading.

Gruenberger had been working and playing with computers since the 1940s. For a long stretch he was at the RAND Corporation, the famous think tank in Santa Monica. Later he taught at Cal State Northridge. In addition to Popular Computing he was involved in the startup of Datamation magazine and published at least a dozen books. I haven’t been able to learn much about his later years; he died in 1998.

A slogan that appeared in some issues of Popular Computing proclaimed: “The way to learn computing is to compute.” I took this advice to heart, although I was hampered by a total lack of hardware. Later on I acquired a programmable calculator, which helped on some of the problems and exercises.

Problem 149, from Popular Computing Vol. 4 No. 12, December 1976

The problem reproduced above appeared in the December 1976 issue of Popular Computing (Vol. 4, No. 12). At the time, I made no attempt to work this one out, but evidently the problem seemed interesting enough to be worth filing away. When I came upon the old clipping recently, I gave it a closer look and realized I have no idea how to answer Gruenberger’s question, though the impediment now is not lack of hardware.

Gruenberger asks us to trace a planar path whose steps are indexed by the odd integers starting at 3. For each number N we turn right 90 degrees before taking a step if N is a prime congruent to 1 mod 6; we turn left 90 degrees before moving one unit if N is a prime congruent to −1 mod 6; otherwise we continue straight ahead in whatever direction we happen to be facing.

In his typewriter graphics, Gruenberger plotted the trajectory from N=3 through 97. Below I continue the path through N=199.

trail201b.png

But something’s amiss here. Gruenberger wrote:

Eventually the path will cross itself, so that the cell containing 111 will also contain 147. Similarly, one cell will contain both 91 and 179.

Those two self-intersections are nowhere to be found in the diagram. When I first noticed this discrepancy, I assumed I must have made a mistake somewhere. (This eagerness to blame myself is not mere knee-jerk humility; I have years of experience to back it up.) Eventually, though, I concluded that it was Gruenberger who had made the wrong turn. I believe he mistakenly went left at 127, as shown in the brown trail below:

trail201b-error.png

The brown continuation of the red path includes the two coincidences mentioned in Gruenberger’s problem statement. But the left turn at N=127 is incorrect, because 127 is a prime equal to (6×21)+1, and thus it should specify a right turn. The error is of no great consequence, but it does reveal something interesting: Gruenberger must have been plotting these paths by hand. Most likely he wrote a program to compute the series of residue classes, then traced out the trajectory on squared paper.

Setting aside this anomaly, Gruenberger was quite right that the path does intersect itself. Here’s the trail continued through N=1,001:

trail1001b.png

And if that’s not tangled enough, here’s what it looks like at N=10,001:

trail10001c.png

Gruenberger asks for “a list of the contents of those cells containing more than one number, arranged in the order of the smallest number in the cell.” It’s not hard to identify some cells that belong on such a list. The table below includes all multiply-occupied cells discovered when tracing the path up to N=1,001, sorted as Gruenberger requests:

                   x    y    values of N
                 -11   28    (137 337)
                 -15   27    (147 683)
                 -16   27    (149 349 685)
                 -18   26    (155 355)
                 -19   27    (159 691)
                 -19   28    (161 693)
                 -19   29    (163 695)
                 -17   31    (171 319)
                 -18   32    (175 315)
                 -19   32    (177 701)
                 -20   32    (179 703)
                 -22   31    (185 769)
                 -23   31    (187 771)
                 -24   31    (189 773)
                 -30   41    (245 269)
                 -30   42    (247 271)
                 -27   40    (281 733)
                 -26   40    (283 735)
                 -26   37    (289 725)
                 -23   35    (299 715)
                 -22   35    (301 761)
                 -21   35    (303 759)
                 -20   35    (305 757)
                 -17   27    (351 687)
                 -18   27    (353 689)
                 -17   24    (361 673)
                 -16   24    (363 675)
                 -15   24    (365 677)
                 -17   21    (379 667)
                 -17   22    (381 669)
                 -17   23    (383 671)
                 -20   22    (391 631)
                 -20   21    (393 633)
                 -20   20    (395 635)
                 -20   19    (397 637)
                 -22   19    (401 593)
                 -22   18    (403 591)
                 -22   17    (405 589)
                 -22   16    (407 587)
                 -27   15    (419 575)
                 -27   14    (421 573)
                 -28   14    (423 819)
                 -29   14    (425 549)
                 -32   14    (431 539)
                 -32   13    (433 537)
                 -26   10    (563 831)
                 -26   11    (565 829)
                 -27   13    (571 823)
                 -28   18    (607 811)
                 -22   32    (707 767)
                   4   -6    (923 971)
                   4   -7    (925 969)
                   4   -8    (927 967)
                   4   -9    (929 989)
                   5   -9    (931 991)

Is this list the answer to Gruenberger’s question? No, it’s not, because there’s no reason to stop at an arbitrary limit such as N=1,001. Indeed, the list above is not even a prefix of the complete answer. The smallest value of N appearing in the list is 137, but the trail will eventually revisit cells occupied by smaller values of N. For example, continuing the experiment to N=10,001 reveals a bunch of intersections quite close to the beginning of the path, including a site that’s visited five times:

                   x    y    values of N
                   1    0    (5 1621)
                   1    1    (7 1623)
                   2    1    (9 4725)
                   3    1    (11 1263)
                   3    2    (13 1265)
                   5    3    (19 1635)
                   6    3    (21 1637)
                   7    4    (25 7537)
                   7    5    (27 7319 7539)
                   7    6    (29 7505 7541)
                   6    6    (31 1643 7323 7503 7543)
                   6    7    (33 1645 7325)
                   6    8    (35 1647 7327)
                   6    9    (37 1649 7329)

One point still missing from this list is the origin–the site at x=0, y=0, N=3. Does the path ever revisit its starting point? If so, at what value (or values) of N does it come back home? Since I don’t know the answer to this question, I guess I’ll have to leave it as an exercise for the reader.

I suspect that the problem Gruenberger meant to pose (or thought he was posing) was to generate a list of self-intersection sites arranged in their natural order of occurrence–that is, the order in which the crossings are created when you construct the path starting from the origin. This natural-order list is not at all the same as a list “arranged in the order of the smallest number in the cell.” The natural-order list is easy to generate step by step. All you need to do is obey the left/right/straight rules, plot the resulting sequence of positions on the xy lattice, and leave behind a trail of breadcrumbs so you can check at each step to see if the site has been visited before. This task is a matter of straightforward computation–just the kind of assignment that Gruenberger favored. The natural-order list begins:

                   x    y    values of N
                 -30   41    (269 245)
                 -30   42    (271 247)
                 -18   32    (315 175)
                 -17   31    (319 171)
                 -11   28    (337 137)
                 -16   27    (349 149)
                 -18   26    (355 155)
                 -32   13    (537 433)
                 -32   14    (539 431)
                 -29   14    (549 425)
                 -27   14    (573 421)

Thus the prime path first crosses itself when N=269, a value that shares the same coordinates as N=245, namely x=−30, y=41. There are 56 such crossings up to N=1,001, and 112,988 self-intersections up to N=106.

*     *     *

There is a wilder, conjectural answer to Gruenberger’s challenge–which I’m pretty sure he did not have in mind. It goes like this: Maybe the complete list of revisited values of N is simply the list of all N. In other words, maybe the Gruenberger prime path fills up the entire lattice of integers, crossing over itself everywhere many times.

In 1921 George Pólya published a celebrated proof that a random walk on the lattice of integers is recurrent in one or two dimensions, though not in higher dimensions. Recurrent means that the walk returns to each point along its length with probability 1, and indeed visits every point in its domain infinitely often. Is it possible that the prime path is also recurrent?

Pólya’s theorem is one of those mind-expanding results that seem impossible on first acquaintance, and then inevitable, and finally just so amazing that you want to go kiss a mathematician. I have to confess that I’ve never gotten all the way through Pólya’s original paper (it’s not long, but it’s in German). On the other hand, I can highly recommend a little book by Peter Doyle and Laurie Snell, Random Walks and Electric Networks, which gives several alternative proofs of the theorem; it was published in the MAA’s Carus Monograph series, and there’s a postprint available on the arXiv.

The key insight underlying Pólya’s result, as I understand it, is this: If you never revisit a former home, then you must be spending eternity somewhere else, and you can do that only if your universe has enough somewhere elses that you’ll never run out of new territories to visit. Suppose that, some eons after starting your journey, you find yourself at distance r from the origin. If you’re living in a one-dimensional universe, then there are just two places you could be at that moment, namely at +r or −r. It doesn’t matter how far you run; there are still just two points at any given distance from the origin. In two dimensions, a fugitive at distance r has a little more room to maneuver; the number of available points grows in proportion to r, forming a circle of radius r. But this is still not enough room to get lost in. Only in three dimensions or more is there a nonzero probability of escape. In three dimensions, the space available at radius r is proportional to r2. In this three-dimensional world, the volume of empty space grows faster than a random walker’s expected distance from home.

What does all this have to do with Gruenberger’s prime path? Well, it’s no secret that the distribution of prime numbers looks convincingly random–if you look at it in just the right way. And in particular the distribution of primes in various residue classes, such as 6K+1 and 6K−1, seems to behave at least approximately like a random variable. All this suggests we might consider viewing the Gruenberger prime trail as if it were a random walk through the two-dimensional lattice of integers. Because the space is two-dimensional, it’s a good guess that the walk should be recurrent.

The original recurrence results of Pólya refer to a simple random walk, where at each step the walker chooses randomly among the available directions and then moves one unit in that direction. For example, in the two-dimensional lattice of integers there are four possible directions: north, south, east, west. The simple random walk is not the best model of the Gruenberger process, which is more like a nonreversing random walk–a path where on each step the walker can turn left or turn right or go straight ahead but can never make a 180-degree about-face. We can further refine the random-walk model of the Gruenberger process by biasing the choice made at each step to reflect the changing abundance of prime numbers. Primes grow scarcer as their magnitude increases; in the vicinity of a given value of N, the probability that a randomly chosen number is prime is approximately 1/log N. Since the Gruenberger path goes straight for all composite numbers and turns only when N is prime, the trail will have longer and longer straight segments, and rarer turns, as N increases. A random walk can mimic this behavior by choosing an action at each step according to this logic:

    if random(1.0) > 2/log(N)
       then go straight
       elseif randomboolean()
           then turn left
           else turn right

(The proportion of primes is given as 2/log(N) rather than 1/log(N) because the Gruenberger process is defined on odd numbers only, which immediately eliminates half of the composites.)

One way to compare the various kinds of random walks is to measure the root-mean-square displacement–the distance from the origin to the final position of the walker, averaged over many realizations of the random process. For a simple random walk, the RMS displacement for an N-step walk converges to \(\sqrt{N}\); for the nonreversing random walk the average displacement is \(\sqrt{2N}\). The biased random walk based on the distribution of primes also appears to yield an RMS distance proportional to the square root of the number of steps; numerically the curve looks something like \(\sqrt{10N}\). I’m not entirely sure that’s the true form of the curve, but the geometric details don’t really make much difference. If I understand correctly, all three of these random processes should be recurrent in the sense of Pólya.

rms-graph5.png

Does the same reasoning apply to the Gruenberger prime path? There are two sides to this question.

The naysayer points out that Pólya’s theorem applies to random walks, but there’s nothing truly random about the sequence of primes. After all, we have a straightforward, deterministic algorithm for generating primes, as well as an efficient algorithm for testing whether any given integer is prime or composite. The essence of a random process is that every time you run it you get a different result, but there’s only one sequence of prime numbers, and so the Gruenberger prime path will come out exactly the same every time. According to this view of things, the kind of probabilistic reasoning that goes into the proof of Pólya’s recurrence theorem is out of bounds here. For randomness to make any sense, you need to average over some ensemble of independent instances. For example, you could average over the 50 salmon-pink paths in the graph below, which represent 50 independent realizations of a biased random walk; you can’t average over the prime path itself (green), because there’s only that one path.

random-prime-paths-51.png

The yeasayer retorts that a single path is all you need–if the path is infinitely long. Indeed, the salmon-colored trails above could be interpreted not as 50 distinct runs of a random process but as 50 segments of a single long path, which repeatedly loops around through the point at x=0, y=0, wanders off in various directions, and then comes back home yet again. In essence, everything that could possibly happen in an infinite set of random paths happens somewhere within a single infinite path; all possible variety is already present there.

I’m not sure how to settle this dispute between Dr. Yea and Professor Nay. When an argument hinges on the nature of randomness, the meaning of infinity and patterns in the distribution of the primes, I known I’m in over my head.

So I’ll leave that deep question unresolved and say a final word about a lesser curiosity. In the Gruenberger process, we’re using the congruence classes of prime numbers mod 6 as a kind of coin flip to decide which way to turn. Is it a fair coin flip? For small values of N, it certainly doesn’t look fair:

                               6x+1      6x−1
       primes <     100         11        12
       primes <    1000         80        86
       primes <   10000        611       616
       primes <  100000       4784      4806
       primes < 1000000      39231     39265

There’s a persistent excess of −1 primes, and the imbalance seems to be getting steadily larger. As a result, the prime path has a “winding number” that reaches 8.5 at N=106; that is, the path makes eight and a half net counterclockwise revolutions. Does the windup continue with still larger N? I gather that the definitive answer is “Yes and No.” For more see the masterful paper by Andrew Granville and Greg Martin cited above.

[Correction 2010-02-19: reflected the accent on Pólya.]

Yet another spam update

7 February 2010

spam-2008-02-to-2010-01.png

It seems the great spam tide of 2009 is ebbing. The graph records numbers of spam messages per month received by my email accounts; obviously this is just a personal tally and your milage may vary. The percentage of Russian-language spam in my inbox has fallen off somewhat, from more than half last fall to less than 40 percent in January. (For context see my earlier bit-player spam reports: Aug 2009, May 2009, Mar 2009, Nov 2008, Oct 2008, Jun 2008. Also two American Scientist columns: Jul-Aug 2007 and May-Jun 2003.)

I would also like to draw attention to a recent article on the economics of spam: “Spamalytics: An Empirical Analysis of Spam Marketing Conversion,” by Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Geoffrey M. Voelker, Vern Paxson and Stefan Savage, Communications of the ACM 52(9):99-107 (available online if you or your library is a subscriber). (Preprint link, thanks to Arvind Narayanan in the comments.) Kanich et al. adopt a clever and mildly controversial strategy: They parasitize an existing spam botnet and alter outgoing emails so that embedded links point back to the investigators’ own web servers, rather than those of the spammers. They write:

Using this methodology, we have documented three spam campaigns comprising over 469 million emails. We identified how much of this spam is successfully delivered, how much is filtered by popular antispam solutions, and, most importantly, how many users “click-through” to the site being advertised (response rate) and how many of those progress to a “sale” or “infection” (conversion rate).

Of the 469 million messages, 347 million were part of a campaign advertising pharmaceuticals. The bottom line: 28 “sales” (no money actually changed hands, and no products were delivered) with an average purchase price of about $100. This is a conversion rate of less than 1 in 10 million, which leaves some doubt about the profitability of the operation. Kanich et al. note that spam-for-hire services would charge roughly $25,000 to send 350 million emails, an order of magnitude more than the revenue generated in this campaign. Yet the mere continued existence of spam argues that the actual costs must be much lower. Kanich et al. speculate that the spammers are a vertically-integrated enterprise: They own both the botnet and the pharmacy.

Finally, I want to update my comment on comment spam. Three months ago I installed the Akismet filter to intercept spam comments submitted to bit-player. The filter has been performing fairly well, blocking about 200 spam comments so far, letting a dozen or so slip through, and falsely imprisoning a couple of legitimate comments. (Advice to commenters: Avoid ALL CAPS.)

Looking through the archive of impounded messages, I am more perplexed than ever about the social and economic basis of this phenomenon. The URLs that constitute the payload of the spam point to a weird miscellany of topics. Someone out there thinks that the readers of this blog are interested in hypnotism, in bathroom faucets and shower enclosures, in garage doors and currency conversion. Most of all we are a musical bunch: Fully a third of the spam comments promote merchandise related to pianos.

Unlike email spam, the comment spams are not generated by an automaton; there are living human beings at the other end of this communication channel. And apparently the spam writers are not hired merely for high-speed, wholesale captcha-solving. In many cases there is ample evidence that the commenter has read the article and understood it, and may even have something interesting to say about it.

It occurs to me that this circumstance gives me an opportunity to open a dialogue. So I now address myself directly to the comment spammers: If you are reading this post because you’ve been hired to plant comments and URLs here, or if you’re doing it for your own commercial gain, I have a proposition for you. Please leave a comment attached to this post, explaining what you’re doing and why–and, if possible, for whom. Tell us what a successfully planted link is worth, and how much you get out of it. If your comment is sufficiently interesting and informative, and if it seems genuine, I’ll let it through the Akismet filter, and you’ll have your chance to sell my readers a piano or a garage door. (This offer applies only to comments attached to this article, and I’ll be the sole judge of what’s interesting, informative and genuine.)

If you’d rather communicate privately, send me an email at brian@bit-player.org.