Home-baked graphics

9 March 2010

A couple of commenters have asked what software package I use to create the graphs that appear in bit-player posts–illustrations like the one below, which is a slightly improved version of something I posted last week. Let’s call it Figure 1.

rms-graph2-revised.png

Prompted by these inquiries, I immodestly ask myself: Why do my graphs look so darn good? I immodestly answer: It’s not because of any packaged software! I don’t need a cake mix, or even a recipe. These are home-baked graphs, made from scratch out of locally grown organic pixels.

I have strong opinions about the aesthetics of scientific illustrations, and I could certainly spout off about the design elements of Figure 1, such as that putty-colored background, just dark enough to allow drop-out white grid lines, yet neutral enough to avoid competing with the data curves, which also have a distinctive color scheme on which I could discourse at length. Yes, I can talk the Tufte talk. But I think the commenters were really asking how I create the graphs rather than why they’re so elegant, and so I’m going to focus here on the practical programming problem.

Most of my experience in drawing pictures with a computer comes from the world of print publishing, where the final product is ink on paper rather than pixels on a screen. Compared with the online environment, print has some advantages, notably higher resolution (up to 1,000 dots per centimeter) and precise control over typography and color. But print also has obvious limitations: On a magazine page, there are no mouseovers or clickable buttons, and you can’t make a square knot twirl in 3D.

Thirty years ago, the big challenge for computer-generated illustrations was not how to draw the picture but how to get it out of the computer and onto the printing press. You couldn’t just export a PDF and place it in a Quark or InDesign document; none of those things existed. The only practical option was to print out the artwork, photograph it, and “strip” the negative into the page-size film that would be used to make the press plate. Because of this emphasis on printouts, most of the effort went into programming the printer rather than the computer.

The figure below is the first published computer-generated illustration I had a hand in creating. It appeared in Scientific American in 1983.

epson-freq-table.png

The array of 282 tiny bar graphs was produced with an Epson MX-80 dot-matrix printer, using escape codes to fire combinations of the eight pins in the printhead. Of course the MX-80 was a black-and-white device. The two-color illustration was created from two separate printouts. Also, the Epson letterforms were replaced with typeset characters.

The world of computer-generated illustrations changed dramatically with the arrival of PostScript, the “page description language” created by John Warnock and his colleagues at Adobe Systems (based in part on earlier work at Evans and Sutherland and Xerox PARC). PostScript was designed as a complete programming language rather than just a file format or a set of drawing commands. And something else set it apart as well: attention to details of graphic design. With most earlier software (such as programs based on the Apple Quickdraw library), trying to create publishable figures was an exercise in frustration. For example, the apparent weight of a line would vary depending on its orientation: lighter when vertical or horizontal, heavier when diagonal. PostScript allows very precise control over such niceties of presentation. To take another example, where lines meet the edge of a graph, you don’t want to have to choose between falling short and overshooting; PostScript provides the tools needed to make it look right.

edge-effects.png

(The version in the rightmost panel is created by allowing the colored lines to extend outside the background box, and then applying a clipping mask that cuts off all objects at the boundary of the box.)

Obsessing over minute details like these may seem comically fussy, but I believe that neatness counts in these matters. To some extent, illustration is an art of illusion. Graphs and diagrams work best when you can look through them rather than at them. The viewer should be seeing the underlying information or abstraction–the array of correlation coefficients, the function y = f(x), or whatever–rather than noticing the mechanics of how the drawing was constructed. A ragged edge is the kind of distraction that destroys the illusion.

Although PostScript was a giant step forward from the MX-80 command set, in the early years it was still just another printer language, not a computer language. The only way I could execute a PostScript program was to send it to a laser printer and wait to see what came out. Sometimes it was a long wait. I had no way of running a PostScript program on the computer itself. (Ghostscript came later.)

ChernoffFaces.pngMy first PostScript illustrations were created as hand-written PostScript programs; the same language was used both for doing the computations and for presenting the results. The faces at right were created in this way. (They were inspired by the work of Herman Chernoff and drawn to illustrate an American Scientist article by Robert Levine in 1990.) The dual role of the language caused me a moment of disorientation just now when I went looking for my records of this project. I found an EPS (encapsulated PostScript) file, which I knew was the finished illustration, but where was the source code? And then I remembered: It’s the same file! Open it up in Ghostscript or Adobe Illustrator and you see those silly faces smiling or scowling at you; open the same file in a text editor, and you see procedures for drawing elements of the faces:

   /draweyes
     { newpath
       dx dy eyewidth eyeheight 0 360 ellipse stroke
       ex ey eyewidth eyeheight 0 360 ellipse stroke
     } bind def
   /drawpupils
     { fx fy pupilsize pupilsize 0 360 ellipse fill
       gx gy pupilsize pupilsize 0 360 ellipse fill
     } bind def

Bill Casselman, the graphics editor of the Notices of the American Mathematical Society, still favors this direct-to-PostScript methodology. He has written an excellent guidebook, taking you from the basics of PostScript through an elaborate library for rendering three-dimensional objects.

But here I part company from Casselman; I’d rather not do all my computing in PostScript. It’s not that I have anything against the language itself, but the development environment is not to my taste. I therefore adopted the modus operandi of writing a program in my language of choice (usually some flavor of Lisp) and having that program write a PostScript program as its output. After doing this on an ad hoc basis a few times, it became clear that I should abstract out all the graphics-generating routines into a separate module. The result was a program I named lips (for Lisp-to-PostScript).

Most of what lips does is trivial syntactic translation, converting the parenthesized prefix notation of Lisp to the bracketless postfix of PostScript. Thus when I write (lineto x y) in Lisp, it comes out x y lineto in PostScript. The lips routines also take care of chores such as opening and closing files and writing the header and trailer lines required of a well-formed PostScript program.

But the lips interface is low-level, confined to drawing individual dots, line segments, rectangles and the like. Assembling a complete graph out of these primitives is tedious. For example, the grid of white lines in Figure 1 would have to be drawn one line at a time, with each line specified by a sequence of commands such as

    (newpath)
    (moveto u v)
    (lineto x y)
    (stroke)

Before you can issue those commands, you have to calculate u, v, x and y. Clearly, a higher-level front end is needed; like everyone else, I call mine plot.

At the core of any plotting program is a simple operation: mapping points from an abstract user space to coordinates in a rectangular pane, the page space. In Figure 1, the y axis runs from 0 to 5000; values in this range have to be scaled to the dimensions of the graph, which is about 300 PostScript points, or 11 centimeters. Mathematically, the transformation is straightforward. Indeed, if I wished I could leave all the arithmetic to the PostScript interpreter, simply passing in the appropriate matrix elements for scaling and translation. This is an attractive option; it would allow plot to work entirely in user space. But a few niggling details get in the way. Consider the tick marks along the y axis in Figure 1. Their vertical positions are conveniently expressed in user coordinates: one tick every 500 units. But what about the length of the ticks–their horizontal extent? This dimension is purely concerned with the appearance of the graph and has nothing to do with the content; it ought to be expressed in unscaled units of points or pixels.

Here’s a possible solution: Let everything inside the rectangular frame of the graph–the area with the putty-colored background in Figure 1–go through the scaling engine, but define everything outside the frame, including the tick marks and the axis labels, directly in page coordinates. If you think this is the final answer, take a look at Figure 2:

figure2.png

In this nonsensical graph (constructed just for this occasion), data points are indicated by stars, crosses and diamonds. The positions of those glyphs ought to be defined in user space, but the drawing commands that create the shapes are properly defined in page coordinates. If we tried to draw the glyphs in user space, their size and shape would vary with position in the graph.

What’s the best way to deal with this messy situation? Is there some tidy solution that will reconcile the two coordinate systems and allow all dimensions to be treated uniformly? I don’t believe so; it’s just in the nature of graphs to mix up elements from these two disparate realms. We look through a window into a world of data or mathematical abstractions, but we also draw our own little doodles on the window itself.

Of course there are solutions; they’re just not as pretty as I would like. My own strategy for coping is to attach extra information to each geometric point, indicating whether or not the x and y coordinates are to go through the scaling transformation. This is less troublesome than it might seem; from the user’s point of view, it’s almost always invisible.

In writing the lips and plot programs, I walk a path that is already worn smooth by many earlier footsteps. I don’t know who wrote the first computer program for plotting data, but it probably came soon after the first program for producing data. Today we have hundreds of clever, comprehensive, well-designed and well-maintained programs for plotting and graphing. Gnuplot is very capable; Grace is one I’ve never used but I’ve heard good things about it; Mathematica, Sage, R, MATLAB, Octave and the like all have elaborate graphics facilities built in; the Python world, as usual, has an overabundance of options; there are a few libraries for my beloved Lisp; you can even do dataviz online.

All of which raises the question of why I bother to roll my own. I’ll never keep up–or even catch up–with the efforts of major software companies or the huge community of open-source developers. In my own program, if I want something new–treemaps? vector fields? the third dimension?–nobody is going to code it for me. And, conversely, anything useful I might come up with will never benefit anyone but me.

The trouble is, every time I try working with an external graphics package, I run into a terrible impedance mismatch that gives me a headache. Getting what I want out of other people’s code turns out to be more work than writing my own. No doubt this reveals a character flaw: Does not play well with others.

In any case, the time for change is coming. My way of working is woefully out of date and out of fashion. PostScript is a technology that even Adobe seems to regard as outmoded. And making ultraprecise PostScript graphs is quite silly when their destination is the web; before I can put them online, I have to convert them to low-res PNG images. Furthermore, a PostScript-based workflow loses out on all the interactive richness of the web. These are deathly still images. How can I expect to earn any web cred when my work is not even clickable, much less multitouch-enabled?

If I continue in my stubborn, do-it-yourself mode, I could replace the PostScript back end with one that generates SVG. This wouldn’t be a major undertaking. But is SVG the right answer? It’s been around for more than a decade and you still don’t see much of it in the wild. And there are horrid browser incompatibilities. I suspect that Javascript (and JQuery) has a brighter future. And if I can get over my abreaction to libraries, there are plenty of options. Advice anyone?

Herbert R. J. Grosch, 1918-2010

26 February 2010

Today I learned of the death of Herb Grosch, proud provocateur and mischief-maker of the computing industry. Anybody who ever knew Herb, however slightly or briefly, has a story to tell, so here’s mine.

Grosch held senior positions at major household-name companies (IBM, General Electric) as well as in the U.S. government; he also spent time at MIT and Columbia and was editor of Computerworld. But through it all he cultivated the role of the outsider, or even the outcast; he saw himself as a lone wolf; at IBM they labeled him a “wild duck.” He boasted that he was the second scientist ever hired at IBM (after Wallace Eckert) and, more important, was the first employee with facial hair. Even when he was an insider he was an outsider. Grosch was elected president of the Association for Computing Machinery–but as a dissident candidate, opposing the slate annointed by the nominating committee.

The free-spirited maverick is a celebrated figure in American culture, but not always a well-rewarded one. Wild ducks don’t get tenure, or a pension.

My brief encounter with Herb came ten years ago, when I was working on an article (PDF) about the uses of ternary notation in computing. I read somewhere that Grosch had proposed a ternary architecture for the Whirlwind, Jay Forrester’s enormous early electronic computer. The idea of a base-three machine probably seems weirder today than it did in the early 1950s, but all the same the proposal was not adopted, and most histories of the Whirlwind project say little or nothing about the ternary option. I had no luck finding a copy of Grosch’s memo on the subject, so I tried getting in touch with him directly. With the help of friends I tracked him down, though not in the first place I looked. He was in Riga, Latvia.

Grosch couldn’t supply a copy of the memo; most of his papers, he said, were buried in a landfill in Switzerland. The best he could do was point me to a passage in his autobiography, summarizing a conversation with Forrester:

I said what might be genuinely gainful would be to store a ternary digit in each core, and calculate in base-three rather than binary fashion. There were materials–some kinds of permalloy, as I remember–that had north, south and neutral stable magnetic states. I told him I had taught my Poughkeepsie evening classes at IBM about a special kind of base-three arithmetic I called “signed ternary,” in which zero was in the middle of the number range. In this curious system there was no need for algebraic signs, no problem about the sign of zero, and you rounded perfectly by dropping digits.

Jay being a stiff type, I refrained from calling the ternary digits “tits,” a name which had been the source of much boyish amusement in the Poughkeepsie classes.

By the way, I’m still looking for a copy of that memo. Here’s the reference: Grosch, H. J. R. 1952. Signed ternary arithmetic. Memorandum M-1496, Digital Computer Laboratory, MIT.

Herb Grosch in Barcelona, circa 2001

Of course I had to ask Herb how and why he’d found his way to Latvia. It was a long story, involving a small NSF grant, archives of Datamation magazine, a collection of Soviet-era computer hardware in an attic at Latvia University, and a romance that Herb hoped would develop into his fifth marriage. But the grant ran out, the marriage plans faltered, and Herb was heading back to U.S. in dire need of employment. His hopes, he said, were “not much above fast-foodery: editing, or tech writing, or even clerking at Barnes and Noble.” He was 82 at the time.

In the end, he did not have to stoop quite so far as editing or tech writing, or Barnes and Noble; he became Adjunct Distinguished Professor of Computer Science at the University of Nevada Las Vegas. But that posting didn’t last long. A few years later he turned up in Toronto, teaching the history of computing, but by then I’d lost touch with him.

In a “Dear Everybody” message some years ago, Grosch wrote:

I begin this letter on the fourth of six recent Binary Days, 01.11.01…. The next Binary Day will be nine years from now, 01.01.10, and I will write again then if I survive. Shorter recipient list, I assume!

Herb did make it until 01.01.10. He died January 25th. For more on his life and works, see the ACM obituary and a web site at Columbia maintained by Frank da Cruz.

The teetotaler’s walk

22 February 2010

Writing about Pólya’s recurrence theorem led me to pick up Gerry Alexanderson’s book The Random Walks of George Pólya. He tells a sweet anecdote about the origin of the theorem.

The random walk is sometimes called the drunkard’s walk, but Pólya’s musings on the subject were not inspired by a disorienting pub crawl. On the contrary, the idea came to him while he was living at the Kurhaus Zürichberg, which Alexanderson notes was founded as a temperance hotel. (It seems to be a somewhat different establishment now, and probably beyond the means of a Privatdozent.)

Alexanderson describes the incident that led Pólya to the recurrence theorem:

There was a particular wooded area near the hotel where he enjoyed taking extended walks while thinking about mathematics. He would carry pencil and paper so he could jot down ideas as they came to him. Some students also lived at the Kurhaus and Pólya got to know some of them. One day while out on his walk he encountered one of these students strolling with his fiancée. Somewhat later their paths crossed again and even later he encountered them once again. He was embarrassed and worried that the couple would conclude that he was somehow arranging these encounters. This caused him to wonder how likely it was that walking randomly through paths in the woods, one would encounter others similarly engaged. This led to one of his most famous discoveries, his 1921 paper on random walk, a phrase used for the first time by Pólya.

It’s interesting that the question raised by those embarrassing woodland encounters differs from the usual statement of the recurrence theorem, which talks about a single walker’s return to the point of origin. The answer is the same, however. In an appendix to the Alexanderson book, K. L. Chung mentions an easy proof of this equivalence: Suppose two walkers begin together at A and run into each other again at B. Nothing about the statistics of a random walk changes if you reverse the direction of all the steps. Thus the probability of the rencontre at B is the same as that of one walker going from A to B and then back to A again.

The illustration below shows the paths of two random walkers that meet three times (green circles) after leaving the origin, supporting the plausibility of Pólya’s claim that he was not stalking those young lovers in the woods.

rencontres-1.png

(What’s not so plausible is that any non-inebriated walker would follow a truly random path.)

Gruenberger’s prime path

16 February 2010

Fred Gruenberger may well have been the first blogger on computational topics. When he was writing, back in the 1970s, there was no RSS, and so he distributed his musings in a monthly newsletter called Popular Computing. A typical issue was 16 or 20 typewritten pages–stapled, folded, stamped and delivered by mail. It was always worth reading.

Gruenberger had been working and playing with computers since the 1940s. For a long stretch he was at the RAND Corporation, the famous think tank in Santa Monica. Later he taught at Cal State Northridge. In addition to Popular Computing he was involved in the startup of Datamation magazine and published at least a dozen books. I haven’t been able to learn much about his later years; he died in 1998.

A slogan that appeared in some issues of Popular Computing proclaimed: “The way to learn computing is to compute.” I took this advice to heart, although I was hampered by a total lack of hardware. Later on I acquired a programmable calculator, which helped on some of the problems and exercises.

Problem 149, from Popular Computing Vol. 4 No. 12, December 1976

The problem reproduced above appeared in the December 1976 issue of Popular Computing (Vol. 4, No. 12). At the time, I made no attempt to work this one out, but evidently the problem seemed interesting enough to be worth filing away. When I came upon the old clipping recently, I gave it a closer look and realized I have no idea how to answer Gruenberger’s question, though the impediment now is not lack of hardware.

Gruenberger asks us to trace a planar path whose steps are indexed by the odd integers starting at 3. For each number N we turn right 90 degrees before taking a step if N is a prime congruent to 1 mod 6; we turn left 90 degrees before moving one unit if N is a prime congruent to −1 mod 6; otherwise we continue straight ahead in whatever direction we happen to be facing.

In his typewriter graphics, Gruenberger plotted the trajectory from N=3 through 97. Below I continue the path through N=199.

trail201b.png

But something’s amiss here. Gruenberger wrote:

Eventually the path will cross itself, so that the cell containing 111 will also contain 147. Similarly, one cell will contain both 91 and 179.

Those two self-intersections are nowhere to be found in the diagram. When I first noticed this discrepancy, I assumed I must have made a mistake somewhere. (This eagerness to blame myself is not mere knee-jerk humility; I have years of experience to back it up.) Eventually, though, I concluded that it was Gruenberger who had made the wrong turn. I believe he mistakenly went left at 127, as shown in the brown trail below:

trail201b-error.png

The brown continuation of the red path includes the two coincidences mentioned in Gruenberger’s problem statement. But the left turn at N=127 is incorrect, because 127 is a prime equal to (6×21)+1, and thus it should specify a right turn. The error is of no great consequence, but it does reveal something interesting: Gruenberger must have been plotting these paths by hand. Most likely he wrote a program to compute the series of residue classes, then traced out the trajectory on squared paper.

Setting aside this anomaly, Gruenberger was quite right that the path does intersect itself. Here’s the trail continued through N=1,001:

trail1001b.png

And if that’s not tangled enough, here’s what it looks like at N=10,001:

trail10001c.png

Gruenberger asks for “a list of the contents of those cells containing more than one number, arranged in the order of the smallest number in the cell.” It’s not hard to identify some cells that belong on such a list. The table below includes all multiply-occupied cells discovered when tracing the path up to N=1,001, sorted as Gruenberger requests:

                   x    y    values of N
                 -11   28    (137 337)
                 -15   27    (147 683)
                 -16   27    (149 349 685)
                 -18   26    (155 355)
                 -19   27    (159 691)
                 -19   28    (161 693)
                 -19   29    (163 695)
                 -17   31    (171 319)
                 -18   32    (175 315)
                 -19   32    (177 701)
                 -20   32    (179 703)
                 -22   31    (185 769)
                 -23   31    (187 771)
                 -24   31    (189 773)
                 -30   41    (245 269)
                 -30   42    (247 271)
                 -27   40    (281 733)
                 -26   40    (283 735)
                 -26   37    (289 725)
                 -23   35    (299 715)
                 -22   35    (301 761)
                 -21   35    (303 759)
                 -20   35    (305 757)
                 -17   27    (351 687)
                 -18   27    (353 689)
                 -17   24    (361 673)
                 -16   24    (363 675)
                 -15   24    (365 677)
                 -17   21    (379 667)
                 -17   22    (381 669)
                 -17   23    (383 671)
                 -20   22    (391 631)
                 -20   21    (393 633)
                 -20   20    (395 635)
                 -20   19    (397 637)
                 -22   19    (401 593)
                 -22   18    (403 591)
                 -22   17    (405 589)
                 -22   16    (407 587)
                 -27   15    (419 575)
                 -27   14    (421 573)
                 -28   14    (423 819)
                 -29   14    (425 549)
                 -32   14    (431 539)
                 -32   13    (433 537)
                 -26   10    (563 831)
                 -26   11    (565 829)
                 -27   13    (571 823)
                 -28   18    (607 811)
                 -22   32    (707 767)
                   4   -6    (923 971)
                   4   -7    (925 969)
                   4   -8    (927 967)
                   4   -9    (929 989)
                   5   -9    (931 991)

Is this list the answer to Gruenberger’s question? No, it’s not, because there’s no reason to stop at an arbitrary limit such as N=1,001. Indeed, the list above is not even a prefix of the complete answer. The smallest value of N appearing in the list is 137, but the trail will eventually revisit cells occupied by smaller values of N. For example, continuing the experiment to N=10,001 reveals a bunch of intersections quite close to the beginning of the path, including a site that’s visited five times:

                   x    y    values of N
                   1    0    (5 1621)
                   1    1    (7 1623)
                   2    1    (9 4725)
                   3    1    (11 1263)
                   3    2    (13 1265)
                   5    3    (19 1635)
                   6    3    (21 1637)
                   7    4    (25 7537)
                   7    5    (27 7319 7539)
                   7    6    (29 7505 7541)
                   6    6    (31 1643 7323 7503 7543)
                   6    7    (33 1645 7325)
                   6    8    (35 1647 7327)
                   6    9    (37 1649 7329)

One point still missing from this list is the origin–the site at x=0, y=0, N=3. Does the path ever revisit its starting point? If so, at what value (or values) of N does it come back home? Since I don’t know the answer to this question, I guess I’ll have to leave it as an exercise for the reader.

I suspect that the problem Gruenberger meant to pose (or thought he was posing) was to generate a list of self-intersection sites arranged in their natural order of occurrence–that is, the order in which the crossings are created when you construct the path starting from the origin. This natural-order list is not at all the same as a list “arranged in the order of the smallest number in the cell.” The natural-order list is easy to generate step by step. All you need to do is obey the left/right/straight rules, plot the resulting sequence of positions on the xy lattice, and leave behind a trail of breadcrumbs so you can check at each step to see if the site has been visited before. This task is a matter of straightforward computation–just the kind of assignment that Gruenberger favored. The natural-order list begins:

                   x    y    values of N
                 -30   41    (269 245)
                 -30   42    (271 247)
                 -18   32    (315 175)
                 -17   31    (319 171)
                 -11   28    (337 137)
                 -16   27    (349 149)
                 -18   26    (355 155)
                 -32   13    (537 433)
                 -32   14    (539 431)
                 -29   14    (549 425)
                 -27   14    (573 421)

Thus the prime path first crosses itself when N=269, a value that shares the same coordinates as N=245, namely x=−30, y=41. There are 56 such crossings up to N=1,001, and 112,988 self-intersections up to N=106.

*     *     *

There is a wilder, conjectural answer to Gruenberger’s challenge–which I’m pretty sure he did not have in mind. It goes like this: Maybe the complete list of revisited values of N is simply the list of all N. In other words, maybe the Gruenberger prime path fills up the entire lattice of integers, crossing over itself everywhere many times.

In 1921 George Pólya published a celebrated proof that a random walk on the lattice of integers is recurrent in one or two dimensions, though not in higher dimensions. Recurrent means that the walk returns to each point along its length with probability 1, and indeed visits every point in its domain infinitely often. Is it possible that the prime path is also recurrent?

Pólya’s theorem is one of those mind-expanding results that seem impossible on first acquaintance, and then inevitable, and finally just so amazing that you want to go kiss a mathematician. I have to confess that I’ve never gotten all the way through Pólya’s original paper (it’s not long, but it’s in German). On the other hand, I can highly recommend a little book by Peter Doyle and Laurie Snell, Random Walks and Electric Networks, which gives several alternative proofs of the theorem; it was published in the MAA’s Carus Monograph series, and there’s a postprint available on the arXiv.

The key insight underlying Pólya’s result, as I understand it, is this: If you never revisit a former home, then you must be spending eternity somewhere else, and you can do that only if your universe has enough somewhere elses that you’ll never run out of new territories to visit. Suppose that, some eons after starting your journey, you find yourself at distance r from the origin. If you’re living in a one-dimensional universe, then there are just two places you could be at that moment, namely at +r or −r. It doesn’t matter how far you run; there are still just two points at any given distance from the origin. In two dimensions, a fugitive at distance r has a little more room to maneuver; the number of available points grows in proportion to r, forming a circle of radius r. But this is still not enough room to get lost in. Only in three dimensions or more is there a nonzero probability of escape. In three dimensions, the space available at radius r is proportional to r2. In this three-dimensional world, the volume of empty space grows faster than a random walker’s expected distance from home.

What does all this have to do with Gruenberger’s prime path? Well, it’s no secret that the distribution of prime numbers looks convincingly random–if you look at it in just the right way. And in particular the distribution of primes in various residue classes, such as 6K+1 and 6K−1, seems to behave at least approximately like a random variable. All this suggests we might consider viewing the Gruenberger prime trail as if it were a random walk through the two-dimensional lattice of integers. Because the space is two-dimensional, it’s a good guess that the walk should be recurrent.

The original recurrence results of Pólya refer to a simple random walk, where at each step the walker chooses randomly among the available directions and then moves one unit in that direction. For example, in the two-dimensional lattice of integers there are four possible directions: north, south, east, west. The simple random walk is not the best model of the Gruenberger process, which is more like a nonreversing random walk–a path where on each step the walker can turn left or turn right or go straight ahead but can never make a 180-degree about-face. We can further refine the random-walk model of the Gruenberger process by biasing the choice made at each step to reflect the changing abundance of prime numbers. Primes grow scarcer as their magnitude increases; in the vicinity of a given value of N, the probability that a randomly chosen number is prime is approximately 1/log N. Since the Gruenberger path goes straight for all composite numbers and turns only when N is prime, the trail will have longer and longer straight segments, and rarer turns, as N increases. A random walk can mimic this behavior by choosing an action at each step according to this logic:

    if random(1.0) > 2/log(N)
       then go straight
       elseif randomboolean()
           then turn left
           else turn right

(The proportion of primes is given as 2/log(N) rather than 1/log(N) because the Gruenberger process is defined on odd numbers only, which immediately eliminates half of the composites.)

One way to compare the various kinds of random walks is to measure the root-mean-square displacement–the distance from the origin to the final position of the walker, averaged over many realizations of the random process. For a simple random walk, the RMS displacement for an N-step walk converges to \(\sqrt{N}\); for the nonreversing random walk the average displacement is \(\sqrt{2N}\). The biased random walk based on the distribution of primes also appears to yield an RMS distance proportional to the square root of the number of steps; numerically the curve looks something like \(\sqrt{10N}\). I’m not entirely sure that’s the true form of the curve, but the geometric details don’t really make much difference. If I understand correctly, all three of these random processes should be recurrent in the sense of Pólya.

rms-graph5.png

Does the same reasoning apply to the Gruenberger prime path? There are two sides to this question.

The naysayer points out that Pólya’s theorem applies to random walks, but there’s nothing truly random about the sequence of primes. After all, we have a straightforward, deterministic algorithm for generating primes, as well as an efficient algorithm for testing whether any given integer is prime or composite. The essence of a random process is that every time you run it you get a different result, but there’s only one sequence of prime numbers, and so the Gruenberger prime path will come out exactly the same every time. According to this view of things, the kind of probabilistic reasoning that goes into the proof of Pólya’s recurrence theorem is out of bounds here. For randomness to make any sense, you need to average over some ensemble of independent instances. For example, you could average over the 50 salmon-pink paths in the graph below, which represent 50 independent realizations of a biased random walk; you can’t average over the prime path itself (green), because there’s only that one path.

random-prime-paths-51.png

The yeasayer retorts that a single path is all you need–if the path is infinitely long. Indeed, the salmon-colored trails above could be interpreted not as 50 distinct runs of a random process but as 50 segments of a single long path, which repeatedly loops around through the point at x=0, y=0, wanders off in various directions, and then comes back home yet again. In essence, everything that could possibly happen in an infinite set of random paths happens somewhere within a single infinite path; all possible variety is already present there.

I’m not sure how to settle this dispute between Dr. Yea and Professor Nay. When an argument hinges on the nature of randomness, the meaning of infinity and patterns in the distribution of the primes, I known I’m in over my head.

So I’ll leave that deep question unresolved and say a final word about a lesser curiosity. In the Gruenberger process, we’re using the congruence classes of prime numbers mod 6 as a kind of coin flip to decide which way to turn. Is it a fair coin flip? For small values of N, it certainly doesn’t look fair:

                               6x+1      6x−1
       primes <     100         11        12
       primes <    1000         80        86
       primes <   10000        611       616
       primes <  100000       4784      4806
       primes < 1000000      39231     39265

There’s a persistent excess of −1 primes, and the imbalance seems to be getting steadily larger. As a result, the prime path has a “winding number” that reaches 8.5 at N=106; that is, the path makes eight and a half net counterclockwise revolutions. Does the windup continue with still larger N? I gather that the definitive answer is “Yes and No.” For more see the masterful paper by Andrew Granville and Greg Martin cited above.

[Correction 2010-02-19: reflected the accent on Pólya.]

Yet another spam update

7 February 2010

spam-2008-02-to-2010-01.png

It seems the great spam tide of 2009 is ebbing. The graph records numbers of spam messages per month received by my email accounts; obviously this is just a personal tally and your milage may vary. The percentage of Russian-language spam in my inbox has fallen off somewhat, from more than half last fall to less than 40 percent in January. (For context see my earlier bit-player spam reports: Aug 2009, May 2009, Mar 2009, Nov 2008, Oct 2008, Jun 2008. Also two American Scientist columns: Jul-Aug 2007 and May-Jun 2003.)

I would also like to draw attention to a recent article on the economics of spam: “Spamalytics: An Empirical Analysis of Spam Marketing Conversion,” by Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Geoffrey M. Voelker, Vern Paxson and Stefan Savage, Communications of the ACM 52(9):99-107 (available online if you or your library is a subscriber). (Preprint link, thanks to Arvind Narayanan in the comments.) Kanich et al. adopt a clever and mildly controversial strategy: They parasitize an existing spam botnet and alter outgoing emails so that embedded links point back to the investigators’ own web servers, rather than those of the spammers. They write:

Using this methodology, we have documented three spam campaigns comprising over 469 million emails. We identified how much of this spam is successfully delivered, how much is filtered by popular antispam solutions, and, most importantly, how many users “click-through” to the site being advertised (response rate) and how many of those progress to a “sale” or “infection” (conversion rate).

Of the 469 million messages, 347 million were part of a campaign advertising pharmaceuticals. The bottom line: 28 “sales” (no money actually changed hands, and no products were delivered) with an average purchase price of about $100. This is a conversion rate of less than 1 in 10 million, which leaves some doubt about the profitability of the operation. Kanich et al. note that spam-for-hire services would charge roughly $25,000 to send 350 million emails, an order of magnitude more than the revenue generated in this campaign. Yet the mere continued existence of spam argues that the actual costs must be much lower. Kanich et al. speculate that the spammers are a vertically-integrated enterprise: They own both the botnet and the pharmacy.

Finally, I want to update my comment on comment spam. Three months ago I installed the Akismet filter to intercept spam comments submitted to bit-player. The filter has been performing fairly well, blocking about 200 spam comments so far, letting a dozen or so slip through, and falsely imprisoning a couple of legitimate comments. (Advice to commenters: Avoid ALL CAPS.)

Looking through the archive of impounded messages, I am more perplexed than ever about the social and economic basis of this phenomenon. The URLs that constitute the payload of the spam point to a weird miscellany of topics. Someone out there thinks that the readers of this blog are interested in hypnotism, in bathroom faucets and shower enclosures, in garage doors and currency conversion. Most of all we are a musical bunch: Fully a third of the spam comments promote merchandise related to pianos.

Unlike email spam, the comment spams are not generated by an automaton; there are living human beings at the other end of this communication channel. And apparently the spam writers are not hired merely for high-speed, wholesale captcha-solving. In many cases there is ample evidence that the commenter has read the article and understood it, and may even have something interesting to say about it.

It occurs to me that this circumstance gives me an opportunity to open a dialogue. So I now address myself directly to the comment spammers: If you are reading this post because you’ve been hired to plant comments and URLs here, or if you’re doing it for your own commercial gain, I have a proposition for you. Please leave a comment attached to this post, explaining what you’re doing and why–and, if possible, for whom. Tell us what a successfully planted link is worth, and how much you get out of it. If your comment is sufficiently interesting and informative, and if it seems genuine, I’ll let it through the Akismet filter, and you’ll have your chance to sell my readers a piano or a garage door. (This offer applies only to comments attached to this article, and I’ll be the sole judge of what’s interesting, informative and genuine.)

If you’d rather communicate privately, send me an email at brian@bit-player.org.

A molecular millisecond

6 February 2010

It was not quite a century ago that we got our first glimpse of molecules. William Lawrence Bragg, with a little help from his dad, figured out how to get molecules to sit still long enough for a portrait. First you had to crystallize the substance, then shoot x-rays through the carefully mounted crystal, then record the lace-doily pattern of diffracted rays on photographic film.

After all that lab work came the really hard part: analyzing the pattern of bright dots in order to reconstruct the positions of atoms in three-dimensional space. This was a difficult inverse problem, something like deducing the shape of a musical instrument from the sounds it emits. (The EDSAC, the first working stored-program computer, was put to work deciphering x-ray diffraction patterns circa 1950.)

Finally it came time to build a model of the molecule–and in those days a model occupied physical rather than virtual reality. In the 1970s I visited Max Perutz, who had by then spent more than 30 years working out the structure of the hemoglobin molecule. His offices were cluttered with modeling artifacts: stacked sheets of transparent plastic, marked with hand-drawn contour maps of electron density, lumpy clay and plaster extrusions showing the overall form of the protein at low resolution, and the now-familiar tinker-toy assemblies of balls and sticks.

It was hard-won knowledge, and I thought it heroic science. I still do. And yet everyone knew all along–Perutz more emphatically than anyone else–that those rigid, static models of proteins were highly misleading. In the living cell, biological macromolecules do not sit immobile like bronze statues. They are machines with moving parts; they continually flex and wiggle, mesh and then disengage, spin, flap, bend, stretch; all day long they do a hyperkinetic hokey-pokey.

I have now seen a remarkable performance of that molecular dance. In a talk at Harvard earlier this week David E. Shaw showed two videos, each portraying about a millisecond in the life of a single protein molecule. A millisecond may not sound like much, but the video was created by computing atomic motions at roughly one step per femtosecond. That’s 1012 steps in all. (If you included all the steps in the video, and displayed them at 60 frames per second, the show would go on for 500 years.)

Shaw was once a computer scientist at Columbia, then he went off to make some billions on Wall Street. (He was introduced to the Harvard audience as “King Quant.”) He has now turned to computational molecular biology, setting up his own lab and building a series of special-purpose computers designed for molecular-dynamics simulations. The machines are called Anton, in honor of Leeuwenhoek. Shaw’s group has built eight of them so far, each with 512 processors. A kiloprocessor model is expected to come on line in a few weeks.

The basic idea behind the computations is simple. Start with the initial positions and velocities of all the atoms. Calculate the force that each atom exerts on every other atom, and the resulting acceleration. Wash, rinse, repeat. For a system of N atoms, the naive version of this algorithm has performance proportional to N2; this quadratic growth is a bit of a problem, because the model includes not only several hundred atoms in the protein itself but also up to 50,000 atoms in the surrounding solvent. So Anton takes some shortcuts. The big one is to do a full accounting of pairwise interactions only for atoms within a limited radius; the distribution of more-distant atoms is remapped to a mesh of discrete points. But even after this winnowing of the problem, the calculation of pairwise forces remains the principal bottleneck. It is solved by throwing hardware at it: 32 × 512 parallel pipelines implemented on custom silicon. There’s more on Anton’s architecture and algorithms here; the Shaw Research web site lists lots of other publications as well, but most of them are not accessible without payment.

As far as I can tell, the videos of proteins in motion are not yet available anywhere, and that’s really too bad. They might well be the next dance sensation on YouTube. Watching them in the lecture hall, I was so bedazzled that I neglected to note the identity of the molecules. One was an ion channel, a protein that spans the width of a membrane and controls the passage of some specific ion (potassium, I think, in this case). We watched the six polypeptide strands twisting closed like the blades of a camera iris, shutting off the channel. Another simulation showed an even more dramatic reconfiguration. For many microseconds of biological time, and perhaps half a minute of wall-clock time, the protein sat nervously quivering and fidgeting, hunched up in a compact globule, with occasional minor adjustments to various loops and corners. And then suddenly the whole molecule opened up like a flower blooming; a moment later it closed again. If I understand correctly what Shaw was telling us, the existence of this alternative state had been known from experimental evidence, but the transformation had never been seen before. And, as he remarked early in the talk, “seeing what it looks like” brings a level of understanding that would be hard to achieve by more analytic methods.

Which brings me to my one gripe. The truth is, we still don’t know what a protein really looks like, and we never will, because “looking” is not a well-defined notion for objects smaller than the wavelength of light. Color, for example, is just not meaningful in this realm, and surface texture is also problematic. Thus schemes for depicting molecules are necessarily a matter of convention. It’s worth giving some careful thought to those conventions, choosing graphic forms that convey as much as possible about what we do know without inviting spurious inferences about what we don’t know (such as color and texture).

Some of Shaw’s illustrations use a ribbon-and-sheet scheme invented 30 years ago by Jane Richardson, which still seems to work well for showing the overall architecture of a protein. But other diagrams and videos use a ball-and-stick model to represent atomic detail, and this strikes me as a less-happy choice. Watching that jiggling assembly of balls and sticks (black for carbon, red for oxygen, etc.), I kept seeing a shiny, brittle, plastic model of a protein rather than the protein itself. Surely there are better graphic devices.

Update: Thanks to Ron Dror of D. E. Shaw Research for pointing out an error in my description of the Anton algorithm: Distant charges are not mapped to a continuum distribution but to a mesh of discrete points. (I’ve made a correction above.) Ron also notes that an article on Anton in Communications of the ACM is available in the CACM digital edition.

17 x 17: A nonprogress report

22 December 2009

The question again: Is there a four-coloring of the 17-by-17 grid in which none of the 18,496 rectangles have the same color at all four corners? As I said last time, Bill Gasarch would not have put a bounty on this problem if it had an easy solution. Over the past couple of weeks I’ve invested some 1014 CPU cycles in the search, and a few neural cycles too. I have nothing to show for the effort, except maybe a slightly clearer intuition about the nature of the problem.

If you generate a bunch of four-colored n-by-n grids at random, the average number of monochromatic rectangles per grid increases quite smoothly with n:

random-grid-stats.png

This gradual progression might lead you to suspect that the difficulty of finding or producing an n-by-n grid that is totally devoid of monochrome rectangles would also be a smooth function of n. The truth is quite different.

success-graph.png

Finding a solution is easy for square grids of any size up to 15 by 15. The task suddenly becomes very hard at size 16 by 16. As for 17 by 17, it’s much harder still–and indeed is not yet known not to be impossible. (Details on the data behind this graph: For each size class from n=8 to n=17 I started with 1,000 randomly four-colored n-by-n grids. Then I applied a simple heuristic search (the first of the algorithms listed below) to each grid, running the program for 1,000 × n2 steps. The graph records the number of times this procedure succeeded–i.e., produced a grid with no monochrome rectangles–at each grid size. Up to n=14, the search never failed; at n=16 and beyond, it never succeeded.)

This kind of sudden transition from easy to hard is a familiar feature in the realm of constraint satisfaction. Well-known intractable problems such as graph coloring and boolean satisfiability have the same structure. That doesn’t bode well for any of the simple-minded computational methods I’ve tried. Here’s a brief catalog of my failures. These are algorithms that I’m pretty sure are never going to pay off.

  • Biased random walk. Start with a randomly colored grid. Repeatedly choose a site at random, then try changing its color; accept the move if it reduces the overall number of monochrome rectangles. This is the simplest of all the algorithms. None of the more-elaborate schemes is decisively better.
  • Whack-a-mole. Find all the monochrome rectangles in the grid; choose one of them and alter the color of one corner, thereby eliminating the rectangle. In the simplest version of this algorithm, you choose the rectangle and the corner and the new color at random; in more sophisticated versions, you might evaluate the alternatives and take the one that offers the greatest benefit.
  • Steepest descent. Examine all possible moves (for the 17-by-17 grid there are 867 of them) and choose one that minimizes the rectangle count.
  • Lookahead steepest descent. Examine all possible moves, and then all possible sequels to each such move (for the 17-by-17 grid there are 751,689 two-move sequences); choose a sequence that minimizes the rectangle count. In principle this method could be extended to chains of three or more moves, but the cost soon gets out of hand. (The lookahead technique is the mirror image of backtrack search; it explores the tree of possible moves breadth-first instead of depth-first.)
  • Color-balanced search. Allow only moves that maintain the overall balance of colors in the grid. For example, in a 16-by-16 four-colored grid, color balance implies 16 sites in each color. One way to maintain balance is to make moves that swap the colors of two sites. (There is no reason to think that a rectangle-free grid will have exact color balance; on the other hand, a solution for a large grid cannot depart too far from perfect balance. Thus a color-balanced search might be an effective trick for finding a neighborhood where solutions are more common.)
  • Row-and-column-balanced search. Allow only moves that maintain the balance of colors within each row and each column of the grid. In a 16-by-16 grid, each row and column should have four sites in each of four colors. A simple way to maintain this detailed color balance is to search for “harlequin rectangles” with the color pattern \(\begin{array}{cc}a & b\\b & a\end{array}\) and permute them to \(\begin{array}{cc}b & a\\a & b\end{array}\).

Most of these techniques are greedy: At each stage the algorithm chooses an action that maximizes some measure of progress. On hard instances, a pure greedy strategy almost always fails; the search gets stuck in some local optimum. Thus it’s usually best to temper the greediness to some extent, occasionally choosing a move other than the one that yields the best immediate return. (The family of methods known as simulated annealing are more elaborate variations on this idea, based on insights from thermal physics.)

greediness-traces.png

Here we see traces of three runs of the algorithm identified above as steepest descent, with differing values of a greediness parameter m. (The grid size is 15 &times 15.) At m=0 (no greediness at all), all moves are equally likely to be chosen, and the algorithm executes a random walk on the space of grid colorings. At m=1 (maximum greediness), the program always chooses the highest-ranked move, which works well until the system stumbles into a state where no move can reduce the rectangle count. A value of m=0.3 seems to be a good compromise. (I’ll say a little more below on the greediness parameter; indeed, I have a question about how best to define and implement it.)

After all this fussing with a dozen variations on local-search algorithms, I’m afraid the outlook for success is not promising. With a little patience and some tuning of parameters, any one of these algorithms can solve grids up to 15 &times 15. With a lot more patience and tuning, they’ll eventually yield answers for 16 &times 16. But none of the algorithms come even close to cracking the 17 &times 17 barrier. Solving that one is going to require a fundamentally new idea. Perhaps someone will find an analytic approach to constructing a solution, rather than blindly searching for one. Or perhaps someone will prove that no solution exists.

On the computational front, I suspect the best hope is a family of algorithms known in various contexts as belief propagation, survey propagation and the cavity method. I’ve been hoping that friends who are expert in these techniques might swoop in and solve the problem for me, but if not I may have to give it a try myself.

In the meantime, here’s the thing about greediness (an apt subject for this time of year?). We want to define a function greedy whose arguments are a vector of alternative moves ranked from best to worst and a number m such that 0 ≤ m ≤ 1. If the greediness parameter m is 0, the function returns a random element of the vector. If m = 1, the returned value is always the first (highest-ranked) move. Otherwise, we must somehow interpolate between these behaviors. One attractive notion is to return the first element of the vector with probability m, the second choice with probability m(1 − m), and so on. Thus for m = 1/2 the series of probabilities would begin 1/2, 1/4, 1/8…. For m = 1/3 the first few values are 1/3, 2/9, 4/27….

This scheme works just fine for a vector of infinite length, but there’s a problem with shorter vectors. Consider what happens with the procedure call greedy(v=[1, 2, 3, 4], m=0.5). We have the following table of probabilities:

     1 --> 1/2
     2 --> 1/4
     3 --> 1/8
     4 --> 1/16

But on adding up those values, we come up 1/16th short of 1. What happens to the missing probability? I took an easy way out, distributing the “extra” probability equally over all the elements of the vector. The code looks something like this:

function greedy(v, m)
  for i=0 to length(v)
     if (i==length(v))
        return v[random(length(v))]
     elseif (random(1.0) < m)
        return v[i]

This procedure seems to give sensible results, but I wonder if there might be a better or more natural definition of greedy probabilities. Also, the running time for my code is logarithmic in the length of the vector (assuming m < 1). Is there a constant-time algorithm that gives the same results? (We don’t know the length of the vector in advance, so merely precomputing the table of probabilities is not an option.)

The 17×17 challenge

5 December 2009

William Gasarch is not the Clay Mathematics Institute. He isn’t paying a million bucks for proofs of famous conjectures. But Gasarch is putting up 172 of his own dollars for the solution to an intriguing little stumper. And the prize problem appears to be somewhat easier than the Riemann hypothesis or the P=NP question. (Unless it’s impossible!)

Gasarch sets forth his prize challenge in a blog post, with further background in a paper and in the slides from a talk. All of those works are well worth reading, but for those who don’t want to chase down the references, here’s the gist. Our mission, should we choose to accept it, is to color the nodes of an n-by-m grid, using only a specified number of colors, and observing a particular constraint: Nowhere in the grid may the four corners of a rectangle all have the same color. (Only rectangles with sides parallel to the x and y axes are considered.) For example, here is a four-colored 15-by-15 grid that satisfies the no-monochromatic-rectangles constraint:

grid15r0a.png

In this array of dots there are \( {15 \choose 2}^2=11025\) distinct rectangles. If you care to check through all of them one by one, you’ll find that in no case do all four corners have the same color. In contrast, here is a 16-by-16 grid that is almost but not quite rectangle-free:

grid16r2a.png

Gasarch offers his $289 bounty for any four-colored 17-by-17 grid with no monochromatic rectangles. Why is that grid of particular interest? It’s a border case. Among square grids, all those up through 16-by-16 have been shown to have rectangle-free four-colorings. For the 18-by-18 grid and all larger squares, rectangle-free four-coloring has been proved impossible. For squares larger than 18-by-18, four-coloring has been proved impossible. The status of the 17-by-17 and 18-by-18 grids remains unsettled, but Gasarch believes that both are four-colorable.

Gasarch has much more to say about the mathematics behind this problem. Here I would like to muse on some computational aspects of searching for a 17-by-17 four-coloring.

To state the obvious first, this is not a problem we can expect to solve by exhaustive enumeration. There are 4289 possible colorings of the grid. Casting out symmetries brings that down only to 2 × 4287. There’s not world enough or time for checking them all.

Testing grids at random is also hopeless in the 17-by-17 case. This nonstrategy actually works quite well for small grids. For example, you can readily find a four-coloring of an 8-by-8 grid just by generating a few hundred thousand random colorings. But the method fails for larger grids because the proportion of all colorings that are rectangle-free falls steeply with grid size. (Consider the 2-by-2 grid: There are 256 four-colorings, and all but four of them are rectangle-free.)

To make any progress toward the 17-by-17 case, we’ll have to do at least a little thinking, rather than expecting the computer to do all the work. Here’s one idea that’s very easy to implement: Find a monochrome rectangle somewhere within the grid, change the color of one of its corners, and repeat until you can’t find any more rectangles. This algorithm works reasonably well for grids up to about 12-by-12, but then it runs out of steam. On larger grids, changing the color of a node to eliminate one rectangle is likely to create another rectangle elsewhere (or several more of them). As a result, the system merely takes a random walk, with trendless fluctuations in the number of rectangles at any given moment. You discover a rectangle-free coloring only if the walk happens to stumble on the zero point.

I found the 15-by-15 four-coloring shown above with an algorithm that’s a little more effective even thought it’s no more sophisticated than the corner-twiddling method. The program repeatedly chooses a node at random and tries assigning it all four possible colors, tallying up the number of rectangles for each color choice. Some color or set of colors must minimize the rectangle count; from among these optimal colors the program chooses one at random and sets the node to that color before repeating the loop. This is a “greedy” method: At each step the number of rectangles can decrease or remain constant but can never increase. Greedy methods are notorious for getting stuck in local optima that are inferior to the global optimum. Maybe that’s what happens to my program when I try it on 16-by-16 and 17-by-17 grids. Or maybe the search space is just too large. In any case, when I woke up this morning and checked the results of an overnight run, I did not find a rectangle-free four-colored 17-by-17 grid awaiting me.

Of course I really didn’t have to do any algorithm analysis at all to know that I wasn’t going to win $289 and eternal fame with a day or two of idle hacking. If the problem were that easy, Gasarch and his students would have solved it for themselves long ago.

In spite of these various failures and frustrations, the grid-coloring problem still looks tantalizingly solvable. If a four-coloring of the Gasarch grid exists, it seems like we should be able to find it by some practical computation.

There are certainly lots of approaches more powerful than the blind dart game I’ve been playing. For example, if local optima are the major impediment, some variation on simulated annealing might help.

A more radical possibility is to try to construct an instance rather than merely search for it. If we assume that the four colors are represented as evenly as possible, then the 17-by-17 grid must have 72 nodes in each of three colors and 73 nodes in the fourth color. Starting from a blank grid, it’s easy to mark off 73 nodes in a single color without creating a forbidden rectangle. Adding 72 nodes of a second color is only a little harder. But then the job gets tricky. When you try to fill in a third color, you also by default choose nodes for the fourth color at the same time, and conflicts pile up in a hurry. Some kind of backtracking approach is probably needed here. Gasarch links to a paper by Elizabeth Kupin of Rutgers that explores these ideas in more detail. (If you want to prove the nonexistence of a four-coloring, this is presumably the way to go.)

Gasarch mentions two other promising avenues: integer programming (the discrete variant of linear programming) and SAT solvers–algorithms for the satisfiability problem. Having spent some time hanging out with a few master SAT solvers, I’m intrigued by the latter possibility. You can almost encode the grid-coloring problem as an instance of NAE-SAT, or not-all-equal SAT. Each node of the grid is represented by a variable that can take on any of four values. We group subsets of variables four at a time into clauses, where each clause includes the variables representing the four corners of a rectangle somewhere in the grid. For the 17-by-17 grid there are \({17 \choose 2}^2=18496\) clauses of this kind. The entire formula is satisfied if we can assign values to the 289 variables in such a way that none of the 18496 clauses has all four of its variables with the same value. After 40 years of work on SAT, there’s a highly developed technology for solving such problems. However, there’s a hitch. SAT problems are formulated in terms of boolean variables, with just two values each, but the grid-color variables have four values. Thus a further layer of encoding is likely to be needed, bringing a further explosion in the size of the problem instance.

One final hackerish note: What’s the best way to detect the presence of a monochromatic rectangle in a grid? My candidate goes like this. We encode the rows of the grid in a set of bit vectors–four vectors for each row, representing the four possible colors. For example, the red vector for a row has a 1 at each position where the corresponding node is red, and zeroes elsewhere. The blue vector has 1s for blue nodes, and so forth. Now we can detect a rectangle merely by taking the logical AND of two rows (an operation that could be a single machine instruction). A rectangle exists if and only if the result of the AND is nonzero. at least two bits are set in the resulting vector.

[Thanks to all the commenters for corrections and elaborations.]

El Farol Highway

27 November 2009

traffic-jam-9146.jpg

I got caught Wednesday night in the national pre-Thanksgiving traffic jam. As I was approaching Baltimore, an electronic signboard announced:

Delays on I-95 and I-895

Suggest I-695

Of course I immediately thought: But everyone will read the sign and take I-695, so that road will be jammed too. Then I thought: But everyone who reads the sign will realize that everyone else will also read the sign, and so they’ll not choose 695. Then I thought….

Hmm. There’s something familiar about this problem.

I’m not telling which road I took, but I can report that it was the only congestion-free segment of my trip.

The birth of the giant component

20 November 2009

The waste product of my document scanning project is a slag heap of extracted staples:

staples-in-bowl-2667.jpg

The other day I made a discovery: If you grab one of the discarded staples and lift it, the whole ball of tangled, mangled metal comes along, leaving behind only a few stragglers in the bottom of the bowl.

ball-of-old-staples-closer-2691.jpg

When I noticed this, my first thought was “Hmm, that’s funny.” My second thought was “Oh, of course: Erdős-Rényi.” And my third thought–well, I’m still working on my third thought, as well as thoughts four, five and six.

Erdős and Rényi are Paul Erdős and Alfréd Rényi, who wrote a big paper on “The Evolution of Random Graphs” 50 years ago (Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 1960, 5:17–61). Their paper wasn’t quite the debut appearance of random graphs in the mathematical literature, but it’s usually cited as the theory’s point of origin.

In one version of the Erdős-Rényi process, you start with a set of n isolated vertices and then add random edges one at a time; specifically, at each stage you choose two vertices at random from among all pairs that are not already connected, then you draw an edge between them. It turns out there’s a dramatic change in the nature of the graph when the number of edges reaches n/2. Below this threshold, the graph consists of many small, isolated components; above n/2, the fragments coalesce into one giant component that includes almost all the vertices. “The Birth of the Giant Component” was later described in greater detail in an even bigger paper–it filled an entire issue of Random Structures and Algorithms (1993, 4:233–358)–by Svante Janson, Donald E. Knuth, Tomasz Luczak and Boris Pittel.

What made me think of a connection between Erdős-Rényi graphs and my hairball of staples? What I had in mind was something like this: The long, straight, middle part of a staple corresponds to an edge of a graph, and the bent end pieces, which can grab on to each other, are the vertices. Thus a single staple is a graph consisting of two vertices connected by one edge. When two staples hook up, two of their vertices merge and we’re left with a connected graph of three vertices and two edges. Since each staple contributes one edge and at most two vertices to the graph, the number of edges must be at least half the number of vertices. Thus the graph is always over the threshold for forming a giant component, according to Erdős and Rényi.

The counting part of this analysis seems okay, but I’m afraid the rest of it doesn’t hold up very well. Whatever is going on in the staple ball, the evolution of the system is not well modeled by the Erdős-Rényi process of adding edges to a fixed set of vertices. Instead, each staple brings both an edge and two vertices. The crucial event that makes the cluster hang together is the merging of vertices when staples link hands; this merging has no counterpart in the Erdős-Rényi process.

The underlying problem here is that Erdős-Rényi graphs are purely topological–there’s no concept of distance, and any two vertices are equally likely to have an edge joining them. But the staple graph has important geometric constraints. Two vertices can be joined by an edge only if the distance between them is approximately equal to the length of a staple.

The geometric structure suggests trying a different kind of model–perhaps the kind that describes the molecular structure of liquids and solids. Water molecules, for example, are linked together by a network of hydrogen bonds; each hydrogen atom in one molecule can bond to the oxygen atom in another molecule. But the bonds cannot extend over arbitrary distances; they reach only between neighboring molecules. In the resulting three-dimensional structure the basic motif is a tetrahedron with an oxygen atom at its center and hydrogen atoms at the four corners. (There’s also a schematic two-dimensional model known as square ice.) We might imagine something similar going on with the staples, where the two bent ends can form hydrogen-bond-like links to other nearby staples.

But there’s a problem with such chemical models as well. Atoms have a fixed valence (more or less–let’s not quibble); old-staples-detail-2697.jpgin water, for example, each hydrogen atom can form a hydrogen bond with only one oxygen atom. But we have no reason to suppose that the hooked ends of a staple can attach to only one other staple. As a matter of fact, if such a restriction were enforced, then the staples could form only chains and rings, not dense clusters. In a close look at the actual clusters, it’s easy to find places where three or more staples are all hooked together at the same point. By carefully teasing apart the cluster, I have spotted vertices that appear to have a degree of at least six.

Yet another physical process that might provide a model for the staple graph is diffusion-limited aggregation. This is the mechanism responsible for the filigree pattern in the banner at the top of the bit-player web page. It is generated by sticky particles that drift at random until they wander onto the substrate or touch another particle that is already in contact (directly or indirectly) with the substrate. For staples, I suppose the drifters would be tumbling dumbbells with sticky ends–somewhat harder to simulate.

Another factor to keep in mind is that spatial dimensions are surely important here. For one thing, there’s just more room to maneuver in three dimensions, with more opportunities to glom onto a neighbor. But in the specific case of staples, there’s another reason: Confined to a plane, they have a hard time linking up:

staples-on-plate-detail-2681.jpg

Dispersed on a flat plate, they refuse to coagulate even when swirled vigorously. The reason, presumably, is that secure links form only when the staples can turn 90 degrees and interlock.

In this connection it would seem significant that these are used staples, somewhat varied in shape, with hooked ends that had been bent approximately 180 degrees in the process of stapling and that mostly retained an angle greater than 90 degrees after being pried out the papers. I wondered how shiny new staples would behave, and so I tried the experiment. (Materials and methods: 630 Stanley Bostitch chisel-point staples, model SBS191/4CP, freshly dispensed from an open-jaw Swingline stapler.)

ball-of-new-staples-crop-2700.jpg

I was mildly surprised at the result. Although the aggregation was somewhat looser and more delicate, it really wasn’t that much different. Again we witness the birth of a giant component.