Archive for the ‘computing’ Category

The Library of Babble

Saturday, April 23rd, 2011

The new issue of American Scientist is out, both on newsstands and on the web. My “Computing Science” column takes up a topic I’ve already written about here on bit-player: the huge corpus of “n-grams” extracted from the Google Books scanning project and released to the public by a team from Harvard and Google. (The earlier bit-player items are titled “Googling the Lexicon” and “3.14.”)

After two blog posts and a magazine column, my faithful readers may have had enough of this subject—but not me. I can’t seem to get it out of my system. So I’m going to take this opportunity to publish some of the overflow matter that wouldn’t fit in the column. (And even this won’t be the end of the story. Stay tuned for still more n-grams.)

For the benefit of readers who have not been doting on my every word, here’s a precis. Google aims to digitize all the world’s printed books, and so far they have scanned about 15 million volumes (which is probably about one-eighth of the total). At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, with a dozen collaborators, have been working with digitized text from a subset of 5,195,769 Google book scans. Because of copyright restrictions they cannot release the full text, but they have extracted lists of n-grams, or phrases of n words each, for values of n between 1 and 5. The 1-grams are individual words (or other character strings, such as numbers and punctuation marks); the 2-grams are two-word phrases, and so on. Each n-gram is accompanied by a time series giving the number of occurrences of that n-gram in each year from 1520 to 2008. (For more background and technical detail, see the Science article by Michel, et al.)

If you want to trace changes in the frequency of a specific word over time, Google has set up an online Ngram Viewer that makes this easy. But you can also download the data set, which allows for many other kinds of exploration. The cost is some heavy lifting of multigigabyte files. So far I have worked only with English text (six other languages are also covered) and only with 1-grams, which form the smallest part of the data set.

Here are some basic facts and figures on the English 1-grams:

uncompressed file size (bytes) 9,672,200,350
number of distinct 1-grams 7,380,256
total 1-gram occurrences 359,675,008,445
number of distinct character codes 484
total character occurrences 1,515,454,264,550

That’s a lot of verbiage.

It’s worth pausing for a comment on those 484 distinct character codes. We tend to think of English as being written with an alphabet of just 26 letters, or 52 if you count upper case and lower case separately. Then there are numbers and marks of punctuation, and miscellaneous symbols such as $ and +. The original ASCII code had 95 printable characters. How do you get up to 484? Well, even though these files are derived from English-language books, a fair amount of non-English turns up in them. There’s Greek and Cyrillic and a smattering of Asian languages, as well as all the accented versions of Latin characters seen in Romance and Germanic and Slavic languages. There’s even a little mathematical notation. It’s actually surprising that the data set spans only 484 symbols; this is a small subset of the full Unicode spectrum.

Here is the length distribution of the 7 million distinct 1-grams:

graph of abundance vs. word length for the 7 million distinct 1-grams

Note that the abundance of shorter words is combinatorially limited. With an “alphabet” of 484 symbols, there cannot possibly be more than 484 single-character words, or 4842 two-character words. But this constraint becomes unimportant beyond length 3; for words of five or six letters, only a tiny fraction of all possible combinations are actually observed.

The corresponding distribution for the 359 billion word occurrences looks rather different:

graph of abundance vs. word length for the 359 billion word occurrences

This is roughly what you’d expect to see for a language that encodes information efficiently. As in a Huffman coding, the shortest words are very common, and the sesquipedalian ones are rare. The overall trend is generally linear, and it is remarkably so in the range of lengths from 5 to 11 or 12. Is this a well-known fact? (It is not the shape I would have guessed.) Words of length three and four stand out above the linear trend line. And I should mention that the abundance of single-character words is boosted in this tabulation by the inclusion of punctuation marks, which would probably not be counted at all in other studies of word length.

The graph below is one that appears in my American Scientist column. I reproduce it here because I want to call attention to some curious features of the curves.

historical time series of distinct words and word occurrences

The graph shows the number of distinct words per year (blue) and the total word occurrences per year (red) over the past 200 years or so. I find it interesting that some major historical events are visible in this record. There are dips in both curves at the time of the American Civil War and during both World Wars of the 20th century. Presumably, book publishing languished during those years. There are also ripples that might be attributed to the crash of 1929 and the ensuing Great Depression, although they are less clear.

Other features of the curves don’t have such a ready-made historical explanation. I am particularly curious about the broad sag in the red curve (but not the blue one) from the late 1960s through the 1970s. Between 1967 and 1973, the number of word occurrences declined 12 percent, while the number of distinct words rose 1 percent. This is very strange: We continued to invent new words, but we didn’t make much use of them. Nowhere else do the two curves maintain opposite slopes for any extended period. I can’t explain it. I call it the great Nixonian slump.

I thought I might learn something about the slump by looking at a selection of specific words that exhibit this pattern of abundance—less frequent in the early 70s than in surrounding years. So I extracted about 70,000 of them and sorted them by overall abundance. In many ways it’s an interesting collection. At the very top of the list is Reagan, apparently in eclipse during the Nixon years. Then there’s a swarm of words connected with mechanical and automotive engineering: torque, rotor, stator, pinion, crankshaft, impeller, carburetor, alternator. All of these words might plausibly appear in the same books, so seeing them fade in and out together is not so surprising. Still, one would like to know what happened in book publishing to cause the decline. The 70s were a tough time for the auto industry; is that enough to explain it?

Looking for patterns in these words is fun, but the truth is I don’t believe they have anything to do with the overall Nixonian swoon. Most words have large fluctuations in abundance over a time scale of a decade or two; my extraction procedure merely identified a subset of words that happened to enter a trough at the same time. I could find similar sets for other periods.

In another attempt to explain the slump I divided the data set into halves. One half consists of the 100 most frequent words, which happen to account for almost exactly half of all word occurrences. The other 7,380,156 words make up the second half. Maybe the slump afflicted only common words, or only rare ones? The result was remarkably uninformative:

times series for the top and bottom halves of the frequency range

The two curves trace almost exactly the same time course. Or maybe that’s not so uninformative after all: It tells us that whatever phenomenon causes the slump, the effect is spread out over the entire vocabulary, not just words in a certain frequency range.

Here’s another just-so story I’ve been telling myself in an attempt to understand the Nixonian swoon. The 1960s is when phototypesetting began to displace the older technology of metal type. Suppose that some characteristic of early phototypeset books causes trouble for optical character recognition (OCR) systems. On reading such a book, the OCR program would report an exaggerated number of unique words (since instances of a given word could be read in various different erroneous ways) but the total number of word occurrences would remain constant (since every word is still recognized as some word). This isn’t exactly the pattern we’re seeing, but there’s one more factor to take into account. In the Harvard-Google data set, no word is included unless it appears at least 40 times. If the phototypeset text caused the OCR system to misread the same word in many different ways, some of those errors would fail to reach the 40-occurrence threshold and would just disappear. Thus the total occurrence count could decline even as the number of distinct words increased.

Lying awake in the middle of the night, I thought this story sounded really good. When I got up in the morning, I ran a test. If an unusual number of words are falling off the bottom of the distribution during the Nixon years, then we should also see a bulge just above the bottom, consisting of those words that just barely reached the threshold. But here’s the time series for the 15,518 words with exactly 40 occurrences:

time series for words with exactly 40 occurrences

There’s no hump in the late 60s. On the contrary, the curve looks much like all the rest, with the same sorry sag. Isn’t it annoying when mere fact overturns a perfectly lovely theory?

But enough of the Watergate era. I have a few more loose ends to tie up.

A graph in the magazine illustrates our collective fondness for round numbers—those divisible by 5 or 10. I remark in the text: “Dollar amounts are even more dramatically biased in favor of well-rounded numbers.” Here’s the evidence:

frequency of dollar amounts for numbers from $1 to $100

Dollar amounts mod $1 tell the same story:

prevalence of cents values between 1 and 99

I was not surprised at the prominence of $X.25, $X.50 and $X.75, but in this graph I had expected also to see a strong signal from prices a penny less than a dollar. That signal is detectable but not conspicuous. Apparently, items mentioned in books—perhaps the books themselves—are more commonly priced at $X.95.

Finally, I want to mention two small but troublesome anomalies in the downloadable 1-gram files. The Google OCR algorithms treat most marks of punctuation as separate 1-grams. Thus we can count how many periods, colons and question marks appeared in printed books over the years. But the encoding of two of these symbols was garbled somewhere along the way. The most abundant 1-gram in the entire data set—with 21,396,850,115 occurrences, or about 6 percent of the total—is listed in the files as the double quotation mark. In fact it should be identified as the comma. (In the web interface to the online Ngram Viewer, the comma is a separator, which may have something to do with the confusion in the downloadable files. The character is correctly given as a comma in the private files of Aiden and Michel.)

The second problem entry is even weirder. Loaded into a text editor, it looks like this:

                          """ "

Closer examination with a binary editor shows that the space between the third and fourth quote marks consists of two control characters, with hexadecimal values 0×15 (NAK) and 0×12 (DC2). I make no sense of this, and Aiden and Michel have not yet been able to help. All the same, I think I know how to fix it. If the entry for the double quotation mark is actually a comma, then something else has to be the double quote. This bizarre character string looks like a good candidate.

Rashid’s bits

Sunday, April 10th, 2011

I’m in Pittsburgh this weekend, attending a conference at Carnegie Mellon University. The sessions are being held in the Rashid Auditorium of the Gates Building.

Gates bldg CMU

The seats in the auditorium are upholstered with a pattern that seems apt for a computer science department, but it also begs for elucidation:

Rashid seat bits small 0769

What is the patterning principle in these bits?

I’ve asked half a dozen people in the department—faculty, grad students, staff—but they all profess ignorance. Perhaps it’s a secret that’s never divulged to outsiders.

A quick web search turned up only one pertinent reference—a publication of the CMU chapter of the ACM. A photo caption on the last page includes this remark:

Do the binary codes written on the seats in Rashid Auditorium mean anything? After further investigation, we have concluded that while the binary patterns seem not to repeat, they do not map to any meaningful ASCII words.

Could it be true that the sequence has no repetition anywhere in the auditorium? I guess that depends on what you mean by repetition. Obviously short subsequences are repeated, and Axel Thue proved a century ago that no binary sequence can go on indefinitely without repeating some block of bits. But I have not tried to survey all 247 seats looking for the longest repeated block. (Most of the seats are occupied at the moment.)

I have a sneaking suspicion about what lies behind this pattern, but I’m trying to pay attention to a series of talks as I write this, so I’m not going to investigate further until I get home. In the meantime, I thought I’d throw open the question to anyone else who wants to speculate. To make the job a little easier, I’ve transcribed a few hundred bits:

1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1

Any ideas?

Update 2011-04-12: The “sneaking suspicion” I had in mind the other day was ridiculously wrong. I had wanted the pattern to be a segment of the Thue-Morse sequence, which I obliquely alluded to above. (Obliquely and also clumsily, as one commenter rightly pointed out.) The Thue-Morse sequence is one of the little gems of discrete mathematics, and it seemed like a perfect choice for a temple of high nerdiness like the CMU CS department. Here’s how the sequence begins:

0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1

At first glance—just taking in the optical texture of 1s and 0s—it seemed a good match for the fabric in Rashid Auditorium. But a closer look at the pattern on the chairs shows that the bits there cannot possibly be drawn from the Thue-Morse sequence. The key property of the sequence is that it is “cube-free”: No block of bits w is repeated three times in succession, as in www. The block w can be of any length, including a length of just one bit. The fabric pattern includes subsequences of 000 and 111, both of which violate the cube-free constraint. (I should have noticed this before I even got up out of my seat, but I didn’t.)

If not Thue-Morse, then what?

Several commenters note that it cannot simply be bits chosen uniformly at random, and I totally agree. Here are the frequencies of subsequences from one to four bits long:

Rashid 1 bit freqs

Rashid 2 bit freqs

Rashid 3 bit freqs

Rashid 4 bit freqs

For random bits, all of these distributions are flat: Each k-bit subsequence occurs with expected frequency 2k. The pattern at CMU is anything but flat. It has prominent peaks and valleys, with a notable excess of alternating 1s and 0s, and a corresponding deficit of subsequences that repeat the same symbol.

Another way to look at the bit sequences is through the autocorrelation function, which measures how well the sequence matches up with itself when shifted along its length by some fixed distance. An autocorrelation of 1.0 indicates a perfect match, and –1.0 is perfect anticorrelation—all the bits are different. For any series, the autocorrelation has to be 1.0 at a shift of 0. As the graph below shows, the autocorrelation function for the Rashid sequence is strongly negative at a shift of 1, and positive at shift 2. These results are consistent with the observed prevalence of alternating bits. Interestingly, the correlations continue to oscillate at larger shifts, with little sign of damping. What’s even more curious, for shifts greater than 4, the sign of the correlation is the opposite of what you’ve expect for an alternating 01 sequence.

Autocorrelation rashid random thue

For comparison, the graph also shows the autocorrelation function of the Thue-Morse sequence, which has a fascinating structure of its own, and that of a random sequence of bits. Except at shift 0, the random sequence should converge on 0 autocorrelation everywhere.

The results above are based on a somewhat larger sample than the one I gave on Sunday. When I got home from Pittsburgh, I transcribed 35 rows of 79 digits each, for a total of 2,765 bits. I’ve treated the rows as 35 independent samples, since I don’t know how to join them without creating spurious conjunctions.

If you’d like to play with the bits for yourself, they are all here in a PDF that includes both an image of the fabric and my transcription of it. I should mention that typing 2,765 binary digits is not a lot of fun. Before embarking on that project, I struggled to get useful output from OCR services such as Online OCR, but the character-recognition algorithms try desperately to turn numerical data into English, producing stuff like this:

LOLOL010101.11.0110101.1.1010001010111.01.LoLotootoLoc t tot totot IMO /0t 100101 L LO LOLOt L tO1O 101000 OIOtOOlOLOtOlttOttOtOlttOOlOLOttOC

LOL indeed.

At this point, I admit I’m stumped. I just don’t know where all those bits came from.

The CMU-ACM magazine I mention above says the bit stream is not ASCII text. I don’t know how the author determined that, but I tend to agree. ASCII and most other byte-oriented encodings would show an autocorrelation spike at a shift of 8, and there’s no evidence of that.

I’m also sure the sequence is not Morse code. When you smush together letters and words in dot-dash encoding, the result has many more long runs than we see here.

Could it be the binary object code for some program? Rick Rashid was the principal author of the Mach operating system, created at CMU, so a fitting tribute would be to sew his work into the seats of the auditorium he funded. But I’ve never seen a core dump that looked anything like this. Most machine code has blocks of highly repetitive bit patterns.

Maybe the underlying message is text or an image that’s been run through a compression algorithm, such as gzip? That would raise the entropy of the signal and flatten the distribution of subsequences to some extent. But would it yield the specific pattern we see here? In particular, would it favor alternation of 0s and 1s? I don’t see why.

In the comments, Zvika suggests that the sequence looks like the product of a human operator trying to type randomly. I resist this idea. It’s just such an inhuman waste of human resources; I can’t stand to think of some poor schlub pecking away at a keyboard for hours, struggling to be mindless about it. But I suppose it could be true.

I came up with a slightly less barbaric alternative scenario. Suppose a fabric designer says to his programmer friend, “I need some random bits. Can you help me out?” The programmer goes away and fires up her RNG, and comes back with a million bits that pass all the usual tests of randomness. The designer says thanks, but later he complains, “These bits don’t look random. They’ve got all these long runs of 0s and 1s.” The programmer explains about independent choices from a uniform distribution, and the 1/2k spectrum of run lengths, but the designer will have none of it. “I don’t care about the math; give me something that looks right,” he says. So the programmer goes away and creates some new bits, meant to satisfy people who believe that when a coin has come up heads three times in a row, it’s more likely to show tails on the next toss.

I made an attempt to create such a function myself. It’s a simple Markov process, in which each appearance of a 1 in the sequence reduces the probability p that the next symbol will also be a 1. Specifically, I start with p = 1/2. If the next bit is a 1, p is adjusted to p/k, where k is a constant greater than 1. If the new bit is instead a 0, p becomes 1 — p/k. When I set k = 2, this process generated a stream of bits that looked much like the fabric pattern. Here’s a small sample:

1 0 1 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

And the subsequence frequencies also seemed broadly similar to those of the Rashid sequence:

Markov 3 bit freqs with lollies

(The orange bars are the results for the Markov process; the green lollipops show the corresponding frequencies for the Rashid bits.) This outcome was at least mildly encouraging, but then I looked at the autocorrelations in the Markov-generated stream of bits:

Autocorrelation markov

At shifts of 1 and 2, the Markov autocorrelations match those of the Rashid sequence almost exactly, but at longer distances the Markov correlations fade away, as they must in any random process. The long-range structure in the Rashid sequence seems all the more peculiar when displayed in this context.

Still another question is whether the fabric pattern can be truly aperiodic at large scales. If it were a printed fabric, I would be highly skeptical of this proposition. The printing process is inherently repetitive, with the same pattern recurring at intervals determined by the size of the printing plate or cylinder. (Think of wallpaper patterns.) ButRashid fabric detail 0792 the pattern on the upholstery cloth is not printed; it is woven into the fibers, and that changes everything. Looms have been programmable since the time of Jacquard, and there is no fundamental reason a modern loom could not be driven by a computer that supplies an arbitrarily long sequence of nonrepetitive instructions. The Thue-Morse sequence would do nicely for this purpose. So I can easily imagine an aperiodic upholstery pattern; I just don’t know if that’s what we’re seeing here.

Later this week I’ll go take a look at the Harvard version. (Thanks to Max Lin for discovering this second instance.)

Update 2011-04-13: I note here for the benefit of those who don’t read the comments: rouli has swept away all this speculation about looms driven by computers generating infinite aperiodic sequences. In the section of the pattern I transcribed, the last nine lines are identical to the first nine. I’ve checked a couple of other photos (of other seat backs) and confirmed that the whole pattern appears to repeat on a cycle of 26 lines.

How appalling that I could type more than 700 repeated characters and never notice the resemblance.

Update 2011-04-13, later:Yikes! And now Barry Cipra reveals that the pattern also repeats horizontally, with period 60.

Give me that good old-fashioned AI

Saturday, February 19th, 2011

It is said that to explain is to explain away. This maxim is nowhere so well fulfilled as in the area of computer programming, especially in what is called heuristic programming and artificial intelligence. For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself “I could have written that”. With that thought he moves the program in question from the shelf marked “intelligent”, to that reserved for curios, fit to be discussed only with people less enlightened than he.

These words were written in 1966 by Joseph Weizenbaum, in a paper that explained—and explained away—his famous program ELIZA, the mock psychotherapist. ELIZA’s conversations could occasionally suggest a glimmer of understanding:

I need some help, that much seems certain.

WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP?

Perhaps I could learn to get along with my mother.

TELL ME MORE ABOUT YOUR FAMILY.

But Weizenbaum revealed that ELIZA was merely picking out a few keywords from the text and applying simple syntactic transformations, along with a dose of randomness. The program was manipulating symbols, but the symbols had no meaning attached to them.

What about Watson, the new Jeopardy champion? Watson gave a dazzling performance this past week, decisively winning a two-game match against the best human players in the history of the quiz show. But will the magic crumble if we look closely at how it works? Does the program really understand those quirky Jeopardy clues, or is it just pushing symbols around, in the manner of ELIZA?

The most detailed account of Watson’s innards that I’ve been able to find is an article published in AI Magazine last fall by David Ferrucci of IBM, the project’s lead engineer, and a dozen colleagues from IBM and Carnegie Mellon. (”Building Watson: An overview of the DeepQA project,” AI Magazine 31(3):59–79. The article is behind a paywall at the AI Magazine web site, but resourceful internauts may find it elsewhere.)

Here’s the overview:

The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses.

Thus we learn that behind Watson’s calm, metallic voice is a clamor of 100+ agents doing their massively parallel probabilistic evidence-based thing. This is not one big brain but a society of mind. (By the way, I think “probabilistic” means simply that potential answers are scored by assigning them probabilities; as far as I can tell there is no randomness or indeterminacy in the algorithms, but I could be wrong about that.)

The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system.

This description implies that the system is indeed making an effort to dig down into the semantics of natural language. But how does it attempt to understand the clue? The rest of the paragraph is more of a shopping list than an explanation:

The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering.

The reference to (McCord 1990) is perhaps the most illuminating item in this list. The author is Michael C. McCord, who is at IBM’s Yorktown Heights lab where Watson was built. The phrase “deep parses” apparently refers to McCord’s idea of a slot grammar, which provides a single framework for combining the syntactic analysis of sentences (subject, predicate, object, etc.) with semantic features (word senses, logical relations, predicates). Unfortunately, the AI Magazine article gives no further hints about how slot grammars are used in the analysis of Jeopardy clues. (For some useful recent accounts of slot grammars, see the links in McCord’s publications list.)

The part of the question-analysis phase that Ferrucci et al. discuss at greatest length is a process they call “LAT detection.” LAT is “lexical answer type”: a word or phrase in the clue that specifies what kind of response is wanted—a person, a city, a book, a substance, and so on. Consider this clue, in a category titled “Oooh…. chess”:

Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.

The LAT in the clue is “maneuver”: Whatever the answer is, it must be something that can plausibly be described as a maneuver. If you were to fixate on the wrong LAT—say “the game” or “two pieces”—you’d have no hope of coming up with the correct answer. Naming the two pieces “king and rook” would not score any points, even though that particular choice of pieces suggests you have the right idea in mind; to get credit for the answer, you need to give the name of the maneuver: “castling.”

Identifying the correct LAT is clearly important. It’s also clearly difficult. What’s not so clear is how Watson does it. In the chess example, does “maneuver” stand out from the rest of the words in the clue for grammatical reasons (it’s the subject of the main clause), or because it’s pointed to by the demonstrative adjective “this,” or for some other reason? How would you write a program to identify the LAT of an arbitrary Jeopardy clue?

Moving on from the analysis of questions to the finding of answers, the algorithmic details remain a little fuzzy.

Watson has access to various sources of “structured” knowledge: relational databases, taxonomies, ontologies. With such resources, retrieval is straightforward. Yet it turns out that few clues can be reformulated as database queries. Ferrucci writes: “Watson’s current ability to effectively use curated databases to simply ‘look up’ the answers is limited to fewer than 2 percent of the clues.” I suppose this is not really surprising. If the game could be reduced to database lookup, it wouldn’t be much fun.

For the other 98 percent of the queries, I gather that the retrieval process is more like Googling for the answer. The machine has no live internet connection during the Jeopardy contest, so it can’t actually search the web. But lots of free-form textual data was loaded into the Watson servers ahead of time, including all of Wikipedia and many other reference works. Using these documents as seeds, the system then trawled the web for other sources that might be useful, and cached copies of them for use offline. About four terabytes of material was available for query answering.

As for the search methods applied to this archive, the article by Ferrucci et al. offers another shopping list:

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.

In a long series of training runs the system was tuned to balance the competing demands of coverage, accuracy and speed.

The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.

The trouble with free-form textual search is that you may very well identify relevant snippets of text but still have a hard time extracting the correct answer. Indeed, the same kind of analysis that goes into figuring out the question also has to be applied to candidate answers. For example, Ferrucci et al. discuss this clue: “He was presidentially pardoned on September 8, 1974.” Among the materials retrieved by the search algorithm was the text fragment: “Ford pardoned Nixon on Sept. 8, 1974.” For a human player with a little knowledge of U.S. history, this result would be more than enough to settle the matter, but a computer program still has some work to do. Suppose the program has correctly identified the LAT of the clue as “He,” and suppose further that it knows that both “Ford” and “Nixon” refer to male persons, perhaps even that they were presidents. Which of the two names is the right choice? Several of the tests that Watson applies are essentially string-matching algorithms, similar to those that search DNA sequences for genetic patterns. Those algorithms might count how often each name occurs in association with the given date, but that result will not resolve the ambiguity in this case. The correct answer comes from a program module that undertakes a deeper logical analysis and recognizes the difference between subject and object in the two statements.

•     •     •

Given this glimpse into how Watson works, do we deem its intelligence to be explained, or explained away? Personally, I have mixed feelings.

I admit to a sentimental fondness for what John Haugeland called Good Old-Fashioned AI, or GOFAI—the ambitious kind of artificial intelligence that aspired to build a true thinking machine, a system with some deep internal representation (a mental model) of the world in which it functions. The outstanding example of this style is Terry Winograd’s SHRDLU program, written in 1970, which conversed about objects in a world of toy blocks on a tabletop. At the time, Winograd firmly asserted that the program was able to “understand discourse,” and he meant by this that the program understood not only the words but also the objects and relations the words referred to.

The promise of SHRDLU was that we could extend the same methods to broader domains of discourse, steadily building toward a general-purpose, human-like intelligence, with the same kind of carefully planned knowledge representation. But that never happened. Later in the 1970s, AI entered a time of troubles. When it came back in the 80s, the emphasis had shifted, and the technology had diversified. The new AI focused on expert systems, on data mining, on statistical rather than deductive methods; another branch of AI turned away from the human cerebral cortex in favor of the motor neurons of the cockroach. Overall, the field took a more pragmatic turn, with less concern for understanding the ultimate nature of intelligence and more energy invested in getting useful results, whatever the methodology.

Watson is in this latter-day pragmatic tradition, with its 100+ agents and its massively parallel probabilistic evidence-based architecture. Compared with SHRDLU, it’s all so messy, so ad hoc, so opaque. But it works, doesn’t it.

And I suppose my own mind is not quite as tidy as I would like to believe.

•     •     •

Although Watson won its Jeopardy match by a wide margin and made very few mistakes along the way, the moment everyone will remember is the program’s spectacular flub of a Final Jeopardy question on the second night. The category was “U.S. Cities,” and the clue was:

Its largest airport is named for a WWII hero, its second largest for a WWII battle.

Watson replied “Toronto.” As it happens, I got that question right; just seconds after the clue was revealed, I called out “Chicago.” Later, though, when I thought about the mental process that led to my answer, I realized that this was not at all a product of well-focused deductive reasoning. I was doing the same kind of scattershot, parallel, probabilistic groping in the dark that I frown on in a machine.

My “reasoning” went something like this: If it has two airports, it must be a pretty big city…. New York has three airports…. There’s Dallas, with DFW and Love—but no heroes or battles there. Chicago has two. Oh! Midway—that must be the battle of Midway.

That’s when I pressed the buzzer.

Note how sketchy my thinking was. I had no idea O’Hare was named for a war hero. As a matter of fact, I had no idea that Midway was named for the naval battle. If I had been asked in a more straightforward way, “Why is Chicago’s second airport named ‘Midway’?”, I would have guessed that it lies halfway between Point A and Point B. The Pacific island would not have entered my consciousness. And I never bothered to dig any deeper into the catalogue of multi-airport cities—Washington, San Francisco, Houston (isn’t G. H. W. Bush a WWII hero?).

So messy and ad hoc.

Googling the lexicon

Monday, December 20th, 2010

In 1860, when work began on the New English Dictionary on Historical Principles (better known today as the Oxford English Dictionary), the basic plan was to build an index to all of English literature. James Murray, principal editor of the OED from 1878 to 1915, poses in his scriptoriumVolunteer readers would pore over texts and send in paper slips with transcribed quotations, each slip showing a word in its native context. The slips of paper were collected in a “scriptorium,” where they were sorted alphabetically and became the raw material for the work of the lexi­cographers. The project was supposed to be completed in 10 years, but it took almost 70. Some 2,000 readers contributed 5 million quo­tation slips, citing phrases from 4,500 published works.

Google and Harvard have now given us a new index to the corpus of written English—and they’ve thrown in a few other languages as well. The data cover more than 500 billion word occurrences, drawn from 5,195,769 books (estimated to be 4 percent of all the books ever printed). The entire archive is being made available for download into your home scriptorium, under a Creative Commons license.

The project was announced December 16th with the online release of a paper to appear in Science. Here’s a quick rundown:

Publication: The Science article is “Quantitative Analysis of Culture Using Millions of Digitized Books.” It is supposed to remain freely available to nonsubscribers. See also the supplementary online material.

Authors: Jean-Baptiste Michel and Erez Lieberman Aiden of Harvard, with a dozen co-authors: Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker and Martin A. Nowak.

Languages: English, Chinese, French, German, Russian and Spanish. There are actually five archives for English, based on various subsets and intersecting sets of the underlying texts (e.g., British vs. U.S.).

Data format: The OED quotations were meaningful hunks of text—typically a sentence. Here we get n-grams. A 1-gram is a single word or other lexical unit, such as a number. An n-gram is a sequence of n consecutive 1-grams. The Harvard-Google collaboration has compiled lists of n-grams for values of n between 1 and 5. Thus the 1-gram files are lists of single words, and the other files give snippets of text consisting of two, three, four or five words. For each n-gram we learn the number of occurrences per year from 1550 through 2008 as well as the number of pages on which the n-gram is found and the number of books in which it appears.

Links to related stuff:

Links to less-related stuff:

•     •     •

If you want to play with this new toy, the place to begin is the Google n-gram viewer. At this site you submit a query to an online copy of the database and get back a graph showing normalized n-gram frequencies for a selected range of years. Here’s a search on the days of the week. (The graphs from the n-gram viewer are very wide, and bit-player is very skinny. I’ve squeezed the graphs horizontally; hover on them to unsqueeze. (See nifty effect in Safari, Chrome, maybe other Webkit browsers.))

days-of-week-1800-2008.png

The results are not surprising, I think, but they’re interesting. “Sunday” is an outlier. “Monday” was once the next-most-often-mentioned day, but around 1860 it was overtaken by Saturday, and now it has also fallen behind Friday. As for the middle of the week—nobody cares about those days.

Here is a collection of “theory” bigrams: “number theory”, “set theory”, “group theory”, “graph theory”, “string theory”, “chaos theory”, “catastrophe theory”, “K theory”. Which of them had the highest frequency in the 20th century? (I guessed wrong.)

theory-1900-2008.png

(By the way, the graph above has a strange lump in the first decade of the 20th century. I think it’s a metadata error. The same anomalous blip turns up for many other search terms, such as “transistor” and “Internet”. Somewhere along the way, a batch of books from 2005 were recorded as having been published in 1905. (Geoff Nunberg at Language Log has pointed out many other problems with Google Books metadata.))

Here are a few magazines I have written for.

magazines-1900-2008.png

I’m afraid there’s a dismal pattern in this data: Publications reach their peak and begin declining soon after I arrive. (But the most recent plunge at Scientific American is not my doing.)

The graph below offers some tech trends. It appears we have officially entered the age of the gigabyte, and terabytes are coming on strong:

bytes-1970-2008.png

Next is a list of rare (even nonexistent) words: “uroboros”, “widdershins”, “abacot”, “baragouin”, “funambulist”, “futhark”, “gongorism”, “hapax legomenon”, “hypnopompic”:

rare-words-1800-2008.png

It’s curious that so many of these archaic-looking terms seem to be increasing in frequency.

At the other end of the spectrum are some very common words:

common-words-1800-2008.png

Again there’s a mild surprise here: The ordering of these words in the Google Books corpus apparently differs from that of the list usually cited.

•     •     •

I think the n-gram browser is great fun, but it provides access to only one aspect of the data set: We can plot the normalized frequency of specific n-grams as a function of time. There are many other kinds of questions one might ask about all these words. For starters, I’d like to invert the query function and find all n-grams that have a given frequency. (An obvious project is to compile a list of the commonest words and phrases in the corpus.)

At an even more elementary level: What is the distribution of word lengths in English?

Here’s another question: In the formula “I [verb] you”, what are the commonest verbs? The information needed to answer this question is present in the 3-gram files, but the Google viewer does not provide a means to get at it.

With appropriate software, the n-grams could also be used for language synthesis—generating highly plausible generic gibberish through a Markov process.

Still another idea is to explore the history of spelling errors and typos over the centuries. Did the introduction of the qwerty keyboard lead to different kinds of errors? How about the later introduction of spell-checking software, and the concomitant decline of proofreading as a trade?

Furthermore, the n-gram files are full of numbers as well as words. Can we learn anything of economic interest by charting the prevalence of numbers that look like monetary amounts? (It’s easy to search for specific strings of digits, such as “$9.99″, but I would like to treat these values as numbers rather than sequences of digits, so that “0.99″, “.99″ and “0.99000″ would all be numerically equal.)

The way to carry out any of these projects is to download the full n-gram files and start writing software to explore them. I’ve taken my first step in that direction: I’ve bought a new disk drive to hold the data. Next I’m going to borrow a faster Internet connection to do the bulk of the downloading. After that, I must confess that I don’t have a lot of confidence about how best to proceed. The scale of these data sets is roughly a terabyte, which in this age of Big Data is no more than a warmup exercise. But it’s bigger than anything I’ve ever tried to grab hold of personally. I’ll be grateful for advice from those with more experience.

The n-gram lists come in fragments. The 1-grams, for example, are broken up into 10 files, each of which weighs about a gigabyte. Within each file the n-grams are listed alphabetically, but the distribution across files is random. (For example, file 0 has “gab”, “gaw” and “gay”, but “gag”, “gap” and “gas” are elsewhere.) Ordinarily, my first impulse would be to run a big merge-sort over these files, putting all the words in order. But maybe that’s exactly the wrong thing to do; keeping them scattered is a potential opportunity for multithread parallelism.

Even casual browsing through the raw n-gram files is quite a revelation. They really are raw. Word frequencies are the prototypical example of a Zipf distribution, which has notoriously long tails. It follows that when you choose an entry at random from one of the n-gram files, you are likely to be waaaaaaay out in the aberrant fringe, looking at symbol strings that you might or might not recognize as English words. Here are 20 lines selected at random from a 1-gram file:

1-GRAM          YEAR   COUNT   PAGES   BOOKS
lilywhites      1994       1       1       1
Carneri         2002      24      24      12
Thurh           1971       1       1       1
Elsee           1832       5       5       4
cFMte           2008      12      12      12
COFFEE          1876     288     270     167
APO3            1990       6       3       1
Odumbara        1963      15      15       6
connubialibus   1967       1       1       1
Pickerel        1900      65      54      34
fubje&ed        1757       6       6       5
nader           1971      34      29      19
fiyled          1782       1       1       1
existfed        1993       2       2       2
Soveit          2007       1       1       1
monongahela     1939       4       4       4
suffeiing       1851       6       6       6
brake           1774      12      10       7
ofBrasenose     1951       3       3       3
Horas           1798       3       3       2

This is a pretty strange stew. There are obscure words, several proper names and abbreviations, as well as quite a few misspellings and nonstandard capitalizations. But the oddities that stand out most sharply have another origin: They are errors of optical character recognition. A word that was correct in the original document has gotten all fubje&ed up in the course of scanning. Of the 20 items listed above, 5 appear to be marred by OCR errors, so this is not just a matter of minor contamination. My best guess is that the word recorded as “fubje&ed” appeared in the 1757 book as:

subjected.png

with an initial long “s” that was assimilated to an “f” and a “ct” ligature that the OCR program confused with an ampersand.

The files are rife with such problems. Another case that caught my eye was “quicro”. As a Scrabble player, I ought to know that 18-point word! It turns out to be an OCR error for the Spanish verb “quiero”. And why is there a Spanish word in an English lexicon? Well, when I ran a search at Google Books, the top hit for “quicro” was Robert Southey’s Commonplace Book, published in 1850, which is properly classified as an English work even though it includes many long passages of Spanish and Portuguese verse.

Should we worry about such distractions? The real “quiero” is roughly 200 times as frequent as “quicro”, so the OCR error will not have a major statistical impact. Perhaps the most disturbing effect of the OCR noise is that it greatly lengthens the already-long tail of the frequency distribution. Suppose a word appears 100,000 times in the corpus. If the OCR process is 99.9 percent accurate, 100 instances of the word will be read incorrectly; in the worst case, each erroneous reading could be different, adding 100 spurious entries to the lexicon.

Cleaning up this mess looks like a major undertaking. (If it were easy, Google would have done it already.)

The long tail of the distribution has already been truncated to some extent: No n-gram is included in the data set unless it appears at least 40 times in the corpus of texts. For many kinds of analysis, an even higher threshold might be appropriate—excluding all terms that fall below 400 or maybe even 4,000 occurrences. But that still won’t eliminate all the OCR errors: “quickfilver” appears 5,411 times.

•     •     •

Before I go, I want to tell one more story, which is a mystery story. It all began when I was trying out a few seasonal phrases:

christmas-1850-2008.png

Note the distinctive and dramatic dip in all three frequencies starting in the late 40s or early 50s and continuing into the 70s, with an eventual strong recovery in the 90s. The frequencies fall by roughly 50 percent, then return to the neighborhood of their earlier peak. What’s going on here? Did the Grinch steal Christmas, and then give it back? Was there a backlash against Jingle Bells in the era of disco dancing and Watergate?

As a check on these results I tried a few terms associated with other holidays, unrelated to the midwinter madness. The same pattern emerged:

holidays-1850-2008.png

As a further control, I added still more phrases, this time with no obvious connection to any holiday whatever. All that the terms have in common is that they fall into roughly the same range of frequencies:

mystery-dip-1850-2008.png

The slump in the 60s and 70s is still visible in this augmented set of words. The dip looks rather like one of those episodes of mass extinction in the fossil record. And it provokes the same question: What caused it? And what ended it?

Of course not all words and phrases follow this pattern. Because the curves represent normalized frequencies—the count of each n-gram’s occurrences divided by the total number of all n-gram occurrences—a valley in any one n-gram’s frequency must be balanced by a peak for some other word or words. This fact leads to a hypothesis about the cause of the Great Postwar Santa Depression. The 50s and 60s were a period in which technical and scientific publishing bloomed, thereby diluting the share of printed books that would be likely to mention phrases such as “Santa Claus” and “vacuum cleaner”; instead we got volumes full of “asymptotic freedom”, “chymotrypsin inhibitor” and “field-effect transistor”.

The trouble with this notion is that the explosive growth of the sci/tech vocabulary did not end in the 1980s or 90s; thus it’s hard to understand how Santa has made such a spectacular comeback.

Here’s one wild guess at an explanation. Most of the books that Google has scanned come from university libraries, and so I wonder if we might be observing an artifact of the acquisition and retention policies of university librarians. Suppose that libraries tend to buy a broad cross-section of newly published titles, but when shelf space gets scarce, the half-life of Quantum Information Theory is longer than that of Frosty the Snowman. As a result of this selective culling, Santa books from the 60s have melted away, but those from the 90s have not yet disappeared. Could such practices account for the dip and the recovery? If you have a better idea, please share.

CAPTCHA arbitrage

Tuesday, November 23rd, 2010

What a world we live in. It seems there are places on this planet that are wired well enough to support internet commerce, yet where people are poor enough that solving CAPTCHAs for 50 cents per thousand is an economically appealing proposition. That’s roughly three hours of work for half a buck—minus whatever it costs the worker for internet access.

I have learned this from a fascinating paper (PDF) by Stefan Savage and his colleagues at the University of California, San Diego. The Savage group studied the CAPTCHA-solving economy in a very direct way—by participating in it, both as customers and as workers.

reCAPTCHA example imageCAPTCHAs are meant to thwart computer programs that sign up for bogus email accounts in order to send spam, or that post spammy comments on forums or blogs like this one. Transcribing the distorted and obscured text is a task that’s supposed to be easier for people than for machines. The spammers’ first response to CAPTCHAs was to write programs that solve them algorithmically, but in this arms race the advantage is with the white hats. Deploying a new style of CAPTCHA is quick and cheap; developing a new solver for that CAPTCHA is slow and costly. And so the spammers have turned to human solvers.

A CAPTCHA that gets caught up in this illicit trade is likely to take quite a globe-girdling journey in a matter of seconds. The images are typically generated by a service such as reCAPTCHA (now owned by Google), and embedded in the sign-up form or comment form of a web site. The spammer’s software (such as GYC Automator) scrapes the image from the page and forwards it to a front-end system, which aggregates CAPTCHA-solving requests and collects payment from the spammers. The image is then passed on to a back-end operation, which marshals the services of individual solvers and distributes payments to them. The solver sees the image on a simple web form and types the transcription into a text box. If all goes well (from the spammer’s point of view), a correct solution comes back within the 30 seconds or so that most pages allow for solving the puzzle.

Some other findings of the Savage study:

  • The piece-work rate offered to solvers has fallen steadily, from $10 per thousand in 2007 to the present level of $0.50 to $0.75.
  • The price paid by spammers is higher, of course, and also variable. Typical current rates are in the range of $1 to $2 per thousand, but some services charge as much as $20.
  • Most of the services tested by Savage et al. were fast and accurate, solving 85 to 90 percent of the CAPTCHAs correctly, with a median response time of 14 seconds.
  • By gradually raising the rate of CAPTCHA submissions until responses slowed and additional work was refused, Savage et al. estimated the size of the workforce. The largest outfits seemed to have at least 400 to 500 workers online at once.
  • In an attempt to learn where the solvers live, Savage et al. sent out specially fabricated CAPTCHAs with images of words in various languages. They reasoned that accuracy would be highest in the worker’s native language. If this hypothesis is correct, many of the CAPTCHA solvers are fluent in Chinese, Russian or Hindi. But one organization showed exceptional linguistic versatility, even solving challenges in Klingon.
  • By offering their services as solvers, the Savage group were able to gather some statistics on what kinds of CAPTCHAs are flowing through the dark side of the internet. Microsoft CAPTCHAs (used on the Hotmail sign-up form) were the most popular. Others seen frequently included reCAPTCHA images and products of several Russian-language services.

 

The paper includes a discussion of the ethics and legality of this project. Is it acceptable, even for research purposes, to abet the unsavory activities of spammers? The Savage group concluded that acting as buyers of the service caused little harm because the purchased solutions were never used to register fraudulent accounts or post messages. But working as solvers was more troubling, because the solutions they provided would indeed be used to further the aims of spammers.

To sidestep this concern, we chose not to solve these CAPTCHAs ourselves. Instead, for each CAPTCHA one of our worker agents was asked to solve, we proxied the image back into the same service via the associated retail interface. Since each CAPTCHA is then solved by the same set of solvers who would have solved it anyway, we argue that our activities do not impact the gross outcome.

It’s an ingenious dodge, even if it doesn’t put one’s mind totally at ease about the ethical question. (Suppose we were studying murder-for-hire instead of CAPTCHAs-for-hire—would the same reasoning be acceptable?) Ethics aside, however, the tactic of recirculating work requests back into the same system raises other curious issues.

For one thing, it suggests a way of measuring the size of the CAPTCHA-solving enterprise. We could run a capture-recapture experiment (or should I say CAPTCHA-reCAPTCHA?). Suppose we never solve a CAPTCHA ourselves, but we make a record of each image as it arrives, before we dump it back into the work stream. From the fraction of recirculated CAPTCHAs that come back to us at least once more we could estimate the total size of the work flow. Of course this assumes that the stream of CAPTCHAs is well-mixed. It also assumes there is no one else out there recirculating CAPTCHAs.

This last caveat leads to an interesting economic question. As noted above, retail prices for CAPTCHA-solving vary over a wide range, from about $1 per thousand to $20 per thousand. This price spread, and the fact that it’s technically feasible to route a CAPTCHA through the system more than once, suggests a major arbitrage opportunity. We can set up a high-price CAPTCHA service and farm out all the actual work to low-price competitors. In a free economy—and what economy could be freer of regulation than a criminal one?—that situation is not supposed to endure.

•     •     •

While I’m on the subject of spam, I’ll take the opportunity to update my own running tally.

 

personal spam receipts Jan 2007 through oct 2010

Activity in my inbox has been unexciting since my last report. Spam volume is still well below the peaks of mid-2009, but on the other hand there’s not much support for the fond notion that spam is on the verge of extinction. News reports have suggested that the shutdown of a Russian web site called SpamIt.com, allegedly run by Igor Gusev, caused a sharp dropoff in the global spam rate in September. And indeed my September intake was the smallest since June of 2007. But the chronology isn’t quite right. SpamIt was closed on September 27, so that event should have depressed the October numbers more than those for September. My spam receipts rebounded in October. About 40 percent of the October spam messages use a Russian-language encoding.

Update 2010-11-28: Peter G. Neumann’s RISKs list has an item about criminal charges against the operators of Wiseguy Tickets, whose business model involved solving CAPTCHAs in bulk. By circumventing the CAPTCHAs, they supposedly jumped to the head of the line at Ticketmaster and scooped up 11,984 Hannah Montana concert tickets.

The RISKs item cites a Wired article by Kim Zetter, which seems to be based mainly on the indictment (PDF) filed in the U.S. District Court of New Jersey. Here’s how the Wiseguys did it, according to the indictment:

11. It was further part of the conspiracy that to enable the CAPTCHA Bots to purchase tickets automatically, Wiseguys:

a. Downloaded hundreds of thousands of possible CAPTCHA Challenges from reCAPTCHA. To obtain these CAPTCHA Challenges anonymously, Wiseguys wrote a computer script that disguised the origin of the download requests by impersonating would-be users of Facebook, which also subscribed to reCAPTCHA.

b. Created an “Answer Database” by having its employees and agents read tens of thousands of CAPTCHA Challenges (or listen to audio CAPTCHA Challenges) and enter the answers into a database of File IDs and corresponding answers.

12. It was further part of the conspiracy that, during the ticket-buying process, instead of “reading” a CAPTCHA Challenge, the CAPTCHA Bots identified the CAPTCHA Challenge’s File ID. The CAPTCHA Bots then instantly compared the CAPTCHA Challenge’s File ID against the Answer Database, looking for a matching File ID. If the CAPTCHA Bots found a matching File ID, it immediately and automatically transmitted the pre-typed answer to that CAPTCHA Challenge to the Online Ticket Vendors’ website. This process took place in a fraction of a second, much faster than a human user could respond to a typical CAPTCHA Challenge.

I’m having trouble making sense of this. The scheme would work only if reCAPTCHA is repeatedly sending out the same image, linked to the same file ID, and reusing images often enough that if you save a few hundred thousand of them, you’ll have a good chance of finding any new challenge already present in your database. Surely it’s not that easy?

It’s true that reCAPTCHA (unlike other CAPTCHA services) must gather multiple solutions for some images. That’s because of their aim of using the labor of solvers to proofread scanned texts. Each reCAPTCHA includes two words. The solution for the “control word” is known in advance and is used to authenticate the solver; the solution offered by the solver for the “unknown word” becomes a candidate reading of that word in the OCR process. Multiple solvers must agree on the same reading of the unknown word before the transcription is accepted. Thus each unknown word is presented more than once. But is it always paired with the same control word? And with the same file ID?

The reCAPTCHA folks are pretty savvy about such weaknesses. A list of guidelines on their web site includes this paragraph:

Script Security. Building a secure CAPTCHA is not easy. In addition to making the images unreadable by computers, the system should ensure that there are no easy ways around it at the script level. Common examples of insecurities in this respect include: (1) Systems that pass the answer to the CAPTCHA in plain text as part of the web form. (2) Systems where a solution to the same CAPTCHA can be used multiple times (this makes the CAPTCHA vulnerable to so-called “replay attacks”).

So if the reCAPTCHA programmers are taking their own advice, the Wiseguys should have been out of luck. And yet we have the evidence of those 11,984 Hannah Montana tickets. What’s the story?

Whack-a-Rectangle

Saturday, November 13th, 2010

It’s been almost a year since Bill Gasarch gave us the problem of four-coloring the nodes of a 17 × 17 grid in such a way that no rectangle has all four corners the same color. (See my earlier commentary here and here.)

The problem remains open, the prize of $172 unclaimed.

A few weeks ago I had a momentary fantasy that I might have come upon the answer. I was browsing in a strange New England emporium called Building 19, full of merchandise found nowhere else (thankfully). From across the room I spotted a sofa cushion that looked like it might well be a rectangle-free four-colored grid.

A rectangle-free coloring on a sofa cushion at Building 19?

Sadly, a closer look showed that the cushion is neither rectangle-free nor four-colored. Nor 17 × 17 for that matter. And the price tag reads $399.90, so if Bill wants to receive the solution in the form of tacky furniture, he’s going to have to up his offer.

But let us set aside these disappointments. There’s happier news from elsewhere. Martin Schweitzer has turned the square-free-rectangle problem into a game! He explains that he was exploring the new canvas element of HTML5, and thought the rectangle-free problem would make a nice Javascript programming project. The result is entertaining and may even lead to better intuition about the structure of the problem.

On the other hand, the hardest level (”Ninja”) in Schweitzer’s game is only a 12 × 12 array, so don’t think you’re going to click away at little colored squares and earn yourself $289.

The program requires a canvas-capable browser, such a recent version of Chrome, Firefox, Opera or Safari.

In the zone

Tuesday, August 24th, 2010

Mt. Shasta, from a nearby hilltop owned by Coca Cola

Before leaving on a trip to the West Coast, I copied my return flight information onto my Google calendar: SFO to BOS, 12:50 p.m. to 9:30 p.m. Now that I’m nearing the end of my visit to the Mythical State of Jefferson (see local landmark above), I’ve just checked the departure details by calling up the calendar on my cell phone. It tells me the flight departs at 9:50 a.m. and arrives at 6:30 p.m.

It could be worse, of course. As an eastbound traveler all that I risk is wasting three hours at the airport. If I were westbound, I might well miss my flight.

The developers of Google calendar would doubtless argue that automatic time-zone conversion is a feature, not a bug. If the event in question had been a conference call scheduled for 12:50 p.m. Eastern Daylight Time, then 9:50 a.m. Pacific Daylight Time would indeed be the moment to dial in. Or, if I had a series of pills to be taken at fixed intervals, it might be helpful to have the program remind me at the correct times as I wander across continents. But in the case of today’s flight home, the result is just plain wrong.

It’s not only a Google calendar problem. A few weeks ago a friend was entering a schedule of talks into a Drupal web page for a conference that begins November 7, 2010. All the times were mysteriously shifted by an hour. A 1:00 p.m. talk on the input form became a 2:00 p.m. talk on the displayed web page. The key to solving this mystery is knowing that November 7 is the date daylight saving time ends in (most of) the U.S.

I am certainly not the first to encounter such problems. Peter Neumann’s RISKS Digest reports hundreds of computational mishaps involving time zones or daylight saving time, going back over the past 25 years. The issue is known to Google. But what is the right fix?

Some other calendar software offers the option of specifying a time zone for an event. To handle the airline case correctly, the program needs to allow for different zones for the start and the end times. And the Drupal problem suggests we may also need some means of indicating whether or not to adjust for daylight saving time. It gets very messy. Note that an event scheduled for 1:30 a.m. on November 7, 2010, will happen twice. An event at 2:30 a.m. on March 13, 2011, will never take place.

For years my own makeshift solution to these complexities was to live in a single time zone, no matter where I was. I carried a laptop, but I never changed its time-zone setting, even when I was away from home for weeks or months. When conversion was needed, it happened in my head. Even now the Google calendar display on my laptop gives the correct departure time from SFO, because the laptop doesn’t know that it ever left Boston. But in our new world of location-aware devices, pretending to stay home is no longer an option.

I begin to wonder if the whole railroad-age concept of time zones hasn’t outlived its usefulness. But I haven’t time just now to consider the alternatives. Right now it’s time to leave for the airport. I think.

The ormat game

Monday, August 16th, 2010

Here’s the deal. I’m going to give you a square grid, with some of the cells colored and others possibly left blank. We’ll call this a template. Perhaps the grid will be one of these 3×3 templates:

colored 3x3 ormat grids

You have a supply of transparent plastic overlays that match the grid in size and shape and that also bear patterns of black dots:

dot patterns for the six 3x3 permutation matrices

Note that each of these patterns has exactly three dots, with one dot in each row and each column. The six overlays shown are the only 3×3 grids that have this property.

Your task is to assemble a subset of the overlays and lay them on the template in such a way that dots cover all the colored squares but none of the blank squares. You are welcome to superimpose multiple dots on any colored square, but overall you want to use as few overlays as possible. To make things interesting, I’ll suggest a wager. I’ll pay you $3 for a correct covering of a 3×3 template, but you have to pay me $1 for each overlay you use. Is this a good bet?

Before going further, I should mention that not every conceivable template can be covered under these rules. To take an obvious example, no 3×3 template with fewer than three colored squares can possibly be covered by any combination of the six overlays. But I promise to submit only templates that can be covered by some combination of the given dot patterns; if I err about this, I forfeit the bet.

How does the game play out? If I give you the template marked “1″ above, you can easily win; just choose permutations a and b, which together cover all the colored squares and no others. You pay $2 and get $3. Template 2, with all nine squares colored, looks like it might be the toughest challenge. Clearly, it cannot be covered with fewer than three overlays, since we need a total of nine dots; and it turns out that exactly three overlays are required. Indeed, there are two ways of covering the template with three overlays: a + d + e and b + c + f. Thus this template is a breakeven proposition: You earn $3 and pay $3.

Now we come to template 3, which has eight colored squares and one blank. Surely if you can cover the full nine squares with just three overlays, then you should also be able to cover eight squares—no? I invite you to try it. In fact the only covering that works requires four overlays: b + d + e + f. Thus you shouldn’t take my bet, since I can always give you a template with just one blank, and you’ll have a net loss of $1.

Some background. I’ll return to the gaming table momentarily, but first let me explain what this is all about and where it came from. A few weeks ago, I was writing about “ranges of rankings,” which led me into the topic of permutation matrices. To recapitulate:

  • A permutation matrix is a square matrix with a single 1 in each column and each row, and all the rest of the elements 0.
  • An ormat is a superposition of permutation matrices, formed by applying the Boolean OR function to corresponding elements of the permutation matrices. For example: matrix-or-sum.png
  • Not all square matrices with (0,1) entries can be formed by OR-ing permutation matrices, but there’s an efficient algorithm for deciding whether or not a given matrix is an ormat. (I thank some helpful commenters for enlightening me on this point.)
  • Given an ormat, the total number of distinct permutation paths that can be threaded through the 1 entries of the matrix is equal to the permanent of the matrix. Calculating the permanent is known to be a hard computational problem.

In a comment, Barry Cipra posed the following query:

The permanent tells us the maximum number of different permutations that can be OR-summed to produce a given ormat, but what is the corresponding minimum number? Also, in how many different ways can the minimum be achieved?

The connection between ormats and my little game is probably apparent by now. The template of colored and blank squares is an ormat; the dotted overlays represent permutation matrices; to maximize your payoff in the game (or to minimize your loss), you need to answer Barry’s first question, finding the minimum number of permutations that can be combined to yield the given ormat.

For 3×3 matrices, we can solve this problem by exhaustive search, calculating the OR-sums of all possible combinations of the six 3×3 permutation matrices taken 1, 2, 3, …, 6 at a time. I did this with pencil and paper on a recent airplane trip. Here is a summary of the results:

number of ormats generated by various combinations of permutation matrices

Some of the numbers on this card are easy to explain. The six ormats with just three 1 entries are the permutation matrices themselves. There are six of them because there are 3! = 6 permutations of three things. There are no ormats with four 1 entries for a reason that bears thinking about: There can be no permutations that differ from one another in just one element. When you superimpose any of the six overlays shown above, you can wind up with three, five or six dots, but never four.

At the other end of the scale, it’s no surprise that there’s exactly one ormat with nine 1 entries, and that it takes three permutations to produce it. And then there are the nine ormats with eight 1 entries, which each require four permutations to be OR-ed. These are the single-blank patterns like template 3 above.

Based on these results, I began speculating about what I would see in a tabulation of all 4×4 ormats.

guesses about stats for 4x4 ormats

There would have to be 4! = 24 patterns with four 1 entries, and just one pattern with all 1s, generated by OR-ing four permutations. And there should be 16 ormats that require five permutations, namely the 16 matrices with a single 0 element. This last prediction seemed a little less self-evident than the others.

Pocket change and Cheerios. My thoughts about the single-zero (or single-blank) case went something like this. To cover 15 squares with sets of four dots each, we need at least four sets, or else we simply won’t have enough dots. So a useful starting point is one of the optimal arrangements that cover all 16 squares without gaps or overlaps. By this time I had grown tired of drawing zillions of dots, and so I started working with sets of coins.

initial configuration of four permutations of coins

In this arrangement each coin denomination forms a permutation, with no two pennies, nickels, dimes or quarters in the same row or the same column. We have successfully covered all the colored squares, but unfortunately we’ve also covered the blank at the lower right. Thus this pattern of coins is not an acceptable solution, but maybe we can fix it up somehow?

adjusted configuration after one coin is moved

Moving the penny from the blank square to another square in the same column solves one problem but creates another: Now the arrangement of pennies is no longer a permutation. There are two pennies in the third row.

coins after second adjustment to restore permutation

So now we have to shift another penny to restore the one-per-row-and-column property. Inevitably, this leaves a colored square uncovered. The only way we can cover that exposed square is to introduce a fifth permutation. Since I had run out of coin denominations, I chose a popular brand of breakfast toroids. Voila:

coins and cheerios -- the five-permutation solution

There’s nothing special about the particular moves I chose in this sequence. If you try some alternatives, you should be able to persuade yourself that moving the penny that covers the blank to any other square in the fourth column (or in the fourth row) would lead to essentially the same situation. Likewise the game would come out the same if the single blank square were placed anywhere else in the grid. And you could also start with a different set of initial permutations (provided they cover all the squares).

This coin-shuffling exercise demonstrates that we can cover any 4×4 template that has a single blank by combining no more than five permutations, but how do we know that five are actually needed? Maybe there’s some totally different arrangement that would do the job with just four permutations? Well, think about what such an arrangement would look like. It would have to differ at exactly one position from some other layout of permutations that covers the full 16-square grid. But no two permutations can differ at one and only one place. Thus the reason there can be no four-permutation cover of 15 squares is essentially the same as the reason no 4×4 ormat pattern can cover just five squares.

This argument generalizes to k×k matrices: For any integer k, there must be at least k ormat patterns that cannot be covered with fewer than k+1 permutations. But then comes the bigger speculative leap: Perhaps k+1 is an upper bound. Perhaps part of the answer to Barry’s question is that no k×k ormat pattern requires more than k+1 permutations. At one point I even had a “proof” of this conjecture. Then I wrote a program to check it, doing much the same thing I did with the dots on the airplane.

Out of bounds. My program found the expected 16 ormat patterns that require five permutations—and it found many more as well. In all it identified 2,032 4×4 ormats that can’t be composed from fewer than five permutations. And then came a bigger surprise: The program also found 480 patterns that require six permutations. So much for my proposed upper bound.

One of those problematic 480 ormats takes this form:

((1 1 1 1) (1 1 1 1) (1 1 1 1) (1 1 0 0))

Looking over this pattern, I thought I understood where my earlier reasoning had gone awry. This matrix is just like the single-zero pattern, but with two zeros! (I do mean for that statement to make sense. Bear with me.) Suppose we start again with a set of four permutations that completely cover the grid, including the two blanks.

starting configuration of 16 coins on 4x4 template with two blanks

Then we can uncover each blank just as we did in the coin-shuffling procedure above, although we have to be careful the two sets of movements don’t interfere with each other. (Not much point in removing the penny from a blank square, then putting the nickel there.) Here is a strategy for clearing both blank squares while maintaining the one-per-column-and-row permutation property:

four coins and two blanks: first solution

Inevitably, when we uncover the two blank squares, we also remove coins from two colored squares, which now have to be filled in again. The key point is that no single permutation can repair that damage, because the two open colored squares are in the same row. To cover both of those squares we need two additional permutations.

Other ways of reshuffling the coins avoid putting the two open squares in the same row or column, but they still foil all attempts to complete the covering with just five permutations. Try adding four Cheerios to the diagram below. If you cover both of the open blue squares, then either you also cover one of the blank squares or you wind up with two Cheerios in the same row.

four coins and two blanks: second solution

So now it’s clear we need as many as six permutations to cover a 4×4 ormat. Does that suggest that the general upper bound might be k+2 rather than k+1? Or perhaps the appropriate formula is 2k–2? In support of this latter possibility I offer these two ormats, which require 8 and 10 permutations respectively:

2k-2ormats.png

Another wager. Having fooled myself several times about the upper bound on minimal ormat coverings, I feel I should build in a little margin for error before I invite you to make a further wager. We already have direct evidence that covering a k×k ormat can take as many as 2k–2 permutations. So I’ll be generous and offer a full $2k for a proper covering, while charging $1 per permutation. If k=3 or k=4, you can definitely make money on this deal. But is it a good bet for larger k? (Hint: I’d be willing to play the game on these terms for real money.)

•     •     •

Update 2010-08-19: No takers for my bet, eh? Too bad; I had already spent my winnings.

Barry Cipra, who raised the question about minimal ormat covers in the first place, sends this illuminating letter:

I’m going to tiptoe a short ways out on a long long limb and conjecture (really just guess) that the “worst case” behavior, in terms of the minimum number of permutations it takes to produce a given ormat, occurs for ormats of the following form, shown here for k=7:

((1 1 1 1 1 1 1) (1 1 1 1 1 1 1) (0 1 1 1 1 1 1) (0 0 1 1 1 1 1) (0 0 0 1 1 1 1) (0 0 0 0 1 1 1) (0 0 0 0 0 1 1))

So as not to abuse existing matrix terminology, I’ll call any (square) matrix of this type—i.e., whose entries below the main subdiagonal are all 0—”uppity triangular.” I can (and will!) show that this uppity triangular ormat for k=7 requires (at least) 16 permutations—and the number appears to grow concavely upwards from that, so I, for one, will definitely not take you up on your $2k wager.

The trick, I realized, is to view each ormat as the “shadow” of what I’ll call an “addmat.” If you let P1, P2, …, Pr be k×k permutation matrices, their addmat is simply the ordinary result of addition: S = P1 + P2 + … + Pr, whose entries are positive integers wherever one or more of the constituent permutations has a 1 and otherwise 0. The associated ormat is obtained by changing each of these entries to a 1, while leaving the 0’s alone. In this sense, the ormat’s 1’s are the “shadows” of the addmat’s positive entries.

What’s crucial is that addmats have a lovely little property not shared with their shadows: the row and columns sums of the entries of an addmat all equal the number of permutations that produce them, r.

Come now, let us reason together…. The uppity triangular ormat example above (for k=7) must come from an addmat of the form

{{

where a, b, and all the *’s are positive integers. In particular, each * is at least 1. Since all row and column sums must be equal, the sum a+b must equal the sum of b and all 6 *’s above it. Hence a is at least 6. Likewise a+b must equal the sum of a and all 6 *’s above it, so b is also at least 6. Hence a+b is at least 12, which means the OR-sum that produced the given ormat involves at least 12 permutations.

This clearly generalizes to arbitrary k, which is more than “direct evidence” that covering a k×k ormat can take as many as 2k–2 permutations, it’s rigorous proof! But we can immediately do better, at least on a case-by-case basis. If we try to get by with just 12 permutations for this uppity triangular ormat, we quickly run into trouble. We obviously must have a = b = 6, and it follows that all the *’s above them are 1’s (to make those column sums 12). That is, we have the addmat

{{

where I now wish to call your attention to the entry labeled “@”. To make its row-sum equal 12, we need @ = 10. But that means its column sum (with the 5 *’s above it) is at least 15, which cannot be! So we are forced to try larger values of a and/or b—which is to say, we need more permutation matrices to produce this addmat.

It turns out you can’t satisfy the row and column sum condition until you get to a = b = 8. I won’t take you through all the steps, but just give you a taste with the penultimate possibility, a = 7, b = 8. The best you can hope for in this case is

{{

Note that I put as much of the “weight” in the last two columns as close to the 7 and 8 as possible, so that I could use the smallest possible value (10) as the entry with 5 positive entries above it. This makes the last three rows, and the right three columns all have the same sum, 15, but now we see a problem in the 12’s column: Its sum is at least 16. So once again, we’re screwed. It’s only with the next attempt that we avoid contradiction:

{{8, 3, 1, 1, 1, 1, 1}, {8, 3, 1, 1, 1, 1, 1}, {0, 10, 2, 1, 1, 1,<br />
  1}, {0, 0, 12, 1, 1, 1, 1}, {0, 0, 0, 12, 2, 1, 1}, {0, 0, 0, 0, 10, 3, 3}, {0, 0, 0, 0, 0, 8, 8}}

This matrix finally has all its row and column sums equal. Please note, this may or may not be an actual addmat of a set of 16 permutation matrices—I suspect it probably is, but I haven’t bothered to check. All we know is that it satisfies a necessary condition of being an addmat, namely that its row and column sums are all equal. (It’d be nice if that were also a sufficient condition, but something tells me it isn’t.)

This example, which can clearly be played out for larger values of k, suggests that not only are you safe with a $2k wager, but with a $(2k+2) wager and higher I’ve played around with this a bit, and persuaded myself that the number of permutation matrices will go to 2k + a lot—for k=10, if I did things correctly, you need 24 permutations (or possibly more, if the uppity triangular matrix the analysis leads to is not an actual addmat). I am entirely convinced that some additional careful thought can streamline the analysis into a nice, slick proof. I’m just not sure I haven’t already made a mistake, and built an elaborate house of cards….

Does any of this jibe with what you’ve already found to be the case?

It does indeed jibe.

First of all, to answer a small question Barry left open, here is a set of 16 permutations that will successfully cover his 7×7 “uppity triangular” matrix:

{1,2,3,4,5,6,7} {2,1,4,3,6,7,5} {1,3,2,5,4,7,6} {1,3,4,2,6,5,7}
{2,3,1,5,6,4,7} {1,2,3,5,6,7,4} {1,2,4,5,3,6,7} {1,2,4,5,6,3,7}
{1,2,4,5,6,7,3} {1,3,4,5,2,6,7} {1,3,4,5,6,2,7} {1,3,4,5,6,7,2}
{2,3,4,1,5,6,7} {2,3,4,5,1,6,7} {2,3,4,5,6,1,7} {2,3,4,5,6,7,1}

This was found with a simple greedy search.

My own attempts to find an upper bound have focused not on uppity triangular matrices but on matrices I’ve been calling “flags,” like this 7×7 case:

{{0, 0, 0, 1, 1, 1, 1}, {0, 0, 0, 1, 1, 1, 1}, {0, 0, 0, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}, {1, 1, 1, 1, 1, 1, 1}}

This matrix also requires 16 permutations for a proper covering. To see why, try threading permutations through the columns of the matrix, starting at the left edge and in each column choosing a 1 element (never a 0) from a different row. Because of the block of zeros at the upper left, the first three elements of every permutation must lie in rows 4 through 7. Thus each permutation “uses up” three of the last four rows in the first three columns, and the rest of the permutation can revisit this range of rows only once. It follows that each permutation can touch only one element in the 4×4 block of 1s in the lower right corner of the matrix, and at least 16 permutations are needed to cover all the 1s in the matrix. Showing that 16 are sufficient is not hard.

This kind of analysis works for any odd k, and thus we know that such matrices can require as many as

{\biggl\lceil\frac{k}{2}\biggr\rceil}^2

permutations. (For even k the situation is a little less symmetrical, and I haven’t worked out the exact details.)

These results give us a lower bound on the upper bound on the number of permutations that may be needed to cover a k×k ormat. But we haven’t proved it’s the true upper bound. Are there other ormats that require even more permutations? My guess is no, but keep in mind that almost all my conjectures along these lines have turned out to be wrong.

The state of the spamosphere

Sunday, August 8th, 2010

BP has finally stoppered the Macondo well with a plug of mud and cement, but the gusher of spam continues to pollute inboxes everywhere. Maybe we need a relief well?

spam statistics Aug 2008 through July 2010

It’s been six months since my last spam update. The good news, I suppose, is that last summer’s huge spurt of spam has subsided. But I’m still getting 2,000 inanities a month. That’s six or seven times the rate I was seeing when I first started monitoring my spam intake in 2003. For the two years covered by the graph above, the cumulative total is 111,945 unwanted emails received.

Needless to say, most of those messages are utterly unremarkable. But as I reviewed the most recent batch of dreck, one series of emails caught my eye. In the past 24 hours I’ve received five copies of a spam with the subject line: “Solve this if you could…!!!!”:

For all Math’s champs, Accounting Experts & Number Numbssss… (including future champs too)

Find the 6th Number

1, 2, 6, 42, 1806, ____?

6th number is the password of the attachment.

The attachment mentioned in the last line is a zip archive that unpacks to reveal an .exe file, which I would not run even if I could. But I confess that I did stop to solve the little puzzle sequence. It’s not difficult, although it is a little harder than any of the series I use in the spambot filter on the bit-player comment form. Maybe I should add it to my repertory.

Four questions about fuzzy rankings

Saturday, July 24th, 2010

The National Research Council is getting ready to release a new assessment of graduate-education programs in the U.S. The previous study, published in 1995, gave each Ph.D.-granting department a numerical score between 0 and 5, then listed all the programs in each discipline in rank order. For example, here’s the top-10 list for doctoral programs in mathematics (as presented by H. J. Newton of Texas A&M University):

 rank    school                 score   
    1    Princeton               4.94
    2    Cal Berkeley            4.94
    3    MIT                     4.92
    4    Harvard                 4.90
    5    Chicago                 4.69
    6    Stanford                4.68
    7    Yale                    4.55
    8    NYU                     4.49
    9    Michigan                4.23
   10    Columbia                4.23

Note that the scores of the first two schools are identical (to two decimal places), and the first four scores differ by less than 1 percent. Given the uncertainties in the data, it seems reasonable to suppose that the ranking could have turned out differently. If the whole survey had been repeated, the first few schools might have appeared in a different order. Doctoral candidates in mathematics are presumably sophisticated enough to understand this point. Nevertheless, the spot at top of the list still carries undeniable prestige, even when you know that the distinction could be merely an artifact of statistical noise.

The committee appointed by the NRC to conduct the new graduate-school study wants to avoid this “spurious precision problem.” They’ve adopted some jazzy statistical methods—mainly a technique called resampling—to model the uncertainty in the data, and they’ve also decreed that the results will be presented differently. There will be no sorted master list showing overall ranks in descending order. Instead the programs in each discipline will be listed alphabetically, and each program will be given a range of possible ranks. For example, a program might be estimated to rank between fifth place and ninth place. Let’s call such a range of ranks a rank-interval, and denote it {5, 6, 7, 8, 9} or {5–9}.

For a hypothetical set of 10 institutions, A through J, here’s what a set of rank-intervals might look like.

bar graph showing ranges of rankings for schools A through J.png

Acknowledging the uncertainty in your findings is commendable. But let’s be realistic. If you actually want to make use of these results—for example, if you’re a student choosing a grad-school program—the first thing you’re going to do is sort those bars into some sort of rank order, trying to figure out which school is best and how they all stack up against one another. In other words, you’re going to undo all the elaborate efforts the NRC committee has put into obscuring that information.

Below is one possible ordering of the bars. I have sorted first on the top of the rank-intervals, then, if two columns have the same top rank, I’ve sorted on the bottom rank. Other sorting rules give similar but not identical results. For example, sorting on the midpoints of the intervals would interchange columns B and F.

bar graphs showing rank-ranges sorted into one canonical order.png

Question 1. Does sorting a set of rank-intervals by one of these simple rules yield a consistent and meaningful total ordering of the data? To put it another way, can you trust this attempt to reconstruct a ranking?

I hasten to add that this is not really a practical question about finding the best grad school. If you’re facing such a choice in real life, the NRC rank-intervals are not the only available source of information. But, for the sake of the mathematical puzzle, let’s pretend that all we know about schools A through J is embodied in those ranges of rankings.

It turns out that rank-intervals have some fairly peculiar behavior. Ranges of ratings are not a problem. If the NRC merely gave each school a fuzzy rating on the 0-to-5 scale, no one would have much trouble interpreting the results. But when you turn fuzzy ratings into fuzzy rankings, there are hidden constraints. For example, not all sets of rank-intervals are well-formed.

two impossible sets of rank-ranges

The set at left is impossible because there’s no one in last place. (We can’t all be above average.) The example at right is also nonsensical because D has no ranking at all. For a set of rank-intervals to be valid, there has to be at least one entry in each row and each column.

That’s a necessary condition, but not a sufficient one, as the two graphs below illustrate.

two more impossible rank-intervals

Do you see the problem with the example at left? Column B has a rank-interval of {1–2}, but in fact B can never rank first because A has no alternative to being first. The case at right is conceptually similar but a little subtler: If B is ranked third, then either first place or second place will have to remain vacant.

The underlying issue here is the presence of constraints or linkages within a set of rankings. Suppose you have calculated ratings and rankings of several schools, and then some new information turns up about one school. You can change the rating of that school without any need to adjust other ratings, but not so the ranking. If a school goes from third place to fourth place, the old fourth-place school has to move to some other rung of the ladder, and somebody has to fill the vacancy in third place. These interdependencies are obvious in a non-fuzzy ranking, but they also exist in the fuzzy case. You can’t just assign arbitrary rank-intervals to the items in a set and assume they’ll all fit together. This observation leads to a second question:

Question 2. What are the admissible sets of rank-intervals? How do we characterize them?

I have a partial answer to this question. It goes like this. Any ranking of k things must be a permutation of the integers from 1 through k. A permutation can be embodied in a permutation matrix—a square k × k matrix in which every row has a single 1, every column has a single 1, and all the other entries are 0. For example, here are the six possible 3 × 3 permutation matrices:

3x3-permutation-matrices.png

They correspond to the rankings (1, 2, 3), (1, 3, 2), (2, 1, 3), (3, 1, 2), (2, 3, 1) and (3, 2, 1).

Since a permutation matrix represents a specific (non-fuzzy) ranking, we can build up a set of rank-intervals by taking the OR-sum of two or more permutation matrices. What do I mean by an OR-sum? It’s just the element-by-element sum of the matrices using the boolean OR operator, ∨, instead of ordinary addition. OR has the following addition table:

                      0 ∨ 0 = 0
                      0 ∨ 1 = 1
                      1 ∨ 0 = 1
                      1 ∨ 1 = 1

For the first two 3 × 3 matrices shown above the arithmetic sum is:

matrix-addition.png

whereas the OR-sum looks like this:

matrix-or-sum.png

Every valid set of rank-intervals must correspond to an OR-sum of permutation matrices, simply because a set of rank-intervals is in fact a collection of permutations. The converse also holds: Any OR-sum of permutation matrices yields an admissible set of rank-intervals. Thus the OR-sums of permutation matrices—let’s call them ormats for brevity—are in one-to-one correspondence with the admissible sets of rank-intervals. (There’s just one catch when applying this idea to the NRC study. The columns of an ormat may well have “gaps,” as in the column pattern (0 1 1 0 0 1 1), which corresponds to the rank-interval {2–3, 6–7}. Will the NRC allow such discontinuous ranges in their grad-school assessments? Perhaps the issue will never come up in practice. In any case, I’m ignoring it here.)

Arithmetic sums of permutation matrices form an open-ended, infinite series; in contrast, there are only finitely many distinguishable OR-sums. The reason is easy to see: Ormats have k2 entries, each of which can take on only two possible values, and so there can’t be more than \(2^{k^{2}}\) distinct matrices. Because of the various constraints on the arrangement of the entries, the actual number of ormats is smaller. For example, at k = 3 the \(2^{k^{2}}\) upper bound allows for 512 ormats, but there are only 49:

the-49-3-by-3-or-sums.png

Thus we come to the next question.

Question 3. For each k ≥ 1, how many distinct ormats can we build by OR-ing subsets of k × k permutation matrices? Is there a closed-form expression for this number?

I have answers only for puny values of k.

   k       upper bound        # of ormats  
   1                 1                  1
   2                16                  3
   3               512                 49
   4            65,536              7,443
   5        33,554,432          6,092,721
   6    68,719,476,736                  ?

The tallies of ormats were calculated by direct enumeration, which is not a promising approach for larger k. (I note—to spare folks the bother of looking—that the sequence 1, 3, 49, 7443, 6092721 does not yet appear in the OEIS.)

To extend this series, we might try to exploit the internal structure and symmetries of the ormats. By sorting the columns and rows of the matrices, we can reduce the 49 3×3 ormats to just six equivalence classes, with the following exemplars:

exemplars of six ormat equivalence classes

Enumerating just these reduced sets of matrices should make it possible to reach larger values of k, but I have not pursued this idea. (Furthermore, the two-dimensional sorting of matrices looks to be a curiously challenging task in itself.)

By the way, I think the number of ormats will approach the \(2^{k^{2}}\) upper bound asymptotically as k increases. Many of the features that disqualify a matrix from ormathood—such as all-zero rows or columns—become rarer when k is large. I have tested this conjecture by generating random (0,1) matrices and then counting how many of them turn out to be ormats.

fraction-of-ormats.png

For k = 1 through 5 the results are in close agreement with the actual counts of ormats, and up to k = 10 the trend is clearly upward. But continuing this inquiry to larger values of k will depend on a positive answer to the next question.

Question 4. Given a square matrix with (0,1) entries, is there an efficient algorithm for deciding whether or not it is an OR-sum of permutation matrices, and thus an admissible set of rank-intervals?

The question asks for a recognition predicate—a procedure that will return true if a matrix is an ormat and otherwise false. If efficiency doesn’t matter, there’s no question such an algorithm exists. At worst, we can generate all the k × k ormats and see if a given matrix is among them. But that’s like saying we can factor integers by producing a complete multiplication table. It just won’t do in practice. Isn’t there a quick and easy shortcut, some distinctive property of ormats that will let us recognize them at a glance?

If we could replace the OR-sum with the ordinary arithmetic sum, the answer would be yes. Permutation matrices have the handy property that all rows and columns sum to 1. An arithmetic sum of r permutation matrices has rows and columns that all sum to r. (It is a semi-magic square.) The converse is also true (though harder to prove): If a matrix of nonnegative integers has rows and columns that all sum to r, it is a sum of r permutation matrices. This fact yields a simple test: Sum the rows and the columns and check for equality.

Unfortunately, the trick won’t work for ormats, because the boolean OR operation throws away even more information than summing does. Because 0 ∨ 1 = 1 ∨ 0 = 1 ∨ 1, infinitely many sets of operands map into the same result, and there’s no obvious way to recover the operands or even to determine how many permutation matrices entered into the OR-sum.

Maybe there’s some other clever trick for recognizing ormats, but I haven’t found it. Let me make the question more concrete. Below are three (0,1) square matrices. Two of them are ormats but the third is not. Can you tell the difference?

three-puzzle-matrices.png

If it’s so hard to recognize an ormat, how did I count the ormats among a bunch of randomly generated (o,1) matrices? By hard work: I reconstructed the set of permutations allowed by each matrix. Visualize a permutation as a path threading its way through the matrix from left to right, connecting only non-zero elements and touching each column and each row just once. When you have drawn all possible permutation paths, check to see if every non-zero element is included in at least one path; if so, then the matrix is an ormat. Note that this is not an efficient recognition procedure. In the worst case (namely, an all-ones matrix), there are k! permutations, so this method has exponential running time. But k! is better than \(2^{k^2}\); and, besides, for sparse matrices the number of permutations is much smaller than k!. The 10 × 10 matrix presented as an example at the start of this post gives rise to 580 permutations, a manageable number. Here’s what they look like, plotted as a spider web of red paths across the bar chart.

ranges-with-paths

Every nonzero site is visited by at least one permutation path, so this set of rank-intervals is indeed valid.

This process of lacing permutations through a matrix finally brings me back to Question 1, about how to make sense of the NRC’s fuzzy ranking scheme. Let’s take a small example:

probability-example-1.png

Examining the graph above shows that A must rank either first or second—but which is more likely? In the absence of more-detailed information, it seems reasonable to assume the two cases are equally likely; we assign them each a probability of 1/2. Similarly, B has the rank-interval {1–3}, and so we might suppose that each of these three cases has probability 1/3. Continuing in the same way, we assign probabilities to every element of the matrix.

probability-example-2.png

But wait! This can’t be right; our probabilities have sprung a leak. Any proper set of probabilities has to sum to 1. Our procedure assures that each column obeys this rule, but there is no such guarantee for the rows. In row 1, we’re missing one-sixth of our probability, and in row 2 we have an excess of 1/2; row 4 comes up short by 1/3.

Is there any self-consistent assignment of probabilities for the elements of this matrix? Sure. As a matter of fact, there are infinitely many such assignments, including this one:

probability-example-3.png

I’ll return in a moment to the question of how I plucked those particular numbers out of the air, but note first what they imply about the ranking of items A through D. For item A, with the rank-interval {1–2}, the odds are two-to-one that it ranks first rather than second. B has the behavior we expected from the outset, with probability uniformly distributed over the three cases. But if you pick either C or D, each with the rank-interval {2–4}, your chance of getting second place is only 1/6, and half the time you’ll be in last place.

Where do these numbers come from? Instead of starting with the assumption that probability is uniformly distributed over each rank-interval, assume that each possible permutation of the ranks is equiprobable. For this matrix there are six allowed permutations: (1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 2, 4), (1, 3, 4, 2), (2, 1, 3, 4) and (2, 1, 4, 3). Observe that four of the six ordering put A first, and only two permutations place A second. We can also tally up such “occupation numbers” for all the other matrix elements:

probability-example-4.png

Dividing these numbers by the total number of permutations, 6, yields the probabilities given above.

We can do the same computation for the 10 × 10 example matrix, which turns out to allow 580 permutations:

ranges-with-path-weights.png

If you care to check, you’ll find that each column and each row sums to 580; dividing all the entries by this number yields a probability matrix with columns and rows that sum to 1 (also known as a doubly stochastic matrix).

This process of tabulating permutation paths recovers some of the information we would have gotten from the arithmetic sum of the permutation matrices—information that was lost in the OR-ing operation. But we get back only some of the information because we have to assume that each permutation included in the OR-sum appears only once. (This is just another way of saying that the allowed permutations are equiprobable.) There’s no particularly good reason to make this assumption, but at least it leads to a feasible probability matrix.

Is there any way of calculating the entries in the doubly stochastic matrix without explicitly tracing out all the permutation paths? I’m sure there is. I think the construction of the matrix can be approached as an integer-programming problem, and perhaps through other kinds of optimization technology. What seems less likely is that there’s some simple and efficient shortcut algorithm. But I could be wrong about that; there’s a lot of mathematics connected with this subject that I don’t understand well enough to write about (e.g., the Birkhoff polytope). I hope others will fill in the gaps.

Getting back to the assessment of grad schools—have we finally found the right way to understand those rank-intervals that the NRC promises to publish any day now? My sense is that a semi-magic square (or, equivalently, a doubly stochastic matrix) will give a less-misleading impression than a simple eyeball sorting on the spans or midpoints of the rank-intervals. But what a lot of bother to get to that point! How many prospective grad students are going to repeat this analysis?

Acknowledgment: Thanks to Geoff Davis of PhDs.org for introducing me to this story. PhDs.org will have the new ratings as soon as the NRC releases them, and may even find a way to make them intelligible! Disclaimer: I’ve done paid work for the PhDs.org web site (but this is not a paid endorsement).

Update 2010-07-27: If you’ve gotten this far, please read the comments as well. A number of commenters have provided important insights and context, which have helped me understand what’s going on in the matrices I’ve been calling ormats. But I’m still a bit murky about the best way to recognize and count them. I’m not sure that publishing my still-murky thoughts is terribly helpful, but maybe someone else will read what follows and give us a dazzling, gemlike synthesis.

For the ormat-recognition problem (Question 4 above), three basic approaches have been mentioned: enumerating the permutation paths through the matrix, examining matrix minors, and looking for perfect matchings in a bipartite graph defined by the matrix. It seems to me that all of these methods are doing the same thing.

Start with Barry Cipra’s method of minors. The basic operation is to choose a nonzero matrix element, then delete the row and the column in which that element occurs. You then apply the same operation to the remaining, smaller matrix.

In tracing permutation paths, we’re looking for sequences of nonzero elements, drawing one element from each column and each row. A way of organizing this search is to choose a nonzero element and then, after recording its location, delete the corresponding column and row, so that no other elements can be chosen from that column or row.

In the method based on Hall’s theorem, as explained by John R., we view the ormat as the adjacency matrix of a bipartite graph, where every nonzero element designates an edge connecting a row vertex to a column vertex. To find a matching, we delete an edge, along with the two vertices it connects (and also all the other edges incident on those vertices). Then we recurse on the smaller remaining graph. (See further update below.) If you translate this operation on the graph back into the language of matrices, deleting an edge and its endpoints amounts to deleting a row and a column of the adjacency matrix.

I am not asserting that these three algorithms are all identical, but they all rely on the same underlying operation. To say more, we would need to consider the control structure of the algorithms—how the basic operations are organized, how the recursion works, all the details of the bookkeeping. I don’t trust myself to make those comparisons without trying to implement the three methods, which I have not yet done. However, at this point I just don’t see how any method can guarantee correct results without something resembling backtracking (or else exhaustive search through an exponential space). After all, we’re not looking for just one matching in the graph, or one decomposition into matrix minors, or one permutation path; we have to examine them all.

Here’s a further hand-wavy argument for the essential difficulty of the task. For a (0,1) matrix, the number of permutation paths that avoid all zero entries is equal to the permanent of the matrix. Computing the permanent of such a matrix is known to be #P-complete.

Update 2010-07-31: With lots of help from my friends, I think I finally get it. Although there could be as many as k! permutation paths in a k × k matrix, you don’t need to examine all of the paths to decide whether or not the matrix is an ormat. It’s enough to establish that one such path passes through each nonzero element. This is what the algorithm based on Hall’s theorem does. As Frans points out in a comment below, I misunderstood the essential nature of that algorithm (in spite of having it explained to me several times). There is no recursive deconstruction into progressively smaller matrix minors; instead, we just loop over all the nonzero elements of the matrix, find the minor associated with each such element, then check for a perfect matching in the minor. (Still more refinements are possible—but already we have a polynomial algorithm.)

With this efficient recognizer predicate, it’s easy to measure the proportion of ormats in random matrices at larger values of k:

fraction-of-ormats-k25.png

As expected, the fraction of ormats approaches 1 beyond about k = 20.

So much for identifying ormats. I am still unable to extend the series of exact counts beyond k = 5. The tabulations for random (0,1) matrices suggest that for k = 6 there should be about 20 billion ormats, and counting that high is just too painful. I need to work out the symmetries of the problem.

As far as I can tell, assigning exact probabilities to the nonzero matrix elements requires a full enumeration of all the permutation paths, and thus a calculation equivalent to the permanent. There may be a useful approximation.

Barry Cipra asks a really good question: The permanent tells us the maximum number of permutations that could possibly be included in a given ormat, but what is the minimum number? A naive upper bound is the number of 1s in the matrix, but I don’t see an easy path to an exact count. But enough for now.