Archive for December, 2010

Googling the lexicon

Monday, December 20th, 2010

In 1860, when work began on the New English Dictionary on Historical Principles (better known today as the Oxford English Dictionary), the basic plan was to build an index to all of English literature. Volunteer readers would pore over texts and send in paper slips with transcribed quotations, each slip showing a word in its native context. The slips of paper were collected in a “scriptorium,” where they were sorted alphabetically and became the raw material for the work of the lexicographers. The project was supposed to be completed in 10 years, but it took almost 70. Some 2,000 readers contributed 5 million quotation slips, citing phrases from 4,500 published works.

James Murray, principal editor of the OED from 1878 to 1915, poses in his scriptorium

Google and Harvard have now given us a new index to the corpus of written English—and they’ve thrown in a few other languages as well. The data cover more than 500 billion word occurrences, drawn from 5,195,769 books (estimated to be 4 percent of all the books ever printed). The entire archive is being made available for download into your home scriptorium, under a Creative Commons license.

The project was announced December 16th with the online release of a paper to appear in Science. Here’s a quick rundown:

Publication: The Science article is “Quantitative Analysis of Culture Using Millions of Digitized Books.” It is supposed to remain freely available to nonsubscribers. See also the supplementary online material.

Authors: Jean-Baptiste Michel and Erez Lieberman Aiden of Harvard, with a dozen co-authors: Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker and Martin A. Nowak.

Languages: English, Chinese, French, German, Russian and Spanish. There are actually five archives for English, based on various subsets and intersecting sets of the underlying texts (e.g., British vs. U.S.).

Data format: The OED quotations were meaningful hunks of text—typically a sentence. Here we get n-grams. A 1-gram is a single word or other lexical unit, such as a number. An n-gram is a sequence of n consecutive 1-grams. The Harvard-Google collaboration has compiled lists of n-grams for values of n between 1 and 5. Thus the 1-gram files are lists of single words, and the other files give snippets of text consisting of two, three, four or five words. For each n-gram we learn the number of occurrences per year from 1550 through 2008 as well as the number of pages on which the n-gram is found and the number of books in which it appears.
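To make the definitions concrete, here is a minimal Python sketch of what an n-gram is and of how one record might be read. The names Record, parse_record and ngrams are my own inventions, and I am assuming that the raw files separate the five fields with tabs; check a real file before trusting that.

```python
from typing import NamedTuple

class Record(NamedTuple):
    ngram: str
    year: int
    occurrences: int  # times the n-gram appears in books from that year
    pages: int        # pages on which it appears
    books: int        # books in which it appears

def parse_record(line):
    """Split one line of an n-gram file (tab-separated, by assumption)."""
    ngram, year, occurrences, pages, books = line.rstrip("\n").split("\t")
    return Record(ngram, int(year), int(occurrences), int(pages), int(books))

def ngrams(tokens, n):
    """Return every run of n consecutive 1-grams as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The four 1-grams of a short phrase yield three 2-grams.
print(ngrams("the quick brown fox".split(), 2))
print(parse_record("COFFEE\t1876\t288\t270\t167"))
```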

•     •     •

If you want to play with this new toy, the place to begin is the Google n-gram viewer. At this site you submit a query to an online copy of the database and get back a graph showing normalized n-gram frequencies for a selected range of years. Here’s a search on the days of the week. (The graphs from the n-gram viewer are very wide, and bit-player is very skinny. I’ve squeezed the graphs horizontally; hover on them to unsqueeze. (See nifty effect in Safari, Chrome, maybe other Webkit browsers.))

days-of-week-1800-2008.png

The results are not surprising, I think, but they’re interesting. “Sunday” is an outlier. “Monday” was once the next-most-often-mentioned day, but around 1860 it was overtaken by Saturday, and now it has also fallen behind Friday. As for the middle of the week—nobody cares about those days.

Here is a collection of “theory” bigrams: “number theory”, “set theory”, “group theory”, “graph theory”, “string theory”, “chaos theory”, “catastrophe theory”, “K theory”. Which of them had the highest frequency in the 20th century? (I guessed wrong.)

theory-1900-2008.png

(By the way, the graph above has a strange lump in the first decade of the 20th century. I think it’s a metadata error. The same anomalous blip turns up for many other search terms, such as “transistor” and “Internet”. Somewhere along the way, a batch of books from 2005 were recorded as having been published in 1905. (Geoff Nunberg at Language Log has pointed out many other problems with Google Books metadata.))

Here are a few magazines I have written for.

magazines-1900-2008.png

I’m afraid there’s a dismal pattern in this data: Publications reach their peak and begin declining soon after I arrive. (But the most recent plunge at Scientific American is not my doing.)

The graph below offers some tech trends. It appears we have officially entered the age of the gigabyte, and terabytes are coming on strong:

bytes-1970-2008.png

Next is a list of rare (even nonexistent) words: “uroboros”, “widdershins”, “abacot”, “baragouin”, “funambulist”, “futhark”, “gongorism”, “hapax legomenon”, “hypnopompic”:

rare-words-1800-2008.png

It’s curious that so many of these archaic-looking terms seem to be increasing in frequency.

At the other end of the spectrum are some very common words:

common-words-1800-2008.png

Again there’s a mild surprise here: The ordering of these words in the Google Books corpus apparently differs from that of the list usually cited.

•     •     •

I think the n-gram browser is great fun, but it provides access to only one aspect of the data set: We can plot the normalized frequency of specific n-grams as a function of time. There are many other kinds of questions one might ask about all these words. For starters, I’d like to invert the query function and find all n-grams that have a given frequency. (An obvious project is to compile a list of the commonest words and phrases in the corpus.)
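A rough sketch of the inverted query, in Python: add up each n-gram's yearly counts, then ask either for the commonest entries or for everything inside a frequency band. The function names are hypothetical, the tab-separated layout is an assumption, and a Counter held in memory is plausible for the 1-grams but probably not for the larger files.

```python
from collections import Counter

def total_counts(lines):
    """Sum the per-year occurrence counts for each n-gram."""
    totals = Counter()
    for line in lines:
        ngram, _year, occurrences, _pages, _books = line.rstrip("\n").split("\t")
        totals[ngram] += int(occurrences)
    return totals

def in_frequency_band(totals, low, high):
    """Return the n-grams whose total count lies in [low, high], sorted."""
    return sorted(ng for ng, n in totals.items() if low <= n <= high)

# Two rows from the sample further down, plus one made-up row (brake, 1775).
sample = [
    "brake\t1774\t12\t10\t7",
    "brake\t1775\t8\t8\t5",
    "COFFEE\t1876\t288\t270\t167",
]
totals = total_counts(sample)
print(totals.most_common(1))
print(in_frequency_band(totals, 10, 50))
```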

At an even more elementary level: What is the distribution of word lengths in English?
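One way to approach the word-length question, sketched in Python. It assumes a mapping from each word to its total number of occurrences has already been built, and it weights each length by those occurrences rather than counting each distinct word once. The function name is hypothetical.

```python
from collections import Counter

def length_distribution(totals):
    """Map word length to its share of all word occurrences.

    totals maps each word to its total occurrence count.
    """
    by_length = Counter()
    for word, occurrences in totals.items():
        by_length[len(word)] += occurrences
    grand_total = sum(by_length.values())
    if grand_total == 0:
        return {}
    return {length: n / grand_total for length, n in sorted(by_length.items())}

# Toy counts, invented for illustration.
print(length_distribution({"a": 1, "to": 1, "of": 2}))
```

Counting distinct words instead would give a very different curve, since the OCR debris in the long tail is mostly made up of long strings that each occur only a few times.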

Here’s another question: In the formula “I [verb] you”, what are the commonest verbs? The information needed to answer this question is present in the 3-gram files, but the Google viewer does not provide a means to get at it.
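Something like the following Python sketch might do the job. It assumes the 3-gram files are tab-separated and that the words inside an n-gram are separated by single spaces; the function name and the sample rows are invented. Note that it tallies whatever sits in the middle slot, so non-verbs such as "and" would need to be filtered out afterward.

```python
from collections import Counter

def middle_words(lines, first="I", last="you"):
    """Tally the middle word of every 3-gram of the form: first X last."""
    tally = Counter()
    for line in lines:
        ngram, _year, occurrences, *_rest = line.rstrip("\n").split("\t")
        words = ngram.split(" ")
        if len(words) == 3 and words[0] == first and words[2] == last:
            tally[words[1]] += int(occurrences)
    return tally

# Made-up rows in the assumed format, not real corpus counts.
sample = [
    "I love you\t1990\t50\t48\t30",
    "I love you\t1991\t60\t55\t33",
    "I thank you\t1990\t40\t40\t25",
    "We love you\t1990\t9\t9\t9",
]
print(middle_words(sample).most_common())
```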

With appropriate software, the n-grams could also be used for language synthesis—generating highly plausible generic gibberish through a Markov process.
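Here is a minimal Python sketch of such a Markov process, driven by bigram counts. It assumes the 2-gram totals have already been collected into a dictionary keyed by word pairs; the function names and the toy counts are invented. Each next word is drawn with probability proportional to how often it follows the current word.

```python
import random
from collections import defaultdict

def build_chain(bigram_counts):
    """Map each word to parallel lists of its successors and their counts."""
    chain = defaultdict(lambda: ([], []))
    for (first, second), occurrences in bigram_counts.items():
        successors, weights = chain[first]
        successors.append(second)
        weights.append(occurrences)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from start, stopping at length words or a dead end."""
    words = [start]
    while len(words) < length and words[-1] in chain:
        successors, weights = chain[words[-1]]
        words.append(rng.choices(successors, weights=weights, k=1)[0])
    return " ".join(words)

# Toy counts, invented for illustration.
counts = {("the", "bird"): 3, ("the", "flock"): 1,
          ("bird", "flew"): 2, ("flock", "flew"): 2}
print(generate(build_chain(counts), "the", 3, random.Random(2010)))
```

Using the 5-grams to condition on four preceding words should give more plausible gibberish, at the cost of a much larger table.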

Still another idea is to explore the history of spelling errors and typos over the centuries. Did the introduction of the qwerty keyboard lead to different kinds of errors? How about the later introduction of spell-checking software, and the concomitant decline of proofreading as a trade?

Furthermore, the n-gram files are full of numbers as well as words. Can we learn anything of economic interest by charting the prevalence of numbers that look like monetary amounts? (It’s easy to search for specific strings of digits, such as “$9.99”, but I would like to treat these values as numbers rather than sequences of digits, so that “0.99”, “.99” and “0.99000” would all be numerically equal.)
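Python's decimal module offers one way to get this behavior. The sketch below uses a hypothetical helper, as_amount, and it makes two assumptions of my own: that a leading dollar sign and any commas can simply be stripped, and that anything Decimal cannot parse is not an amount.

```python
from decimal import Decimal, InvalidOperation

def as_amount(token):
    """Return the token's numeric value as a Decimal, or None if it has none."""
    text = token.lstrip("$").replace(",", "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    # Decimal also parses strings such as "NaN" and "Infinity"; reject those.
    return value if value.is_finite() else None

for token in ["0.99", ".99", "0.99000", "$9.99", "gab"]:
    print(token, as_amount(token))
```

Decimals that are numerically equal compare equal and hash alike, so they can share a dictionary key when tallying occurrences by value.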

The way to carry out any of these projects is to download the full n-gram files and start writing software to explore them. I’ve taken my first step in that direction: I’ve bought a new disk drive to hold the data. Next I’m going to borrow a faster Internet connection to do the bulk of the downloading. After that, I must confess that I don’t have a lot of confidence about how best to proceed. The scale of these data sets is roughly a terabyte, which in this age of Big Data is no more than a warmup exercise. But it’s bigger than anything I’ve ever tried to grab hold of personally. I’ll be grateful for advice from those with more experience.

The n-gram lists come in fragments. The 1-grams, for example, are broken up into 10 files, each of which weighs about a gigabyte. Within each file the n-grams are listed alphabetically, but the distribution across files is random. (For example, file 0 has “gab”, “gaw” and “gay”, but “gag”, “gap” and “gas” are elsewhere.) Ordinarily, my first impulse would be to run a big merge-sort over these files, putting all the words in order. But maybe that’s exactly the wrong thing to do; keeping them scattered is a potential opportunity for multithread parallelism.
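If merging does turn out to be the right move, the fact that each file is already sorted means no full sort is needed; a k-way merge will do. Here is a sketch using heapq.merge from the Python standard library. It assumes tab-separated lines and, more riskily, that the files' alphabetical order agrees with Python's string comparison on the n-gram field. The function name and the counts in the sample rows are made up.

```python
import heapq

def merge_sorted(*streams):
    """Lazily merge already-sorted line streams by their n-gram field."""
    return heapq.merge(*streams, key=lambda line: line.split("\t", 1)[0])

# Two tiny stand-ins for sorted fragment files; counts are invented.
file_0 = ["gab\t1990\t5\t5\t4", "gaw\t1990\t2\t2\t2", "gay\t1990\t9\t9\t8"]
file_1 = ["gag\t1990\t7\t7\t6", "gap\t1990\t8\t8\t7", "gas\t1990\t9\t9\t9"]
for line in merge_sorted(file_0, file_1):
    print(line.split("\t", 1)[0])
```

Because the merge is lazy, open file objects can be passed in place of the lists and only one line per file is held in memory at a time.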

Even casual browsing through the raw n-gram files is quite a revelation. They really are raw. Word frequencies are the prototypical example of a Zipf distribution, which has notoriously long tails. It follows that when you choose an entry at random from one of the n-gram files, you are likely to be waaaaaaay out in the aberrant fringe, looking at symbol strings that you might or might not recognize as English words. Here are 20 lines selected at random from a 1-gram file:

1-GRAM          YEAR   COUNT   PAGES   BOOKS
lilywhites      1994       1       1       1
Carneri         2002      24      24      12
Thurh           1971       1       1       1
Elsee           1832       5       5       4
cFMte           2008      12      12      12
COFFEE          1876     288     270     167
APO3            1990       6       3       1
Odumbara        1963      15      15       6
connubialibus   1967       1       1       1
Pickerel        1900      65      54      34
fubje&ed        1757       6       6       5
nader           1971      34      29      19
fiyled          1782       1       1       1
existfed        1993       2       2       2
Soveit          2007       1       1       1
monongahela     1939       4       4       4
suffeiing       1851       6       6       6
brake           1774      12      10       7
ofBrasenose     1951       3       3       3
Horas           1798       3       3       2

This is a pretty strange stew. There are obscure words, several proper names and abbreviations, as well as quite a few misspellings and nonstandard capitalizations. But the oddities that stand out most sharply have another origin: They are errors of optical character recognition. A word that was correct in the original document has gotten all fubje&ed up in the course of scanning. Of the 20 items listed above, 5 appear to be marred by OCR errors, so this is not just a matter of minor contamination. My best guess is that the word recorded as “fubje&ed” appeared in the 1757 book as:

subjected.png

with an initial long “s” that was assimilated to an “f” and a “ct” ligature that the OCR program confused with an ampersand.

The files are rife with such problems. Another case that caught my eye was “quicro”. As a Scrabble player, I ought to know that 18-point word! It turns out to be an OCR error for the Spanish verb “quiero”. And why is there a Spanish word in an English lexicon? Well, when I ran a search at Google Books, the top hit for “quicro” was Robert Southey’s Commonplace Book, published in 1850, which is properly classified as an English work even though it includes many long passages of Spanish and Portuguese verse.

Should we worry about such distractions? The real “quiero” is roughly 200 times as frequent as “quicro”, so the OCR error will not have a major statistical impact. Perhaps the most disturbing effect of the OCR noise is that it greatly lengthens the already-long tail of the frequency distribution. Suppose a word appears 100,000 times in the corpus. If the OCR process is 99.9 percent accurate, 100 instances of the word will be read incorrectly; in the worst case, each erroneous reading could be different, adding 100 spurious entries to the lexicon.

Cleaning up this mess looks like a major undertaking. (If it were easy, Google would have done it already.)

The long tail of the distribution has already been truncated to some extent: No n-gram is included in the data set unless it appears at least 40 times in the corpus of texts. For many kinds of analysis, an even higher threshold might be appropriate—excluding all terms that fall below 400 or maybe even 4,000 occurrences. But that still won’t eliminate all the OCR errors: “quickfilver” appears 5,411 times.

•     •     •

Before I go, I want to tell one more story, which is a mystery story. It all began when I was trying out a few seasonal phrases:

christmas-1850-2008.png

Note the distinctive and dramatic dip in all three frequencies starting in the late 40s or early 50s and continuing into the 70s, with an eventual strong recovery in the 90s. The frequencies fall by roughly 50 percent, then return to the neighborhood of their earlier peak. What’s going on here? Did the Grinch steal Christmas, and then give it back? Was there a backlash against Jingle Bells in the era of disco dancing and Watergate?

As a check on these results I tried a few terms associated with other holidays, unrelated to the midwinter madness. The same pattern emerged:

holidays-1850-2008.png

As a further control, I added still more phrases, this time with no obvious connection to any holiday whatever. All that the terms have in common is that they fall into roughly the same range of frequencies:

mystery-dip-1850-2008.png

The slump in the 60s and 70s is still visible in this augmented set of words. The dip looks rather like one of those episodes of mass extinction in the fossil record. And it provokes the same question: What caused it? And what ended it?

Of course not all words and phrases follow this pattern. Because the curves represent normalized frequencies—the count of each n-gram’s occurrences divided by the total number of all n-gram occurrences—a valley in any one n-gram’s frequency must be balanced by a peak for some other word or words. This fact leads to a hypothesis about the cause of the Great Postwar Santa Depression. The 50s and 60s were a period in which technical and scientific publishing bloomed, thereby diluting the share of printed books that would be likely to mention phrases such as “Santa Claus” and “vacuum cleaner”; instead we got volumes full of “asymptotic freedom”, “chymotrypsin inhibitor” and “field-effect transistor”.
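The dilution argument is easy to see in a toy calculation. In the Python sketch below the function name and all the numbers are invented: the phrase's raw count holds steady while the yearly total of all n-gram occurrences doubles, and the normalized frequency falls by half.

```python
def normalized_frequency(counts_by_year, totals_by_year):
    """Divide each year's count by that year's total n-gram occurrences."""
    return {
        year: occurrences / totals_by_year[year]
        for year, occurrences in counts_by_year.items()
        if totals_by_year.get(year)  # skip years with no recorded total
    }

# Invented numbers: a steady raw count against a doubling corpus.
santa = {1950: 10, 1970: 10}
all_ngrams = {1950: 1000, 1970: 2000}
print(normalized_frequency(santa, all_ngrams))
```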

The trouble with this notion is that the explosive growth of the sci/tech vocabulary did not end in the 1980s or 90s; thus it’s hard to understand how Santa has made such a spectacular comeback.

Here’s one wild guess at an explanation. Most of the books that Google has scanned come from university libraries, and so I wonder if we might be observing an artifact of the acquisition and retention policies of university librarians. Suppose that libraries tend to buy a broad cross-section of newly published titles, but when shelf space gets scarce, the half-life of Quantum Information Theory is longer than that of Frosty the Snowman. As a result of this selective culling, Santa books from the 60s have melted away, but those from the 90s have not yet disappeared. Could such practices account for the dip and the recovery? If you have a better idea, please share.

A square yard of idea

Wednesday, December 15th, 2010

After a year’s absence, I am home again in the pages of American Scientist. I want to thank the six friends and colleagues who kept the Computing Science department going while I was away. Here are their articles:

A Tisket, a Tasket, an Apollonian Gasket
Fractals made of circles do funny things to mathematicians
Dana Mackenzie
January–February 2010

Avoiding a Digital Dark Age
Data longevity depends on both the storage medium
and the ability to decipher the information
Kurt D. Bollacker
March–April 2010

The Bootstrap
Statisticians can reuse their data to quantify
the uncertainty of complex models
Cosma Shalizi
May–June 2010

E Pluribus Confusion
There’s more than one way to turn census data
into congressional seats
Barry Cipra
July–August 2010

The Great Principles of Computing
Computing may be the fourth great domain of science
along with the physical, life and social sciences
Peter J. Denning
September–October 2010

Recreational Computing
Puzzles and tricks from Martin Gardner inspire math and science
Erik D. Demaine
November–December 2010

•     •     •

My new column, now available on the web and coming soon in good old ink and paper, is:

Flights of Fancy
How birds (and bird-watchers) compute the behavior
of a flock on the wing
Brian Hayes
January–February 2011

I have written before on the impressive aerial maneuvers of bird flocks, once as part of an American Scientist column in 1999 and twice here at bit-player, first in 2007 and then most recently in 2009. (Alert readers will notice that I have even retreaded a title.)

The new column focuses on the work of a European collaboration called STARFLAG, which has figured out how to track the position and velocity of individual birds in large flocks of starlings, like the gathering seen below in a photograph made near the main railroad terminal in Rome.

flocks of starlings photographed above the railroad terminal in Rome

For the rest of the STARFLAG story, see the column; here I want to say a word about an earlier analyst of bird flocks, Edmund Selous (1857–1934). Selous was an English naturalist and author of at least 20 books, mostly on birds but with a few on other animals and insects. One of his last books, Thought-Transference (or What?) in Birds, published in 1931, argued that coordinated movements of birds in flocks might best be explained through some kind of telepathy. This proposal of paranormal communication between bird brains has made Selous a figure of fun. I’ve contributed my own share of mockery, and I don’t take back a word of it; the whole idea of bird telepathy is dopy. On the other hand, what a patient and determined and creative observer this man was!

Thought-Transference is essentially a diary, a compendium of Selous’s field notes, many of them written down while he was crouched under hedges or hiding in a copse of firs:

April 4, 1923. (Langton Herring).—A most wretched day, the whole sky, and one may almost say the whole air, one great damp cloud, dissolving at intervals into misty rain. However, I was abroad in it and saw, at some distance, a number of small birds which, when I put up the glasses, proved, to my great joy, to be goldfinches, and I watched them for the greater part of the afternoon. They flew about from one part to another of the hillside, in an erratic, uncertain sort of manner, coming down and feeding at intervals, hovering for a little just over the ground before they descended upon it, and then moving rapidly about with little sprightly, never-changing hops. A constant feature of the flights was their all turning in the air together—so it seemed—and shooting back in the direction from which they had come, and one could never say the moment at which they might not do this….

There must have been, I think, a hundred birds as a minimum in the flock when it was at its fullest and undivided, and the sudden flashing of red heads and yellow wing-patches, when near enough to get the effect, was very striking…. A little before I left they flew to the small plantation of Scotch firs, in which I stood, and there was every tree full of goldfinches—at least it seemed so—all of them warbling and twittering in a most delightful manner. I thought they had come to roost, it being about five and the day so dull, but presently they flew off again. This was an interesting little bit for me—what would it have been with sunshine? But they were their own. Dear, pretty, sprightly little cheery birds!

But how did these goldfinches rise and turn and twist and come down again, as though it was all what they all wanted, just at the same moment, to do? Their little minds must act together. Though I cannot understand it, yet it seems to me that they must think collectively, all at the same time, or at least in streaks or patches—a square yard or so of idea, a flash out of so many brains.

I first learned of Selous and his work from Frank Heppner of the University of Rhode Island, who helped me with both the 1999 column and the recent one. When Heppner noticed that I had identified Selous as “an intrepid English bird-watcher,” he suggested that “birder” might be a more appropriate (or less offensive) description. This was doubtless good advice, but I resisted it. As far as I can tell, it was Selous himself who invented the term “bird watcher,” and it’s definitely what he called himself.

Finally, I have to confess an embarrassing error. In 2007, when I wrote here about my own Selous-like afternoon of admiring the flocks on a barren field in North Carolina, I described the event as a great congregation of starlings. Recently I looked more closely at my photographs. Although there are indeed a few European starlings (Sturnus vulgaris) in the flock, the vast majority of the birds are brown-headed cowbirds (Molothrus ater). Not even similar. So don’t call me a birder or a bird-watcher; I’m a bird-bumbler.

brown-headed-cowbirds-2053.jpg