Archive for the ‘linguistics’ Category

How Did the Stars Get Their Points?

Thursday, December 8th, 2011

a field of bright stars and dust clouds in the Large Magellanic Cloud, photographed by the Hubble Space Telescope, courtesy Wikipedia

Those are hot young stars in the Large Magellanic Cloud—one of the puppy-dog galaxies that follow the Milky Way around—photographed by the Hubble Space Telescope. (Detail cropped from a Wikipedia image.) Note that four rays seem to emanate from each of the brightest stars. The rays are not, of course, true beams of light radiating in the four cardinal directions. They are an artifact of the telescope’s structure: a diffraction pattern created by the four vanes of the “spider” that supports the secondary mirror within the barrel of the telescope. Many other telescopes have three-vane spiders that yield a six-pointed diffraction pattern.

Stars, engraving by M. C. Escher, from Wikipedia

Recently, in my lovable know-it-all manner, I was holding forth on the idea that this diffraction effect (a mere accident of instrumental design) might actually be the source of the familiar iconographic star, with its five or six angular points. In other words, we think of a star as something spiky, poking out in various directions, because we’re used to seeing telescopic images with this diffractive defect. At right is M. C. Escher’s interpretation of what stellar means. For other examples see the Hollywood Walk of Fame or the flags of the U.S. and the E.U. and those of more than 50 other countries, not to mention Texas.

Well, it turns out my cute idea about the cultural influence of telescopic photos is utterly bogus. If you need any evidence, the engraving reproduced below should suffice. It shows the muse Astronomia (a.k.a. Urania) pointing out the moon and stars to Ptolemy. The stars are five- or six-pointed scribbles that beg to be called asterisks. The engraving appears in the Margarita Philosophica of Gregor Reisch, published in 1504, which is a full century before Galileo turned his telescope to the heavens. Whatever those engraved stars are, they are not artifacts of telescope spider vanes.

Ptolemy and Astronomia with stars and moon from Margarita Philosophica 1504

The dictionary offers further evidence. For example, the starfish (genus Asterias, class Asteroidea) has had that name at least since 1538. And the asterisk—the typographical mark—has a citation in the OED going all the way back to 1382. These terms make sense only if the concept of a star was already associated in most people’s minds with a spiky polygon, rather than a dimensionless point of light in the night sky.

And that’s what puzzles me, because the stars really do appear to be dimensionless points of light. When I stare at the sky, I see some twinkling going on, but nowhere do I see pentagrams and hexagrams pinned to black velvet, or even the slightest hint of angularity. So where did this tradition get started? Did the Greek word ?????? already convey a sense of symmetrical spikiness, so that ancient Athenians would have understood why we call certain flowers asters? Is the same iconography prevalent in other cultures, say in China? Those 50+ star-studded flags (including China’s) suggest that the conventional stellar icon is at least recognized globally, but they don’t tell us where and when it all began. After my telescopic theory fell apart, I had a second hypothesis, namely that the star icon might come from the symbol-happy world of astrology, but I’ve found no support for this idea either. So I throw the question out to the starry void: How did the star get its points?

Addendum 2011-12-16: The illuminating comments below on ancient Egyptian paintings of stars would appear to settle part of my question: Well over 2,000 years ago, at least some people were already drawing stars in much the same way a modern kindergartner does. What I’d still like to know is why. Yes, there are many plausible just-so stories, but you’d think that someone at the time might have offered a word of explanation.

The other day I spent a pleasant afternoon leafing through The History and Practice of Ancient Astronomy, by James Evans (New York: Oxford University Press, 1998). It’s quite a thorough introduction to Greek and Egyptian ideas about the sky, but I did not find an answer to my question about the points of stars. The astronomers of that period were engrossed in charting the positions and motions of the stars, but one gets the impression they had no interest whatever in the nature of those bright objects—what they look like up close, what they’re made of, why they shine. Of course I don’t really believe the ancients were so lacking curiosity. Surely Aristotle holds forth somewhere on the substance of the stars? But I haven’t found it yet.

Zipfy n-grams

Thursday, April 28th, 2011

In the 1930s and 40s George Kingsley Zipf studied word frequencies in several languages and came up with a general observation: If you sort all the words from commonest to rarest, the frequency of the word at rank r is proportional to 1/r. This law has a distinctive graphical interpretation: Plotting the logarithm of frequency against the logarithm of rank yields a straight line with a slope of –1.

Do the Google n-grams obey Zipf’s law? When commenter Nick Black raised this question, I didn’t know the answer. Now I do.

Here’s a log-log plot of the number of occurrences of a word w as a function of w’s rank among all words:

Zipf curve for n-gram abundance

We don’t have a straight line here, but it’s not too far off. You can get a good fit with two piecewise straight lines—one line for the top 10,000 ranks and the other for the rest of the 7.4 million words.

Zipf curve with piecewise linear fitted lines

The upper part of the distribution is quite close to Zipf’s prediction, with a slope of –1.007. The lower part is steeper, with a slope of –1.66. Thus the communal vocabulary seems to be split into two parts, with a nucleus of about 10,000 common words and a penumbra of millions of words that show up less often. The two classes have different usage statistics.
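A two-segment fit of this kind can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the plot above: the function names `loglog_slope` and `two_segment_slopes` are made up for the example, the breakpoint is supplied by hand rather than estimated, and the data are synthetic.

```python
import math

def loglog_slope(ranks, counts):
    """Least-squares slope of log10(count) against log10(rank)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def two_segment_slopes(counts, break_rank):
    """Fit one line to ranks 1..break_rank and another to the rest.

    counts must already be sorted from commonest to rarest.
    """
    ranks = list(range(1, len(counts) + 1))
    head = loglog_slope(ranks[:break_rank], counts[:break_rank])
    tail = loglog_slope(ranks[break_rank:], counts[break_rank:])
    return head, tail

# Synthetic counts (NOT the n-gram data): slope -1 up to rank 100, slope -2 after.
synthetic = [1e6 / r if r <= 100 else 1e8 / r**2 for r in range(1, 1001)]
head_slope, tail_slope = two_segment_slopes(synthetic, 100)
print(round(head_slope, 3), round(tail_slope, 3))
```

On real counts the choice of breakpoint matters, so one would want to try several values and compare the residuals of the two fits.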

I don’t have any clear idea of why language should work this way, but very similar curves have been observed before in other corpora. For example, Ramon Ferrer-i-Cancho and Ricard V. Solé report slopes of –1.06 and –1.97 in the British National Corpus, with the breakpoint again at roughly rank 10,000. (Preprint here.)

Incidentally, the Zipf plot is truncated at the bottom because the n-gram data set excludes all words that occur fewer than 40 times. Extrapolating the line of slope –1.66 yields an estimate of how large the data set would have been if everything had been included: 77 million distinct n-grams.
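The extrapolation can be sketched as a one-line calculation. The version below rests on a simplifying assumption that is not in the text: that the fitted line passes exactly through the last observed point (rank 7,380,256, count 40) and is extended down to a count of 1. The function name `extrapolated_vocabulary` is invented for the example.

```python
def extrapolated_vocabulary(last_rank, last_count, slope, floor_count=1):
    """Rank at which a power-law line through (last_rank, last_count) hits floor_count.

    The line is count(r) = last_count * (r / last_rank) ** slope,
    so solving count(r) = floor_count gives the expression below.
    """
    return last_rank * (floor_count / last_count) ** (1 / slope)

# 7,380,256 distinct 1-grams, cutoff of 40 occurrences, lower slope of -1.66.
estimate = extrapolated_vocabulary(7_380_256, 40, -1.66)
print(f"{estimate:,.0f}")
```

Under these assumptions the estimate comes out at roughly 68 million, the same order of magnitude as the 77 million quoted above but not identical to it; the difference presumably reflects the intercept of the actual fitted line, which this sketch does not know.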

The Library of Babble

Saturday, April 23rd, 2011

The new issue of American Scientist is out, both on newsstands and on the web. My “Computing Science” column takes up a topic I’ve already written about here on bit-player: the huge corpus of “n-grams” extracted from the Google Books scanning project and released to the public by a team from Harvard and Google. (The earlier bit-player items are titled “Googling the Lexicon” and “3.14.”)

After two blog posts and a magazine column, my faithful readers may have had enough of this subject—but not me. I can’t seem to get it out of my system. So I’m going to take this opportunity to publish some of the overflow matter that wouldn’t fit in the column. (And even this won’t be the end of the story. Stay tuned for still more n-grams.)

For the benefit of readers who have not been doting on my every word, here’s a precis. Google aims to digitize all the world’s printed books, and so far they have scanned about 15 million volumes (which is probably about one-eighth of the total). At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, with a dozen collaborators, have been working with digitized text from a subset of 5,195,769 Google book scans. Because of copyright restrictions they cannot release the full text, but they have extracted lists of n-grams, or phrases of n words each, for values of n between 1 and 5. The 1-grams are individual words (or other character strings, such as numbers and punctuation marks); the 2-grams are two-word phrases, and so on. Each n-gram is accompanied by a time series giving the number of occurrences of that n-gram in each year from 1520 to 2008. (For more background and technical detail, see the Science article by Michel, et al.)

If you want to trace changes in the frequency of a specific word over time, Google has set up an online Ngram Viewer that makes this easy. But you can also download the data set, which allows for many other kinds of exploration. The cost is some heavy lifting of multigigabyte files. So far I have worked only with English text (six other languages are also covered) and only with 1-grams, which form the smallest part of the data set.

Here are some basic facts and figures on the English 1-grams:

uncompressed file size (bytes) 9,672,200,350
number of distinct 1-grams 7,380,256
total 1-gram occurrences 359,675,008,445
number of distinct character codes 484
total character occurrences 1,515,454,264,550

That’s a lot of verbiage.

It’s worth pausing for a comment on those 484 distinct character codes. We tend to think of English as being written with an alphabet of just 26 letters, or 52 if you count upper case and lower case separately. Then there are numbers and marks of punctuation, and miscellaneous symbols such as $ and +. The original ASCII code had 95 printable characters. How do you get up to 484? Well, even though these files are derived from English-language books, a fair amount of non-English turns up in them. There’s Greek and Cyrillic and a smattering of Asian languages, as well as all the accented versions of Latin characters seen in Romance and Germanic and Slavic languages. There’s even a little mathematical notation. It’s actually surprising that the data set spans only 484 symbols; this is a small subset of the full Unicode spectrum.

Here is the length distribution of the 7 million distinct 1-grams:

graph of abundance vs. word length for the 7 million distinct 1-grams

Note that the abundance of shorter words is combinatorially limited. With an “alphabet” of 484 symbols, there cannot possibly be more than 484 single-character words, or 484² (234,256) two-character words. But this constraint becomes unimportant beyond length 3; for words of five or six letters, only a tiny fraction of all possible combinations are actually observed.
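The combinatorial ceiling is easy to tabulate. A minimal sketch, treating every one of the 484 character codes as usable in every position (an upper bound, not a claim about which strings actually occur); `max_distinct_words` is a name made up for the example:

```python
ALPHABET_SIZE = 484  # distinct character codes reported for the English 1-grams

def max_distinct_words(length, alphabet_size=ALPHABET_SIZE):
    """Upper bound on the number of distinct strings of a given length."""
    return alphabet_size ** length

for length in range(1, 7):
    print(length, f"{max_distinct_words(length):,}")
```

By length 3 the bound is already larger than the 7,380,256 distinct 1-grams in the whole data set, which is why the limit stops mattering for longer words.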

The corresponding distribution for the 359 billion word occurrences looks rather different:

graph of abundance vs. word length for the 359 billion word occurrences

This is roughly what you’d expect to see for a language that encodes information efficiently. As in a Huffman coding, the shortest words are very common, and the sesquipedalian ones are rare. The overall trend is generally linear, and it is remarkably so in the range of lengths from 5 to 11 or 12. Is this a well-known fact? (It is not the shape I would have guessed.) Words of length three and four stand out above the linear trend line. And I should mention that the abundance of single-character words is boosted in this tabulation by the inclusion of punctuation marks, which would probably not be counted at all in other studies of word length.
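The difference between the two graphs is only a matter of what gets added up: one tallies each distinct word once, the other weights each word by its number of occurrences. A small sketch with invented counts (the `sample` dictionary is made up, not drawn from the data set):

```python
from collections import Counter

def length_distributions(word_counts):
    """Return (distinct words per length, total occurrences per length)."""
    distinct = Counter()
    occurrences = Counter()
    for word, count in word_counts.items():
        distinct[len(word)] += 1          # each word counted once
        occurrences[len(word)] += count   # each word weighted by its abundance
    return distinct, occurrences

# Invented counts, for illustration only.
sample = {"the": 1000, "of": 600, ",": 900, "torque": 12, "sesquipedalian": 1}
distinct, occurrences = length_distributions(sample)
print(sorted(distinct.items()))
print(sorted(occurrences.items()))
```

Note that the comma lands in the length-1 bin, which is the punctuation effect mentioned above.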

The graph below is one that appears in my American Scientist column. I reproduce it here because I want to call attention to some curious features of the curves.

historical time series of distinct words and word occurrences

The graph shows the number of distinct words per year (blue) and the total word occurrences per year (red) over the past 200 years or so. I find it interesting that some major historical events are visible in this record. There are dips in both curves at the time of the American Civil War and during both World Wars of the 20th century. Presumably, book publishing languished during those years. There are also ripples that might be attributed to the crash of 1929 and the ensuing Great Depression, although they are less clear.

Other features of the curves don’t have such a ready-made historical explanation. I am particularly curious about the broad sag in the red curve (but not the blue one) from the late 1960s through the 1970s. Between 1967 and 1973, the number of word occurrences declined 12 percent, while the number of distinct words rose 1 percent. This is very strange: We continued to invent new words, but we didn’t make much use of them. Nowhere else do the two curves maintain opposite slopes for any extended period. I can’t explain it. I call it the great Nixonian slump.

I thought I might learn something about the slump by looking at a selection of specific words that exhibit this pattern of abundance—less frequent in the early 70s than in surrounding years. So I extracted about 70,000 of them and sorted them by overall abundance. In many ways it’s an interesting collection. At the very top of the list is Reagan, apparently in eclipse during the Nixon years. Then there’s a swarm of words connected with mechanical and automotive engineering: torque, rotor, stator, pinion, crankshaft, impeller, carburetor, alternator. All of these words might plausibly appear in the same books, so seeing them fade in and out together is not so surprising. Still, one would like to know what happened in book publishing to cause the decline. The 70s were a tough time for the auto industry; is that enough to explain it?

Looking for patterns in these words is fun, but the truth is I don’t believe they have anything to do with the overall Nixonian swoon. Most words have large fluctuations in abundance over a time scale of a decade or two; my extraction procedure merely identified a subset of words that happened to enter a trough at the same time. I could find similar sets for other periods.

In another attempt to explain the slump I divided the data set into halves. One half consists of the 100 most frequent words, which happen to account for almost exactly half of all word occurrences. The other 7,380,156 words make up the second half. Maybe the slump afflicted only common words, or only rare ones? The result was remarkably uninformative:

time series for the top and bottom halves of the frequency range

The two curves trace almost exactly the same time course. Or maybe that’s not so uninformative after all: It tells us that whatever phenomenon causes the slump, the effect is spread out over the entire vocabulary, not just words in a certain frequency range.
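Finding where to cut the vocabulary so that each part holds half of all occurrences is a running-sum exercise. A minimal sketch, with a made-up function name and toy counts:

```python
def words_covering_half(counts):
    """How many of the most frequent words are needed to reach half of all occurrences."""
    ordered = sorted(counts, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n, count in enumerate(ordered, start=1):
        running += count
        if running >= half:
            return n
    return len(ordered)

# Toy counts, not the real data: the top two words cover 70 of 100 occurrences.
print(words_covering_half([40, 30, 20, 10]))
```

Applied to the full list of 1-gram totals, the same loop would report the rank at which the cumulative count crosses the halfway mark.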

Here’s another just-so story I’ve been telling myself in an attempt to understand the Nixonian swoon. The 1960s is when phototypesetting began to displace the older technology of metal type. Suppose that some characteristic of early phototypeset books causes trouble for optical character recognition (OCR) systems. On reading such a book, the OCR program would report an exaggerated number of unique words (since instances of a given word could be read in various different erroneous ways) but the total number of word occurrences would remain constant (since every word is still recognized as some word). This isn’t exactly the pattern we’re seeing, but there’s one more factor to take into account. In the Harvard-Google data set, no word is included unless it appears at least 40 times. If the phototypeset text caused the OCR system to misread the same word in many different ways, some of those errors would fail to reach the 40-occurrence threshold and would just disappear. Thus the total occurrence count could decline even as the number of distinct words increased.
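The arithmetic behind this story can be made concrete with a toy model. Everything in the sketch below is an assumption made for illustration: the misread occurrences are split evenly among a fixed number of erroneous spellings, and the numbers are invented. Only the 40-occurrence threshold comes from the data set.

```python
THRESHOLD = 40  # minimum occurrences for an entry to be kept in the data set

def surviving_entries(correct, misread, variants):
    """Return (distinct entries kept, total occurrences kept) for one word.

    correct  -- occurrences read correctly
    misread  -- occurrences read incorrectly
    variants -- number of different erroneous spellings, assumed equally common
    """
    entries = [correct]
    if variants > 0:
        entries += [misread // variants] * variants
    kept = [c for c in entries if c >= THRESHOLD]
    return len(kept), sum(kept)

# Few variants: every misreading clears the threshold, so distinct words rise
# while the total stays put.
print(surviving_entries(9500, 500, 5))
# Many variants: each misreading falls below 40 and vanishes, so the total drops.
print(surviving_entries(9500, 500, 50))
```

A mixture of the two regimes across many words is what the hypothesis needs: some misreadings survive as new "words" while others drop out and take their occurrences with them.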

Lying awake in the middle of the night, I thought this story sounded really good. When I got up in the morning, I ran a test. If an unusual number of words are falling off the bottom of the distribution during the Nixon years, then we should also see a bulge just above the bottom, consisting of those words that just barely reached the threshold. But here’s the time series for the 15,518 words with exactly 40 occurrences:

time series for words with exactly 40 occurrences

There’s no hump in the late 60s. On the contrary, the curve looks much like all the rest, with the same sorry sag. Isn’t it annoying when mere fact overturns a perfectly lovely theory?

But enough of the Watergate era. I have a few more loose ends to tie up.

A graph in the magazine illustrates our collective fondness for round numbers—those divisible by 5 or 10. I remark in the text: “Dollar amounts are even more dramatically biased in favor of well-rounded numbers.” Here’s the evidence:

frequency of dollar amounts for numbers from $1 to $100

Dollar amounts mod $1 tell the same story:

prevalence of cents values between 1 and 99

I was not surprised at the prominence of $X.25, $X.50 and $X.75, but in this graph I had expected also to see a strong signal from prices a penny less than a dollar. That signal is detectable but not conspicuous. Apparently, items mentioned in books—perhaps the books themselves—are more commonly priced at $X.95.

Finally, I want to mention two small but troublesome anomalies in the downloadable 1-gram files. The Google OCR algorithms treat most marks of punctuation as separate 1-grams. Thus we can count how many periods, colons and question marks appeared in printed books over the years. But the encoding of two of these symbols was garbled somewhere along the way. The most abundant 1-gram in the entire data set—with 21,396,850,115 occurrences, or about 6 percent of the total—is listed in the files as the double quotation mark. In fact it should be identified as the comma. (In the web interface to the online Ngram Viewer, the comma is a separator, which may have something to do with the confusion in the downloadable files. The character is correctly given as a comma in the private files of Aiden and Michel.)

The second problem entry is even weirder. Loaded into a text editor, it looks like this:

                          """ "

Closer examination with a binary editor shows that the space between the third and fourth quote marks consists of two control characters, with hexadecimal values 0x15 (NAK) and 0x12 (DC2). I make no sense of this, and Aiden and Michel have not yet been able to help. All the same, I think I know how to fix it. If the entry for the double quotation mark is actually a comma, then something else has to be the double quote. This bizarre character string looks like a good candidate.
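A binary editor is not strictly necessary for this kind of inspection. The sketch below prints the code of each character in a string; the value of `entry` is a reconstruction from the description above (three quote marks, the two control characters, one more quote mark), not a copy of the actual file contents.

```python
# Assumed reconstruction of the odd 1-gram: """ + NAK + DC2 + "
entry = '"""\x15\x12"'

hex_codes = [f"0x{ord(ch):02X}" for ch in entry]
for ch, code in zip(entry, hex_codes):
    # repr() makes the invisible control characters visible as escapes
    print(code, repr(ch))
```

The quote marks show up as 0x22 and the two invisible characters as 0x15 and 0x12.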

3.14

Monday, March 14th, 2011

All those books that Google has been scanning for the past ten years are surprisingly rich in numbers as well as words. The Google Books data set released last December by a Harvard-Google team includes (by my count) 9,620,835,344 occurrences of 458,794 distinct numbers. (Plus another 31,293 numeric values that have dollar signs attached.)

In recognition of pi day, I want to zero in on some successive approximations to the world’s favorite irrational:

Pops 3

Pops 3 point 1

Pops 3 point 14

Pops 3 point 141

In tabular form here are the closest approximations found in the files, along with the abundance of each value:

3.141592 704
3.1415923 80
3.1415926 1141
3.14159265 1300
3.141592653 143
3.1415926535 286
3.141592653589 54
3.14159265358979 338
3.141592653589793 453
3.1415926535898 65
3.14159265359 177
3.1415926536 289
3.141592654 512
3.1415927 776
3.1415928 109
3.1415929 133
3.141593 843

The data set includes only items that appear at least 40 times in the collection of scanned volumes. Closer approximations to pi evidently fell below that threshold. In particular there is no sign of William Shanks’s famous 707-digit calculation, which was published in 1873. So, just for the sake of celebrating 3.14, here are 707 digits of pi—but unlike the product of Shanks’s many years of labor, I think these digits may be correct:

3.1415926535897932384626433832795028841971693993751058209
74944592307816406286208998628034825342117067982148086513
282306647093844609550582231725359408128481117450284102701
938521105559644622948954930381964428810975665933446128475
648233786783165271201909145648566923460348610454326648213
393607260249141273724587006606315588174881520920962829254
0917153643678925903600113305305488204665213841469519415116
09433057270365759591953092186117381932611793105118548074462
3799627495673518857527248912279381830119491298336733624406
5664308602139494639522473719070217986094370277053921717629
3176752384674818467669405132000568127145263560827785771342
7577896091736371787214684409012249534301465495853710507922
796892589235420200

Googling the lexicon

Monday, December 20th, 2010

James Murray, principal editor of the OED from 1878 to 1915, poses in his scriptorium

In 1860, when work began on the New English Dictionary on Historical Principles (better known today as the Oxford English Dictionary), the basic plan was to build an index to all of English literature. Volunteer readers would pore over texts and send in paper slips with transcribed quotations, each slip showing a word in its native context. The slips of paper were collected in a “scriptorium,” where they were sorted alphabetically and became the raw material for the work of the lexicographers. The project was supposed to be completed in 10 years, but it took almost 70. Some 2,000 readers contributed 5 million quotation slips, citing phrases from 4,500 published works.

Google and Harvard have now given us a new index to the corpus of written English—and they’ve thrown in a few other languages as well. The data cover more than 500 billion word occurrences, drawn from 5,195,769 books (estimated to be 4 percent of all the books ever printed). The entire archive is being made available for download into your home scriptorium, under a Creative Commons license.

The project was announced December 16th with the online release of a paper to appear in Science. Here’s a quick rundown:

Publication: The Science article is “Quantitative Analysis of Culture Using Millions of Digitized Books.” It is supposed to remain freely available to nonsubscribers. See also the supplementary online material.

Authors: Jean-Baptiste Michel and Erez Lieberman Aiden of Harvard, with a dozen co-authors: Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker and Martin A. Nowak.

Languages: English, Chinese, French, German, Russian and Spanish. There are actually five archives for English, based on various subsets and intersecting sets of the underlying texts (e.g., British vs. U.S.).

Data format: The OED quotations were meaningful hunks of text—typically a sentence. Here we get n-grams. A 1-gram is a single word or other lexical unit, such as a number. An n-gram is a sequence of n consecutive 1-grams. The Harvard-Google collaboration has compiled lists of n-grams for values of n between 1 and 5. Thus the 1-gram files are lists of single words, and the other files give snippets of text consisting of two, three, four or five words. For each n-gram we learn the number of occurrences per year from 1550 through 2008 as well as the number of pages on which the n-gram is found and the number of books in which it appears.

•     •     •

If you want to play with this new toy, the place to begin is the Google n-gram viewer. At this site you submit a query to an online copy of the database and get back a graph showing normalized n-gram frequencies for a selected range of years. Here’s a search on the days of the week. (The graphs from the n-gram viewer are very wide, and bit-player is very skinny. I’ve squeezed the graphs horizontally; hover on them to unsqueeze. (See nifty effect in Safari, Chrome, maybe other Webkit browsers.))

days-of-week-1800-2008.png

The results are not surprising, I think, but they’re interesting. “Sunday” is an outlier. “Monday” was once the next-most-often-mentioned day, but around 1860 it was overtaken by Saturday, and now it has also fallen behind Friday. As for the middle of the week—nobody cares about those days.

Here is a collection of “theory” bigrams: “number theory”, “set theory”, “group theory”, “graph theory”, “string theory”, “chaos theory”, “catastrophe theory”, “K theory”. Which of them had the highest frequency in the 20th century? (I guessed wrong.)

theory-1900-2008.png

(By the way, the graph above has a strange lump in the first decade of the 20th century. I think it’s a metadata error. The same anomalous blip turns up for many other search terms, such as “transistor” and “Internet”. Somewhere along the way, a batch of books from 2005 were recorded as having been published in 1905. (Geoff Nunberg at Language Log has pointed out many other problems with Google Books metadata.))

Here are a few magazines I have written for.

magazines-1900-2008.png

I’m afraid there’s a dismal pattern in this data: Publications reach their peak and begin declining soon after I arrive. (But the most recent plunge at Scientific American is not my doing.)

The graph below offers some tech trends. It appears we have officially entered the age of the gigabyte, and terabytes are coming on strong:

bytes-1970-2008.png

Next is a list of rare (even nonexistent) words: “uroboros”, “widdershins”, “abacot”, “baragouin”, “funambulist”, “futhark”, “gongorism”, “hapax legomenon”, “hypnopompic”:

rare-words-1800-2008.png

It’s curious that so many of these archaic-looking terms seem to be increasing in frequency.

At the other end of the spectrum are some very common words:

common-words-1800-2008.png

Again there’s a mild surprise here: The ordering of these words in the Google Books corpus apparently differs from that of the list usually cited.

•     •     •

I think the n-gram browser is great fun, but it provides access to only one aspect of the data set: We can plot the normalized frequency of specific n-grams as a function of time. There are many other kinds of questions one might ask about all these words. For starters, I’d like to invert the query function and find all n-grams that have a given frequency. (An obvious project is to compile a list of the commonest words and phrases in the corpus.)
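Compiling such a list amounts to summing each n-gram's yearly counts and sorting by the total. Here is a minimal sketch. It assumes each line of a downloaded file holds five tab-separated fields (n-gram, year, count, pages, books); the function name `total_counts` is made up, and the sample rows are invented for the demonstration.

```python
from collections import Counter

def total_counts(lines):
    """Sum the per-year counts for every n-gram in an iterable of lines."""
    totals = Counter()
    for line in lines:
        ngram, year, count, pages, books = line.rstrip("\n").split("\t")
        totals[ngram] += int(count)
    return totals

# Invented rows in the assumed format: ngram, year, count, pages, books.
rows = [
    "COFFEE\t1876\t288\t270\t167",
    "COFFEE\t1877\t300\t280\t170",
    "brake\t1774\t12\t10\t7",
]
totals = total_counts(rows)
print(totals.most_common(2))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a gigabyte-sized file never has to be held in memory all at once.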

At an even more elementary level: What is the distribution of word lengths in English?

Here’s another question: In the formula “I [verb] you”, what are the commonest verbs? The information needed to answer this question is present in the 3-gram files, but the Google viewer does not provide a means to get at it.
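With the 3-gram files on disk, the query is a simple filter. The sketch below assumes tab-separated lines whose first field is the 3-gram with its words separated by single spaces; that layout, the function name `middle_words` and the sample rows are all assumptions made for illustration. It also cannot tell verbs from other words in the middle slot.

```python
from collections import Counter

def middle_words(lines, first="I", last="you"):
    """Tally the middle word of every 3-gram of the form: first ? last."""
    tally = Counter()
    for line in lines:
        ngram, year, count, *rest = line.rstrip("\n").split("\t")
        words = ngram.split(" ")
        if len(words) == 3 and words[0] == first and words[2] == last:
            tally[words[1]] += int(count)
    return tally

# Invented rows: ngram, year, count, pages, books.
rows = [
    "I love you\t1950\t10\t10\t8",
    "I love you\t1951\t5\t5\t5",
    "I thank you\t1950\t7\t7\t6",
    "We love you\t1950\t3\t3\t3",
]
verbs = middle_words(rows)
print(verbs.most_common())
```

Weeding out non-verbs (as in "I and you") would need a part-of-speech list, which this sketch does not attempt.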

With appropriate software, the n-grams could also be used for language synthesis—generating highly plausible generic gibberish through a Markov process.
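A first-order version of such a generator needs nothing more than bigram counts. The sketch below is a toy: the dictionary of counts is invented, the names `build_chain` and `generate` are made up, and each next word is chosen with probability proportional to the count of the bigram it completes.

```python
import random
from collections import defaultdict

def build_chain(bigram_counts):
    """Map each first word to a list of (second word, count) pairs."""
    chain = defaultdict(list)
    for (first, second), count in bigram_counts.items():
        chain[first].append((second, count))
    return chain

def generate(chain, start, length, rng):
    """Random walk of at most `length` words, weighted by bigram counts."""
    words = [start]
    while len(words) < length and words[-1] in chain:
        successors, weights = zip(*chain[words[-1]])
        words.append(rng.choices(successors, weights=weights, k=1)[0])
    return " ".join(words)

# Invented bigram counts; real ones would come from the 2-gram files.
toy_counts = {("the", "cat"): 5, ("cat", "sat"): 3}
chain = build_chain(toy_counts)
sentence = generate(chain, "the", 5, random.Random(0))
print(sentence)
```

The walk stops early when it reaches a word with no recorded successor. Longer n-grams would give more plausible babble at the cost of a much larger table.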

Still another idea is to explore the history of spelling errors and typos over the centuries. Did the introduction of the qwerty keyboard lead to different kinds of errors? How about the later introduction of spell-checking software, and the concomitant decline of proofreading as a trade?

Furthermore, the n-gram files are full of numbers as well as words. Can we learn anything of economic interest by charting the prevalence of numbers that look like monetary amounts? (It’s easy to search for specific strings of digits, such as “$9.99”, but I would like to treat these values as numbers rather than sequences of digits, so that “0.99”, “.99” and “0.99000” would all be numerically equal.)
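One way to get numeric rather than textual equality is to parse each token as an exact decimal. A minimal sketch using Python's standard `decimal` module; the helper name `money_value` and its handling of dollar signs and thousands separators are choices made for the example, not anything prescribed by the data set.

```python
from decimal import Decimal, InvalidOperation

def money_value(token):
    """Parse a token such as '$9.99' or '.99' into a Decimal, or return None."""
    text = token.lstrip("$").replace(",", "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    # Decimal also accepts 'NaN' and 'Infinity'; treat those as non-numbers here.
    return value if value.is_finite() else None

print(money_value("0.99") == money_value(".99") == money_value("0.99000"))
print(money_value("$9.99") % 1)  # the cents part of a dollar amount
```

Decimal is used instead of float so that amounts like 0.99 are represented exactly and the cents can be extracted without rounding surprises.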

The way to carry out any of these projects is to download the full n-gram files and start writing software to explore them. I’ve taken my first step in that direction: I’ve bought a new disk drive to hold the data. Next I’m going to borrow a faster Internet connection to do the bulk of the downloading. After that, I must confess that I don’t have a lot of confidence about how best to proceed. The scale of these data sets is roughly a terabyte, which in this age of Big Data is no more than a warmup exercise. But it’s bigger than anything I’ve ever tried to grab hold of personally. I’ll be grateful for advice from those with more experience.

The n-gram lists come in fragments. The 1-grams, for example, are broken up into 10 files, each of which weighs about a gigabyte. Within each file the n-grams are listed alphabetically, but the distribution across files is random. (For example, file 0 has “gab”, “gaw” and “gay”, but “gag”, “gap” and “gas” are elsewhere.) Ordinarily, my first impulse would be to run a big merge-sort over these files, putting all the words in order. But maybe that’s exactly the wrong thing to do; keeping them scattered is a potential opportunity for multithread parallelism.
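Both options are cheap to try. Since each file is already sorted, a merge does not need a full sort, and many tallies can be computed file by file and combined afterward. The sketch below uses short lists to stand in for the files; the names `file_0` and `file_1` and the word-length tally are placeholders for illustration.

```python
import heapq
from collections import Counter

# Stand-ins for two alphabetically sorted files.
file_0 = ["gab", "gaw", "gay"]
file_1 = ["gag", "gap", "gas"]

# Option 1: lazily merge the sorted inputs into one sorted stream.
merged = list(heapq.merge(file_0, file_1))
print(merged)

# Option 2: tally each file independently, then add the partial results.
partials = [Counter(len(word) for word in f) for f in (file_0, file_1)]
combined = sum(partials, Counter())
print(combined)
```

In the second option each per-file tally is independent of the others, so the work could be handed to separate worker processes. In CPython, threads alone would not speed up this kind of CPU-bound counting.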

Even casual browsing through the raw n-gram files is quite a revelation. They really are raw. Word frequencies are the prototypical example of a Zipf distribution, which has notoriously long tails. It follows that when you choose an entry at random from one of the n-gram files, you are likely to be waaaaaaay out in the aberrant fringe, looking at symbol strings that you might or might not recognize as English words. Here are 20 lines selected at random from a 1-gram file:

1-GRAM          YEAR   COUNT   PAGES   BOOKS
lilywhites      1994       1       1       1
Carneri         2002      24      24      12
Thurh           1971       1       1       1
Elsee           1832       5       5       4
cFMte           2008      12      12      12
COFFEE          1876     288     270     167
APO3            1990       6       3       1
Odumbara        1963      15      15       6
connubialibus   1967       1       1       1
Pickerel        1900      65      54      34
fubje&ed        1757       6       6       5
nader           1971      34      29      19
fiyled          1782       1       1       1
existfed        1993       2       2       2
Soveit          2007       1       1       1
monongahela     1939       4       4       4
suffeiing       1851       6       6       6
brake           1774      12      10       7
ofBrasenose     1951       3       3       3
Horas           1798       3       3       2
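Drawing a uniform random sample of lines from a file too big to load can be done in a single pass with reservoir sampling. This is a generic sketch of that technique (often called Algorithm R), not necessarily how the 20 lines above were chosen; the function name `sample_lines` is made up.

```python
import random

def sample_lines(lines, k, rng):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = line     # replace with probability k / (i + 1)
    return reservoir

# Demonstration on a range of integers instead of a real file.
picked = sample_lines(range(1000), 20, random.Random(1))
print(len(picked))
```

Passing an open file object as `lines` samples its lines without ever holding more than k of them in memory.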

This is a pretty strange stew. There are obscure words, several proper names and abbreviations, as well as quite a few misspellings and nonstandard capitalizations. But the oddities that stand out most sharply have another origin: They are errors of optical character recognition. A word that was correct in the original document has gotten all fubje&ed up in the course of scanning. Of the 20 items listed above, 5 appear to be marred by OCR errors, so this is not just a matter of minor contamination. My best guess is that the word recorded as “fubje&ed” appeared in the 1757 book as:

subjected.png

with an initial long “s” that was assimilated to an “f” and a “ct” ligature that the OCR program confused with an ampersand.

The files are rife with such problems. Another case that caught my eye was “quicro”. As a Scrabble player, I ought to know that 18-point word! It turns out to be an OCR error for the Spanish verb “quiero”. And why is there a Spanish word in an English lexicon? Well, when I ran a search at Google Books, the top hit for “quicro” was Robert Southey’s Commonplace Book, published in 1850, which is properly classified as an English work even though it includes many long passages of Spanish and Portuguese verse.

Should we worry about such distractions? The real “quiero” is roughly 200 times as frequent as “quicro”, so the OCR error will not have a major statistical impact. Perhaps the most disturbing effect of the OCR noise is that it greatly lengthens the already-long tail of the frequency distribution. Suppose a word appears 100,000 times in the corpus. If the OCR process is 99.9 percent accurate, 100 instances of the word will be read incorrectly; in the worst case, each erroneous reading could be different, adding 100 spurious entries to the lexicon.

Cleaning up this mess looks like a major undertaking. (If it were easy, Google would have done it already.)

The long tail of the distribution has already been truncated to some extent: No n-gram is included in the data set unless it appears at least 40 times in the corpus of texts. For many kinds of analysis, an even higher threshold might be appropriate—excluding all terms that fall below 400 or maybe even 4,000 occurrences. But that still won’t eliminate all the OCR errors: “quickfilver” appears 5,411 times.

•     •     •

Before I go, I want to tell one more story, which is a mystery story. It all began when I was trying out a few seasonal phrases:

christmas-1850-2008.png

Note the distinctive and dramatic dip in all three frequencies starting in the late 40s or early 50s and continuing into the 70s, with an eventual strong recovery in the 90s. The frequencies fall by roughly 50 percent, then return to the neighborhood of their earlier peak. What’s going on here? Did the Grinch steal Christmas, and then give it back? Was there a backlash against Jingle Bells in the era of disco dancing and Watergate?

As a check on these results I tried a few terms associated with other holidays, unrelated to the midwinter madness. The same pattern emerged:

holidays-1850-2008.png

As a further control, I added still more phrases, this time with no obvious connection to any holiday whatever. All that the terms have in common is that they fall into roughly the same range of frequencies:

mystery-dip-1850-2008.png

The slump in the 60s and 70s is still visible in this augmented set of words. The dip looks rather like one of those episodes of mass extinction in the fossil record. And it provokes the same question: What caused it? And what ended it?

Of course not all words and phrases follow this pattern. Because the curves represent normalized frequencies—the count of each n-gram’s occurrences divided by the total number of all n-gram occurrences—a valley in any one n-gram’s frequency must be balanced by a peak for some other word or words. This fact leads to a hypothesis about the cause of the Great Postwar Santa Depression. The 50s and 60s were a period in which technical and scientific publishing bloomed, thereby diluting the share of printed books that would be likely to mention phrases such as “Santa Claus” and “vacuum cleaner”; instead we got volumes full of “asymptotic freedom”, “chymotrypsin inhibitor” and “field-effect transistor”.
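The dilution argument is just arithmetic on the normalized frequency. In the sketch below all numbers are invented: the phrase's raw count stays flat while the total volume of text doubles and then shrinks back, which is enough to halve and then restore the normalized curve.

```python
def normalized(count, total):
    """Occurrences of one n-gram divided by all n-gram occurrences that year."""
    return count / total

# Invented figures: a constant raw count against a changing total.
phrase_counts = [1000, 1000, 1000]
year_totals = [1_000_000, 2_000_000, 1_000_000]

frequencies = [normalized(c, t) for c, t in zip(phrase_counts, year_totals)]
print(frequencies)
```

A dip in a normalized curve therefore says nothing by itself about whether the phrase became rarer in absolute terms.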

The trouble with this notion is that the explosive growth of the sci/tech vocabulary did not end in the 1980s or 90s; thus it’s hard to understand how Santa has made such a spectacular comeback.

Here’s one wild guess at an explanation. Most of the books that Google has scanned come from university libraries, and so I wonder if we might be observing an artifact of the acquisition and retention policies of university librarians. Suppose that libraries tend to buy a broad cross-section of newly published titles, but when shelf space gets scarce, the half-life of Quantum Information Theory is longer than that of Frosty the Snowman. As a result of this selective culling, Santa books from the 60s have melted away, but those from the 90s have not yet disappeared. Could such practices account for the dip and the recovery? If you have a better idea, please share.