Zipfy n-grams

In the 1930s and '40s, George Kingsley Zipf studied word frequencies in several languages and came up with a general observation: If you sort all the words from commonest to rarest, the frequency of the word at rank r is …
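The sorting step in that observation is easy to try out. Here is a minimal sketch, assuming a plain-text sample and a deliberately crude tokenizer; the function name `rank_frequency` and the sample sentence are made up for illustration and are not taken from the post. It only builds the rank-versus-frequency table and makes no claim about the shape of the resulting curve.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, commonest word first.

    Hypothetical helper for illustration: tokenization is a simple
    lowercase regex match, which is cruder than what a real corpus
    study would use.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


sample = "the cat and the dog and the bird"  # toy input, not real data
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Plotting count against rank on log-log axes, for a much larger text, is the usual way to inspect how frequency falls off with rank.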

The Library of Babble

The new issue of American Scientist is out, both on newsstands and on the web. My "Computing Science" column takes up a topic I've already written about here on bit-player: the huge corpus of "n-grams" extracted from the Google Books …

All those books that Google has been scanning for the past ten years are surprisingly rich in numbers as well as words. The Google Books data set released last December by a Harvard-Google team includes (by my count) 9,620,835,344 occurrences of 458,794 distinct …

Googling the lexicon

In 1860, when work began on the New English Dictionary on Historical Principles (better known today as the Oxford English Dictionary), the basic plan was to build an index to all of English literature. Volunteer readers would pore over texts …

