600613

Pick a number, N, then try searching for it on the web via Bing or Google (or maybe the leet version of Google). What can you expect to learn? I wasn’t quite sure of the answer, so I ran some experiments.

When N is a small positive integer—less than 100, say—the leading results tend to be mass-audience web pages that happen to display the numeral N in some prominent way, such as in a headline or a title. There are news stories (Packers 43, Falcons 37), TV stations (WXMI Fox 17), a few brand names (Motel 6), references to iconic events (9/11, Apollo 13), listings of Bible verses (Romans 3:23).

With somewhat larger integers—three or four digits—I see a lot of street addresses, area codes, tax forms, statutes and ordinances. With five-digit numbers, Zip codes become prominent. At six digits we enter the land of hex colors, accompanied by a baffling variety of part numbers, account numbers, serial numbers, patent numbers, error numbers, lottery numbers. With a search string of 8 to 10 digits, telephone directories dominate the results. Still further out on the number line, you eventually come to a numerical desert where Google and Bing usually come up empty.


To get a more quantitative sense of how numbers are distributed across the web, I decided to do some sampling. I randomly selected 2,000 positive integers of 1 to 12 decimal digits, and submitted them to Google as web search queries. To construct the query integers I started with 2,000 floating-point numbers selected uniformly at random (with replacement) from the range \(0 \le m \lt 12\). For each \(m\) I calculated \(N = \lfloor 10^{m}\rfloor\), then ran a Google search for N. The work was done by a Python script with a politeness pause of one second between queries. From the results of each search I extracted \(H(N)\), the number of hits, which Google reports near the top of the page. Here’s what I found, plotted on log-log scales:

Google hits 12 digit
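
The sampling step described above can be sketched in a few lines of Python. This is a reconstruction under stated assumptions, not the script that produced the graph: the function name `sample_query_integers` and the `seed` parameter are hypothetical, and the search requests themselves (with their one-second pause) are left out. Because \(m\) is uniform, the logarithms of the sampled integers are uniform too, so each digit length from 1 to 12 should receive roughly one-twelfth of the draws.

```python
import math
import random


def sample_query_integers(count=2000, max_digits=12, seed=None):
    """Draw integers N = floor(10^m) with m uniform on [0, max_digits)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(count):
        m = rng.random() * max_digits        # 0 <= m < max_digits
        sample.append(math.floor(10 ** m))   # 1 <= N < 10^max_digits
    return sample


queries = sample_query_integers(seed=1)

# Tally the draws by number of decimal digits.
digit_counts = {}
for n in queries:
    d = len(str(n))
    digit_counts[d] = digit_counts.get(d, 0) + 1

for d in sorted(digit_counts):
    print(d, digit_counts[d])
```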

What an intriguing graph! Over most of the range in the log-log plot, the broad trend looks very nearly linear. What does that mean? If the Google data accurately reflect the state of the web, and if my sampling of the data can be trusted, it means the number of web pages mentioning numbers of magnitude \(10^k\) is roughly constant for all k in the range from \(k = 2\) to \(k = 10\). I don’t mean to suggest that specific large numbers appear just as frequently as specific small numbers. That’s obviously untrue: A typical two- or three-digit number might be found on a billion web pages, whereas a specific nine- or ten-digit number is likely to appear on only one or two pages. But there are only 90 two-digit numbers, compared with 9 billion 10-digit numbers, so the overall number of pages in those two classes is approximately the same.

Here’s another way of saying the same thing: The product of \(N\) and \(H(N)\) is nearly constant, with a geometric mean of roughly \(7 \times 10^{10}\). An equivalent statement is that:

\[\log_{10}{N} + \log_{10}{H(N)} \approx 10.86.\]

You can visualize this fact without doing any arithmetic at all. Just print a series of \(N, H(N)\) tuples in a column and observe that the total number of digits in a tuple is seldom less than 11 or greater than 13.

    N,        H(N)
    96964835, 2120
    2048, 164000000
    476899618, 214
    96416, 374000
    75555964, 3020
    171494, 182000
    154045436, 2160
    1206, 112000000
    761088, 50200
    7500301034, 24
    13211445, 10900
    1289, 77000000
    1507549, 18100
    3488, 3330000
    7507624475, 10
    17592745, 2830
    1430187656, 30
    691, 265000000
    41670244642, 2
    326, 52900000
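
Both claims can be checked directly against the 20 rows listed above. The sketch below copies those pairs into a list (the variable names are mine) and computes the mean of \(\log_{10}{N} + \log_{10}{H(N)}\) along with the total digit count of each tuple. For this small sample the mean comes out near 10.9, close to the 10.86 figure for the full data set.

```python
import math

# (N, H(N)) pairs copied from the listing above.
pairs = [
    (96964835, 2120), (2048, 164000000), (476899618, 214),
    (96416, 374000), (75555964, 3020), (171494, 182000),
    (154045436, 2160), (1206, 112000000), (761088, 50200),
    (7500301034, 24), (13211445, 10900), (1289, 77000000),
    (1507549, 18100), (3488, 3330000), (7507624475, 10),
    (17592745, 2830), (1430187656, 30), (691, 265000000),
    (41670244642, 2), (326, 52900000),
]

# log10(N) + log10(H(N)) is the logarithm of the product N * H(N).
log_sums = [math.log10(n) + math.log10(h) for n, h in pairs]
mean_log_sum = sum(log_sums) / len(log_sums)

# Total number of decimal digits in each tuple.
digit_totals = [len(str(n)) + len(str(h)) for n, h in pairs]

print(f"mean of log10(N) + log10(H(N)): {mean_log_sum:.2f}")
print(f"digit totals range from {min(digit_totals)} to {max(digit_totals)}")
```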

Although the vast majority of the 2,000 data points lie near the 10.86 “main sequence” line, there are some outliers. One notable example is 25898913. Most numbers of this magnitude garner a few thousand hits on Google, but 25898913 gets 29,500,000. What could possibly make that particular sequence of digits 10,000 times more popular than most of its neighbors? Apparently it’s not just an isolated freak. About half the integers between 25898900 and 25898999 score well below 10,000 hits, and the other half score above 20 million. I can’t discern any trait that distinguishes the two classes of numbers. Sampling from other nearby ranges suggests that such anomalies are rare.


A straight line on a log-log plot often signals a power-law distribution. The classic example is the Zipfian distribution of word frequencies in natural-language text, where the kth most common word can be expected to appear with frequency proportional to \(k^{-\alpha}\), with \(\alpha \approx 1\). Does a similar rule hold for integers on the web? Maybe. I tried fitting a power law to the data with the powerlaw Python package from Jeff Alstott et al. The calculated value of \(\alpha\) was about 1.17, which seems plausible enough, but other diagnostic indicators were not so clear. Identifying power laws in empirical data is notoriously tricky, and I don’t have much confidence in my ability to get it right, even with the help of a slick code library.
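
As a much cruder check than the powerlaw package performs, one can fit a straight line to \(\log_{10}{H(N)}\) versus \(\log_{10}{N}\) by ordinary least squares. The sketch below does this for the 20 sample rows listed earlier. Note the swap in technique: this is plain regression on log-transformed data, not the maximum-likelihood fitting used by the package, and the slope it yields is not directly comparable to the \(\alpha\) reported there. A slope close to \(-1\) is simply what a roughly constant product \(N \times H(N)\) looks like on log-log axes.

```python
import math

# (N, H(N)) pairs copied from the sample listing earlier in the article.
pairs = [
    (96964835, 2120), (2048, 164000000), (476899618, 214),
    (96416, 374000), (75555964, 3020), (171494, 182000),
    (154045436, 2160), (1206, 112000000), (761088, 50200),
    (7500301034, 24), (13211445, 10900), (1289, 77000000),
    (1507549, 18100), (3488, 3330000), (7507624475, 10),
    (17592745, 2830), (1430187656, 30), (691, 265000000),
    (41670244642, 2), (326, 52900000),
]

xs = [math.log10(n) for n, _ in pairs]
ys = [math.log10(h) for _, h in pairs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Ordinary least-squares fit of y = slope * x + intercept.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With only 20 points the fit says little about whether the distribution is truly a power law; it only confirms that the sample rows follow the same trend as the graph.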

I’m actually surprised that the pattern in the graph above looks so Zipfian, because the data being plotted don’t really represent the frequencies of the numbers. Google’s hit count \(H(N)\) is an approximation to the number of web pages on which \(N\) appears, not the number of times that \(N\) appears on the web. Those two figures can be expected to differ because a page that mentions \(N\) once may well mention it more than once. For example, a page about the movie 42 has eight occurrences of 42, and a page about the movie 23 has 13 occurrences of 23. (By the way, what’s up with all these numeric movie titles?)

Another distorting factor is that Google apparently implements some sort of substring matching algorithm for digit strings. If you search for 5551212, the results will include pages that mention 8005551212 and 2125551212, and so on. I’m not sure how far they carry this practice. Does a web page that includes the number 1234 turn up in search results for all nonempty substrings: 1234, 123, 234, 12, 23, 34, 1, 2, 3, 4? That kind of multiple counting would greatly inflate the frequencies of numbers in the Googleverse.
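
To see how quickly that kind of multiple counting would grow, here is a small sketch that enumerates the distinct nonempty substrings of a digit string. The function name `digit_substrings` is hypothetical, and nothing here is a claim about how Google actually matches digit strings. A string of \(d\) digits has at most \(d(d+1)/2\) such substrings, which gives the 10 listed above for 1234.

```python
def digit_substrings(digits):
    """Return every distinct nonempty contiguous substring of a digit string."""
    found = set()
    for start in range(len(digits)):
        for end in range(start + 1, len(digits) + 1):
            found.add(digits[start:end])
    # Longest first, then lexicographic order within each length.
    return sorted(found, key=lambda s: (-len(s), s))


print(digit_substrings("1234"))
# ['1234', '123', '234', '12', '23', '34', '1', '2', '3', '4']
```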

It’s also worth noting that Google does some preprocessing of numeric data both in web pages and in search queries. Commas, hyphens, and parentheses are stripped out (but not periods/decimal points). Thus searches for 5551212, 555-1212, and 5,551,212 all seem to elicit identical results. (Enclosing the search string in quotation marks suppresses this behavior, but I didn’t realize that until late in the writing of this article, so all the results reported here are for unquoted search queries.)
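
The observed behavior can be imitated with a simple character filter. This is only a model of what the searches appear to do, based on the three examples above; the function name `normalize_numeric_query` is hypothetical, and Google's actual preprocessing is not documented here.

```python
# Characters that appear to be stripped; periods are deliberately kept.
_STRIPPED = str.maketrans("", "", ",-()")


def normalize_numeric_query(query):
    """Remove commas, hyphens, and parentheses from a query string."""
    return query.translate(_STRIPPED)


for q in ["5551212", "555-1212", "5,551,212", "17.30"]:
    print(q, "->", normalize_numeric_query(q))
```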


In the graph above, the linear trend seems to extend all the way to the lower righthand corner, but not to the upper lefthand corner. If we take seriously the inferred equation \(N \times H(N) = 7 \times 10^{10}\), then the number of hits for \(N = 1\) should obviously be \(7 \times 10^{10}\). In fact, searches for integers in the range \(1 \le N \le 25\) generally return far fewer hits. Many of the results are clustered around \(10^{7}\), roughly three or four orders of magnitude smaller than would be expected from the trend line.

To investigate this discrepancy, I ran another series of Google searches, recording the number of hits for each integer from 0 through 100. Note that in this graph the y axis is logarithmic but the x axis is linear.

Google hits 0 100

There’s no question that something is depressing the abundance of most numbers less than 25. The abruptness of the dip suggests that this is an artifact of an algorithm or policy imposed by the search engine, rather than a property of the underlying distribution. I have a guess about what’s going on. Small numbers may be so common that they are treated as “stop words,” like “a,” “the,” “and,” etc., and ignored in most searches. Perhaps the highest-frequency numbers are counted only when they appear in an <h1> or <h2> heading, not when they’re in ordinary text.

But much remains unexplained. Why do 2, 3, 4, and 5 escape the too-many-to-count filter? Same question for 23. What’s up with 25 and 43, which stand more than 10 times taller than their nearest neighbors? Finally, in this run of 101 Google searches, the hit counts for small integers are mostly clustered around \(10^6\), whereas the earlier series of 2,000 random searches produced a big clump at \(10^7\). In that earlier run I also noticed that searching repeatedly for the same \(N\) could yield different values of \(H(N)\), even when the queries were submitted in the space of a few seconds. For example, with \(N=1\) I saw \(H(N)\) values ranging from 10,400,000 to 1,550,000,000. Presumably, the different values are coming from different servers or different data centers in Google’s distributed database.

I was curious enough about the inconsistencies to run another batch of random searches. In the graph below the 2,000 data points from the first search are light blue and the 2,000 new points are dark blue.

Google hits 12 digit combo

Over most of the range, the two data sets are closely matched, but there’s a conspicuous change in the region between \(10^2\) and \(10^4\). In the earlier run, numbers in that size range were split into two populations, with frequencies differing by a factor of 10. I was unable to identify any property that distinguishes members of the two populations; they are not, for example, just odd and even numbers. In the new data, the lower branch of the curve has disappeared. Now there is a sharp discontinuity at \(N = 10^4\), where typical frequency falls by a factor of 10. I have no idea what this is all about, but I strongly suspect it’s something in the Google algorithms, not in the actual distribution of numbers on the web.


The limitations of string matching—or even regular-expression matching—are more troublesome when you go beyond searching for simple positive integers. I’ve hardly begun to explore this issue, but the following table hints at one aspect of the problem.

    N            top hit
    17.3         HP Anodized Silver 17.3″ Pavilion
    17.30        17.30j Syllabus - MIT
    17.300       Chapter 17.300 COMPLIANCE
    17.3000      Map of Latitude: 17.3000, Longitude: -62.7333
    17.30000     41 25 0 0 2.000000 4.000000 6.000000 8.000000
    17.300000    17.300000 [initially -35.600000] gi_24347982 (+) RNA

Search queries that are mathematically equal (when interpreted as decimal representations of real numbers) yield quite different results. And 4.999… is definitely not equal to 5.000… in the world of Google web search.
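
The contrast between string identity and numeric equality is easy to demonstrate with Python's standard `decimal` module. The six query strings from the table are all different as text, yet they collapse to a single value when parsed as decimal numbers.

```python
from decimal import Decimal

queries = ["17.3", "17.30", "17.300", "17.3000", "17.30000", "17.300000"]

# As strings, all six queries are distinct.
distinct_strings = set(queries)

# Parsed as decimals, trailing zeros do not change the value.
distinct_values = {Decimal(q) for q in queries}

print(len(distinct_strings), "distinct strings")
print(len(distinct_values), "distinct numeric value(s)")
```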

It gets even worse with fractions. A search for 7/3 brought me a calculator result correctly announcing that “7/3 = 2.33333333333”, but it also gave me articles headlined “7^3 - Wolfram Alpha”, “Matthew 7:3”, “49ers take 7-3 lead”, and “Hokua 7’3″ LE - Naish”. (Enclosing the search term in quotation marks doesn’t help in this case.)


Before closing the book on this strange numerical diversion that has entertained me for the past couple of weeks, I want to comment on one more curious discovery. If you run enough searches for large numbers, you’ll eventually stumble on web sites such as numberworld, numberempire, numbersbase, each-number, every-number, all-numbers, integernumber, numbersaplenty, and numberopedia. A few of these sites appear to be created and curated by genuine number-lore enthusiasts, but others have a whiff of sleazy search-engine baiting. (For that reason I’m not linking to any of them.)

Here’s part of a screen capture from Numbers Aplenty, which is one of the more interesting sites:

NumbersAplenty screen

Each of the numbers displayed on the page is a link to another Numbers Aplenty page, and the site is apparently equipped to display such a page for any positive integer less than \(10^{16}\). A few years ago, Google reported that they had indexed a trillion unique URLs on the world wide web. Evidently they hadn’t yet worked their way through the 10,000 trillion URLs at Numbers Aplenty. (But I’m pretty sure the server doesn’t have \(10^{16}\) HTML files stored on disk, patiently waiting for someone to request the information.)


And, finally, a trivia question: What is the smallest positive integer for which a Google search returns zero results? The smallest I have recorded so far is 10,041,295,923. (Of course that could change after the Googlebot indexes this page.) Can anyone find an example with 10 or fewer digits?


Update 2014-12-22. Commenter Samuel Bierwagen wrote:

The hits number on the first page is very inaccurate, frequently off by several orders of magnitude. To get better results you have to go to the second or third page.

I’ve now given this idea a try. Whenever the estimated hit count is at least 1 million, I repeat the search, appending “start=20” to the query string. This has the effect of requesting the third page of results (i.e., results 20 through 29). Here’s the outcome:

Google hits 12 digit p3
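
The page-three request can be sketched as a small URL builder. Only the `start=20` parameter comes from the description above; the base URL, the `q` parameter name, and the function name `search_url` are assumptions made for illustration, and the page size of 10 is inferred from "results 20 through 29."

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # inferred: page three holds results 20 through 29


def search_url(n, page=1, base="https://www.google.com/search"):
    """Build a search URL for integer n, optionally asking for a later page.

    The base URL and the "q" parameter are assumptions, not taken
    from the article.
    """
    params = {"q": str(n)}
    if page > 1:
        params["start"] = (page - 1) * RESULTS_PER_PAGE
    return f"{base}?{urlencode(params)}"


print(search_url(600613))
print(search_url(600613, page=3))
```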

Light blue dots are from earlier surveys (4,000 points in all). Dark blue dots are from the new survey, with the page-three request installed. There’s a dramatic change in the hit counts \(H(N)\) for \(N \lt 100\). For these small \(N\), Google’s first-page hit count fluctuates wildly and is often near \(10^{7}\). The third-page results are higher and much more consistent. Indeed, all \(N \lt 14\) returned exactly the same hit count: 25,270,000,000. This uniformity suggests that we’re still seeing some sort of filtering in the results–I suspect Google may be trying to keep secret the overall size of their index–but at least the trend line is now monotonic.

In another comment, Brian J. Peterson mentions seeing HTTP results with an error code 503 (service temporarily unavailable). I had not encountered any such errors in my earlier search series, but I did see some in this latest run (86 errors out of 2,000 searches). My best guess is that a request for page 3 may occasionally take more than 1 second, so that the transaction hasn’t completed when the next search is initiated.
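
If the 503 responses really are a timing problem, one common remedy is to retry the request after a longer pause. The sketch below is not what my script did; it is a generic retry loop with a doubling delay, written so that the function doing the actual HTTP request (`fetch`, hypothetical here) is passed in as a parameter. A stub stands in for the network so the example runs on its own.

```python
import time


def fetch_with_retry(fetch, url, retries=3, pause=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying on HTTP 503.

    The pause doubles after each failed attempt. After the last
    allowed attempt, the final response is returned as is.
    """
    delay = pause
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < retries:
            sleep(delay)
            delay *= 2
    return status, body


# Stub standing in for a real HTTP client: fails twice, then succeeds.
_responses = iter([(503, ""), (503, ""), (200, "results page")])


def fake_fetch(url):
    return next(_responses)


pauses = []
result = fetch_with_retry(fake_fetch, "http://example.invalid/", sleep=pauses.append)
print(result, pauses)
```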

Meanwhile, I have learned of another study of number prevalence on the web with a much better data source than Google hit counts. In a short paper from the 2014 World Wide Web companion conference, Willem Robert van Hage and two colleagues used the Common Crawl web archive to measure number frequencies. They looked at real numbers, not just integers, and I’m not sure how to compare their results with mine. My main response to seeing this work is that the Common Crawl is an amazing resource–each of us can build a Google of our own–and I want to spend some time next year playing with it.
