Pick a number, N, then try searching for it on the web via Bing or Google (or maybe the leet version of Google). What can you expect to learn? I wasn’t quite sure of the answer, so I ran some experiments.
When N is a small positive integer—less than 100, say—the leading results tend to be mass-audience web pages that happen to display the numeral N in some prominent way, such as in a headline or a title. There are news stories (Packers 43, Falcons 37), TV stations (WXMI Fox 17), a few brand names (Motel 6), references to iconic events (9/11, Apollo 13), listings of Bible verses (Romans 3:23).
With somewhat larger integers—three or four digits—I see a lot of street addresses, area codes, tax forms, statutes and ordinances. With five-digit numbers, Zip codes become prominent. At six digits we enter the land of hex colors, accompanied by a baffling variety of part numbers, account numbers, serial numbers, patent numbers, error numbers, lottery numbers. With a search string of 8 to 10 digits, telephone directories dominate the results. Still further out on the number line, you eventually come to a numerical desert where Google and Bing usually come up empty.
To get a more quantitative sense of how numbers are distributed across the web, I decided to do some sampling. I randomly selected 2,000 positive integers of 1 to 12 decimal digits, and submitted them to Google as web search queries.
What an intriguing graph! Over most of the range in the log-log plot, the broad trend looks very nearly linear. What does that mean? If the Google data accurately reflect the state of the web, and if my sampling of the data can be trusted, it means the number of web pages mentioning numbers of magnitude \(10^k\) is roughly constant for all k in the range from \(k = 2\) to \(k = 10\). I don’t mean to suggest that specific large numbers appear just as frequently as specific small numbers. That’s obviously untrue: A typical two- or three-digit number might be found on a billion web pages, whereas a specific nine- or ten-digit number is likely to appear on only one or two pages. But there are only 90 two-digit numbers, compared with 9 billion 10-digit numbers, so the overall number of pages in those two classes is approximately the same.
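A rough worked version of this counting argument may help. It assumes, purely for illustration, that hit counts fall off as \(H(N) \approx C/N\) for some constant \(C\) (the symbol \(C\) is shorthand introduced here, not a measured quantity). Summing over all \(k\)-digit numbers then gives

\[\sum_{N=10^{k-1}}^{10^{k}-1} \frac{C}{N} \;\approx\; C \int_{10^{k-1}}^{10^{k}} \frac{dN}{N} \;=\; C \ln 10 \;\approx\; 2.3\,C,\]

which does not depend on \(k\). Under that assumption every decade of the number line contributes about the same total, even though each individual number in a higher decade is far rarer. The integral approximation is loose for the very first decade and improves as \(k\) grows.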
Here’s another way of saying the same thing: The product of \(N\) and \(H(N)\) is nearly constant, with a geometric mean of roughly \(7 \times 10^{10}\). An equivalent statement is that:
\[\log_{10}{N} + \log_{10}{H(N)} \approx 10.86.\]
You can visualize this fact without doing any arithmetic at all. Just print a series of \(N, H(N)\) tuples in a column and observe that the total number of digits in a tuple is seldom less than 11 or greater than 13.
N, H(N)
96964835, 2120
2048, 164000000
476899618, 214
96416, 374000
75555964, 3020
171494, 182000
154045436, 2160
1206, 112000000
761088, 50200
7500301034, 24
13211445, 10900
1289, 77000000
1507549, 18100
3488, 3330000
7507624475, 10
17592745, 2830
1430187656, 30
691, 265000000
41670244642, 2
326, 52900000
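The digit-counting observation can also be checked mechanically. The following is a minimal sketch that uses only the 20 pairs listed above; the helper name `digit_total` is made up for this illustration.

```python
import math

# The (N, H(N)) pairs from the column above.
pairs = [
    (96964835, 2120), (2048, 164000000), (476899618, 214), (96416, 374000),
    (75555964, 3020), (171494, 182000), (154045436, 2160), (1206, 112000000),
    (761088, 50200), (7500301034, 24), (13211445, 10900), (1289, 77000000),
    (1507549, 18100), (3488, 3330000), (7507624475, 10), (17592745, 2830),
    (1430187656, 30), (691, 265000000), (41670244642, 2), (326, 52900000),
]

def digit_total(n, h):
    """Total number of decimal digits in N and H(N) together."""
    return len(str(n)) + len(str(h))

totals = [digit_total(n, h) for n, h in pairs]

# The mean of log10(N) + log10(H(N)) is the log of the geometric mean of N * H(N).
log_sums = [math.log10(n) + math.log10(h) for n, h in pairs]
mean_log_sum = sum(log_sums) / len(log_sums)

print("digit totals range from", min(totals), "to", max(totals))
print("mean of log10(N) + log10(H(N)):", round(mean_log_sum, 2))
```

With only 20 points the mean need not match the figure quoted for the full 2,000-point sample, but it should land in the same neighborhood.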
Although the vast majority of the 2,000 data points lie near the 10.86 “main sequence” line, there are some outliers. One notable example is 25898913. Most numbers of this magnitude garner a few thousand hits on Google, but 25898913 gets 29,500,000. What could possibly make that particular sequence of digits 10,000 times more popular than most of its neighbors? Apparently it’s not just an isolated freak. About half the integers between 25898900 and 25898999 score well below 10,000 hits, and the other half score above 20 million. I can’t discern any trait that distinguishes the two classes of numbers. Sampling from other nearby ranges suggests that such anomalies are rare.
A straight line on a log-log plot often signals a power-law distribution. The classic example is the Zipfian distribution of word frequencies in natural-language text, where the kth most common word can be expected to appear with frequency proportional to \(k^{-\alpha}\), with \(\alpha \approx 1\). Does a similar rule hold for integers on the web? Maybe. I tried fitting a power law to the data with the powerlaw Python package from Jeff Alstott et al. The calculated value of \(\alpha\) was about 1.17, which seems plausible enough, but other diagnostic indicators were not so clear. Identifying power laws in empirical data is notoriously tricky, and I don’t have much confidence in my ability to get it right, even with the help of a slick code library.
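For a cruder check that needs no external library, one can fit a straight line to \(\log_{10} H(N)\) versus \(\log_{10} N\) by ordinary least squares. This is emphatically not the maximum-likelihood procedure that the powerlaw package performs, and the slope it yields is not the same quantity as the \(\alpha\) reported above. The sketch below is only a sanity test of the "straight line" impression, run on the 20 sample pairs listed earlier in this article; the function name `loglog_slope` is hypothetical.

```python
import math

# The 20 (N, H(N)) sample pairs listed earlier in the article.
pairs = [
    (96964835, 2120), (2048, 164000000), (476899618, 214), (96416, 374000),
    (75555964, 3020), (171494, 182000), (154045436, 2160), (1206, 112000000),
    (761088, 50200), (7500301034, 24), (13211445, 10900), (1289, 77000000),
    (1507549, 18100), (3488, 3330000), (7507624475, 10), (17592745, 2830),
    (1430187656, 30), (691, 265000000), (41670244642, 2), (326, 52900000),
]

def loglog_slope(points):
    """Least-squares slope and intercept of log10(H) against log10(N)."""
    xs = [math.log10(n) for n, _ in points]
    ys = [math.log10(h) for _, h in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = loglog_slope(pairs)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

A slope close to \(-1\) is what a nearly constant product \(N \times H(N)\) would produce. Least-squares fits on log-log axes are known to be unreliable for estimating power-law exponents, so this is a plausibility check and nothing more.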
I’m actually surprised that the pattern in the graph above looks so Zipfian, because the data being plotted don’t really represent the frequencies of the numbers. Google’s hit count \(H(N)\) is an approximation to the number of web pages on which \(N\) appears, not the number of times that \(N\) appears on the web. Those two figures can be expected to differ because a page that mentions \(N\) once may well mention it more than once. For example, a page about the movie 42 has eight occurrences of 42, and a page about the movie 23 has 13 occurrences of 23. (By the way, what’s up with all these numeric movie titles?)
Another distorting factor is that Google apparently implements some sort of substring matching algorithm for digit strings. If you search for 5551212, the results will include pages that mention 8005551212 and 2125551212, and so on. I’m not sure how far they carry this practice. Does a web page that includes the number 1234 turn up in search results for all nonempty substrings: 1234, 123, 234, 12, 23, 34, 1, 2, 3, 4? That kind of multiple counting would greatly inflate the frequencies of numbers in the Googleverse.
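To get a feel for how much multiple counting that would imply, here is a small sketch that enumerates the nonempty contiguous substrings of a digit string. It says nothing about what Google actually does; the function name `digit_substrings` is invented for this example.

```python
def digit_substrings(s):
    """Return every nonempty contiguous substring of s, longest first."""
    n = len(s)
    return [
        s[i:i + length]
        for length in range(n, 0, -1)      # substring lengths n, n-1, ..., 1
        for i in range(n - length + 1)     # every starting position
    ]

subs = digit_substrings("1234")
print(subs)
print(len(subs))
```

A string of \(n\) digits has \(n(n+1)/2\) such substrings when each position is counted separately, so a single 10-digit phone number could in principle be matched by 55 different queries.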
It’s also worth noting that Google does some preprocessing of numeric data both in web pages and in search queries. Commas, hyphens, and parentheses are stripped out (but not periods/decimal points). Thus searches for 5551212, 555-1212, and 5,551,212 all seem to elicit identical results. (Enclosing the search string in quotation marks suppresses this behavior, but I didn’t realize that until late in the writing of this article, so all the results reported here are for unquoted search queries.)
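The observed behavior is consistent with a normalization step along the following lines. This is a guess at the effect, not a description of Google's code, and `normalize_query` is a hypothetical name.

```python
# Characters that appear to be dropped from unquoted numeric queries.
# Periods are deliberately left alone, matching the behavior described above.
_STRIPPED = str.maketrans("", "", ",-()")

def normalize_query(text):
    """Remove commas, hyphens, and parentheses from a query string."""
    return text.translate(_STRIPPED)

for query in ["5551212", "555-1212", "5,551,212", "17.30"]:
    print(repr(query), "->", repr(normalize_query(query)))
```

Under this model the first three queries collapse to the same string, while a decimal such as 17.30 passes through unchanged.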
In the graph above, the linear trend seems to extend all the way to the lower righthand corner, but not to the upper lefthand corner. If we take seriously the inferred equation \(N \times H(N) = 7 \times 10^{10}\), then the number of hits for \(N = 1\) should obviously be \(7 \times 10^{10}\). In fact, searches for integers in the range \(1 \le N \le 25\) generally return far fewer hits. Many of the results are clustered around \(10^{7}\), four or five orders of magnitude smaller than would be expected from the trend line.
To investigate this discrepancy, I ran another series of Google searches, recording the number of hits for each integer from 0 through 100.
There’s no question that something is depressing the abundance of most numbers less than 25. The abruptness of the dip suggests that this is an artifact of an algorithm or policy imposed by the search engine, rather than a property of the underlying distribution. I have a guess about what’s going on. Small numbers may be so common that they are treated as “stop words,” like “a,” “the,” “and,” etc., and ignored in most searches. Perhaps the highest-frequency numbers are counted only when they appear in an <h1> or <h2> heading, not when they’re in ordinary text.
But much remains unexplained. Why do 2, 3, 4, and 5 escape the too-many-to-count filter? Same question for 23. What’s up with 25 and 43, which stand more than 10 times taller than their nearest neighbors? Finally, in this run of 101 Google searches, the hit counts for small integers are mostly clustered around \(10^6\), whereas the earlier series of 2,000 random searches produced a big clump at \(10^7\). In that earlier run I also noticed that searching repeatedly for the same \(N\) could yield different values of \(H(N)\), even when the queries were submitted in the space of a few seconds. For example, with \(N=1\) I saw \(H(N)\) values ranging from 10,400,000 to 1,550,000,000. Presumably, the different values are coming from different servers or different data centers in Google’s distributed database.
I was curious enough about the inconsistencies to run another batch of random searches. In the graph below the 2,000 data points from the first search are light blue and the 2,000 new points are dark blue.
Over most of the range, the two data sets are closely matched, but there’s a conspicuous change in the region between \(10^2\) and \(10^4\). In the earlier run, numbers in that size range were split into two populations, with frequencies differing by a factor of 10. I was unable to identify any property that distinguishes members of the two populations; they are not, for example, just odd and even numbers. In the new data, the lower branch of the curve has disappeared. Now there is a sharp discontinuity at \(N = 10^4\), where typical frequency falls by a factor of 10. I have no idea what this is all about, but I strongly suspect it’s something in the Google algorithms, not in the actual distribution of numbers on the web.
The limitations of string matching—or even regular-expression matching—are more troublesome when you go beyond searching for simple positive integers. I’ve hardly begun to explore this issue, but the following table hints at one aspect of the problem.
N | top hit |
---|---|
17.3 | HP Anodized Silver 17.3″ Pavilion |
17.30 | 17.30j Syllabus - MIT |
17.300 | Chapter 17.300 COMPLIANCE |
17.3000 | Map of Latitude: 17.3000, Longitude: -62.7333 |
17.30000 | 41 25 0 0 2.000000 4.000000 6.000000 8.000000 |
17.300000 | 17.300000 [initially -35.600000] gi_24347982 (+) RNA |
Search queries that are mathematically equal (when interpreted as decimal representations of real numbers) yield quite different results. And 4.999… is definitely not equal to 5.000… in the world of Google web search.
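The contrast between string identity and numeric equality is easy to demonstrate with Python's standard `decimal` module. This small sketch uses the six query strings from the table above.

```python
from decimal import Decimal

queries = ["17.3", "17.30", "17.300", "17.3000", "17.30000", "17.300000"]

# As strings, all six queries are different.
distinct_strings = set(queries)

# As decimal numbers, they all compare (and hash) equal.
distinct_values = {Decimal(q) for q in queries}

print(len(distinct_strings), "distinct strings")
print(len(distinct_values), "distinct numeric value(s)")
```

A search engine that matches text sees the first count; a reader who interprets the queries as real numbers sees the second.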
It gets even worse with fractions. A search for 7/3 brought me a calculator result correctly announcing that “7/3 = 2.33333333333”, but it also gave me articles headlined “7^3 - Wolfram Alpha”, “Matthew 7:3”, “49ers take 7-3 lead”, and “Hokua 7’3″ LE - Naish”. (Enclosing the search term in quotation marks doesn’t help in this case.)
Before closing the book on this strange numerical diversion that has entertained me for the past couple of weeks, I want to comment on one more curious discovery. If you run enough searches for large numbers, you’ll eventually stumble on web sites such as numberworld, numberempire, numbersbase, each-number, every-number, all-numbers, integernumber, numbersaplenty, and numberopedia. A few of these sites appear to be created and curated by genuine number-lore enthusiasts, but others have a whiff of sleazy search-engine baiting. (For that reason I’m not linking to any of them.)
Here’s part of a screen capture from Numbers Aplenty, which is one of the more interesting sites:
Each of the numbers displayed on the page is a link to another Numbers Aplenty page, and the site is apparently equipped to display such a page for any positive integer less than \(10^{16}\). A few years ago, Google reported that they had indexed a trillion unique URLs on the world wide web. Evidently they hadn’t yet worked their way through the 10,000 trillion URLs at Numbers Aplenty. (But I’m pretty sure the server doesn’t have \(10^{16}\) HTML files stored on disk, patiently waiting for someone to request the information.)
And, finally, a trivia question: What is the smallest positive integer for which a Google search returns zero results? The smallest I have recorded so far is 10,041,295,923. (Of course that could change after the Googlebot indexes this page.) Can anyone find an example with 10 or fewer digits?
Update 2014-12-22. Commenter Samuel Bierwagen wrote:
The hits number on the first page is very inaccurate, frequently off by several orders of magnitude. To get better results you have to go to the second or third page.
I’ve now given this idea a try. Whenever the estimated hit count is at least 1 million, I repeat the search, appending “start=20” to the query string. This has the effect of requesting the third page of results (i.e., results 20 through 29). Here’s the outcome:
Light blue dots are from earlier surveys (4,000 points in all). Dark blue dots are from the new survey, with the page-three request installed. There’s a dramatic change in the hit counts \(H(N)\) for \(N \lt 100\). For these small \(N\), Google’s first-page hit count fluctuates wildly and is often near \(10^{7}\). The third-page results are higher and much more consistent. Indeed, all \(N \lt 14\) returned exactly the same hit count: 25,270,000,000. This uniformity suggests that we’re still seeing some sort of filtering in the results–I suspect Google may be trying to keep secret the overall size of their index–but at least the trend line is now monotonic.
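The page-three request can be sketched as a small URL-building helper. The `start=20` parameter and the 1-million threshold come from the description above; the endpoint URL, the `q` parameter name, and the function name `search_url` are assumptions made for this illustration, and the sketch builds the URL without sending any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"  # assumed endpoint
THRESHOLD = 1_000_000                       # repeat the search at or above this count

def search_url(n, estimated_hits=0):
    """Build a query URL for n, asking for page three if the estimate is large."""
    params = {"q": str(n)}                  # "q" is an assumed parameter name
    if estimated_hits >= THRESHOLD:
        params["start"] = 20                # results 20 through 29
    return BASE_URL + "?" + urlencode(params)

print(search_url(7507624475, estimated_hits=10))
print(search_url(42, estimated_hits=5_000_000))
```

The first call leaves the query alone because the estimate is small; the second appends the page-three parameter.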
In another comment, Brian J. Peterson mentions seeing HTTP results with an error code 503 (service temporarily unavailable). I had not encountered any such errors in my earlier search series, but I did see some in this latest run (86 errors out of 2,000 searches). My best guess is that a request for page 3 may occasionally take more than 1 second, so that the transaction hasn’t completed when the next search is initiated.
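One common way to cope with intermittent 503 responses is to retry with a growing pause. The sketch below is not the script used for these searches; it is a generic pattern, with the fetching function passed in as a parameter so that no particular HTTP library is assumed. The names `fetch_with_retry`, `fetch`, and `sleep` are all hypothetical.

```python
import time

def fetch_with_retry(fetch, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), which returns (status, body); retry on HTTP 503.

    The pause doubles after each failed attempt: 1 s, 2 s, 4 s, ...
    """
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < max_tries - 1:         # no point pausing after the last try
            sleep(delay)
            delay *= 2
    return status, body

# Usage with a stand-in fetcher that fails twice and then succeeds.
responses = iter([(503, ""), (503, ""), (200, "ok")])
pauses = []
result = fetch_with_retry(lambda url: next(responses), "http://example.invalid/",
                          sleep=pauses.append)
print(result, pauses)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the pause schedule easy to inspect.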
Meanwhile, I have learned of another study of number prevalence on the web with a much better data source than Google hit counts. In a short paper from the 2014 World Wide Web companion conference, Willem Robert van Hage and two colleagues used the Common Crawl web archive to measure number frequencies. They looked at real numbers, not just integers, and I’m not sure how to compare their results with mine. My main response to seeing this work is that the Common Crawl is an amazing resource–each of us can build a Google of our own–and I want to spend some time next year playing with it.
Claiming and subsequently ruining 10,041,295,922 as a resultless int.
I’ll take 76,237,802,445,370,263
10041295924
I claim 10041295901.
The hits number on the first page is very inaccurate, frequently off by several orders of magnitude. To get better results you have to go to the second or third page.
This is a very good suggestion (also mentioned by commenters at Hacker News). Maybe I should try running the script again with an added provision: If the count exceeds some threshold, search again with “pagewanted=3” added to the query.
It’s well known, at least among linguists, that Google hit counts cannot be relied on: they are not even consistent (A B can have a higher hit count than A, for example), never mind correct. Nothing whatever should be concluded from hit counts larger than 1000 except the roughest of indications of the popularity of a term.
Well, there is indeed a lot of noise in the hit counts; that’s part of the story I was telling. But there’s also a lot of signal. Otherwise we wouldn’t see that band of dots lined up on the diagonal of the log-log plot.
Hmm, I wonder if the “batching” of numbers from 10^k to 10^{k+1} in your last graph is a byproduct of Benford’s Law[1]?
[1] http://en.wikipedia.org/wiki/Benford's_law
The labels on the x-axis of your second graph appear incorrectly to me. Instead of “0 10 20 30 … 90 100,” they appear as “01 02 03 04 … 09 0 100.”
Wow! Thanks for alerting me. For me the problem turns up only in Firefox; when I view the page in Chrome, Safari or Opera, all’s well. I’ll try to figure out what’s going wrong and fix it later today.
Fixed now. Weird bug; hardly worth explaining. My current graphics workflow involves generating data in Python, transferring it to Lisp, creating PostScript, importing into Adobe Illustrator, and finally generating SVG. It turns out that Illustrator tries to save a few bytes in the SVG file by consolidating bits of type that happen to line up horizontally. Instead of separate objects for ‘1’, ‘2’, ‘3’, it produces a string ‘1 2 3’ with wide inter-character spacing. Baroque, but it works in most browsers. Firefox, however, ignores the character spacing.
Hi Brian,
Tricky bug! Have you considered Matplotlib for outputting figures in SVG? I can’t speak to the quality of the generated XML, but it seems to render correctly in Chrome and Firefox. In any event, it could help simplify your graphics workflow some.
Here’s an attempt to replicate your first figure using only Python. I only had patience to wait for 50 samples, but I think it’s a reasonable facsimile. I also attempted to address the limitations identified by several readers regarding the accuracy of the results count.
Curiously, Google began rejecting my requests (with a 503 response) despite a 1 second courtesy pause - did you run into this issue as well?
Wow! You mean I can make pictures with just one programming language, instead of three or four!?
I actually use matplotlib, when I want a quick peek at the data. But I’ve never been able to get matplotlib to make the finished product look the way I want it to. (The seaborn module helps, but not quite enough.) Still, I really do need to join the 21st century at some point. The home-baked graphics routines I use have roots that go back 25 years — to a time when there was no Python, no SVG, no www.
New year’s resolution: I’m going to update this stuff.
I can’t explain the 503s. I haven’t seen any HTTP errors of any kind. I can only guess that the problem might have to do with your added parameters, ‘start=20’ and ‘rc=1’, which I haven’t yet tried. Maybe Google barfs when you ask for the third page of results if there are fewer than three pages?
Search for 25898913 on google.co.uk - 262,000 results instead.
Fascinating!
Oddly, when I googled on 25898913, I got a different result, but I did see the essential qualitative vagaries in its neighborhood. Here are the number of hits for the range 25898900-25898929 (dropping all but the last two digits of the number):
00: 5400
01: 5740
02: 3910
03: 25,000,000
04: 2630
05: 2350
06: 2190
07: 2020
08: 2190
09: 2360
10: 3590
11: 4190
12: 2300
13: 24,300,000*
14: 23,600,000
15: 25,300,000
16: 25,000,000
17: 20,400,000
18: 22,200,000
19: 27,500,000
20: 2920
21: 2250
22: 23,000,000
23: 2980
24: 34,200,000
25: 47,800,000
26: 2420
27: 19,700,000
28: 23,400,000
29: 2980
30: 5500
*Well I’ll be darned. When I did this search earlier from home (I’m composing this at a coffee shop), Google only gave 171,000 hits for 25898913. I remember that distinctly. I’ll have to check again when I get back home.
In any event, here’s what I get for the same range when I change the second digit from a 5 to a 6, i.e., when I google on 26898900-29:
00: 3840
01: 2640
02: 2220
03: 1960
04: 1970
05: 19,300
06: 1910
07: 2150
08: 1940
09: 1960
10: 2520
11: 21,000
12: 1880
13: 2150
14: 2070
15: 2130
16: 2170
17: 3540
18: 4680
19: 2660
20: 1850
21: 1820
22: 1990
23: 1780
24: 1720
25: 1770
26: 12,200
27: 2210
28: 2100
29: 2740
This is entirely consistent (though not digit-for-digit identical) with what I see. And it doesn’t seem to be some sort of short-living inconsistency in the distributed database; I first noted it three weeks ago.
I’m back home and googled 25898913 again, and this time got 24,300,000 hits. My notes from earlier definitely say 171,000 (and I remember checking it a couple of times). All the other numbers I jotted down earlier agree, more or less, with what I got at the coffee shop. I can’t explain how Google mislaid so many hits for the one number earlier (nor why it should do so for the specific number you mentioned in your column, unless Google is monitoring our activities and messing with us).
I hereby claim the number 10041295937142.
I claim 08827835699.
No, 00987835699.
No, 00097835699.
But I can’t beat 11 digits.