All those books that Google has been scanning for the past ten years are surprisingly rich in numbers as well as words. The Google Books data set released last December by a Harvard-Google team includes (by my count) 9,620,835,344 occurrences of 458,794 distinct numbers. (Plus another 31,293 numeric values that have dollar signs attached.)

In recognition of pi day, I want to zero in on some successive approximations to the world’s favorite irrational:

In tabular form here are the closest approximations found in the files, along with the abundance of each value:

3.141592 | 704 |

3.1415923 | 80 |

3.1415926 | 1141 |

3.14159265 | 1300 |

3.141592653 | 143 |

3.1415926535 | 286 |

3.141592653589 | 54 |

3.14159265358979 | 338 |

3.141592653589793 | 453 |

3.1415926535898 | 65 |

3.14159265359 | 177 |

3.1415926536 | 289 |

3.141592654 | 512 |

3.1415927 | 776 |

3.1415928 | 109 |

3.1415929 | 133 |

3.141593 | 843 |

The data set includes only items that appear at least 40 times in the collection of scanned volumes. Closer approximations to pi evidently fell below that threshold. In particular there is no sign of William Shanks’s famous 707-digit calculation, which was published in 1873. So, just for the sake of celebrating 3.14, here are 707 digits of pi—but unlike the product of Shanks’s many years of labor, I think these digits may be correct:

3.1415926535897932384626433832795028841971693993751058209

74944592307816406286208998628034825342117067982148086513

282306647093844609550582231725359408128481117450284102701

938521105559644622948954930381964428810975665933446128475

648233786783165271201909145648566923460348610454326648213

393607260249141273724587006606315588174881520920962829254

0917153643678925903600113305305488204665213841469519415116

09433057270365759591953092186117381932611793105118548074462

3799627495673518857527248912279381830119491298336733624406

5664308602139494639522473719070217986094370277053921717629

3176752384674818467669405132000568127145263560827785771342

7577896091736371787214684409012249534301465495853710507922

796892589235420200

That’s an interesting place to stop, as printing 707 digits gives “…20200″, but printing 708 digits gives “…201996″. It’s a rare (well, one in a hundred, I suppose) occurrence that three digits depend on your choice of rounding.

3. 1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679 8214808651 3282306647 0938446095 5058223172 5359408128 4811174502 8410270193 8521105559 6446229489 5493038196 4428810975 6659334461 2847564823 3786783165 2712019091 4564856692 3460348610 4543266482 1339360726 0249141273 7245870066 0631558817 4881520920 9628292540 9171536436 7892590360 0113305305 4882046652 1384146951 9415116094 3305727036 5759591953 0921861173 8193261179 3105118548 0744623799 6274956735 1885752724 8912279381 8301194912 9833673362 4406566430 8602139494 6395224737 1907021798 6094370277 0539217176 2931767523 8467481846 7669405132 0005681271 4526356082 7785771342 7577896091 7363717872 1468440901 2249534301 4654958537 1050792279 6892589235 4201995611 2129021960 8640344181 5981362977 4771309960 5187072113 4999999… and so on!

I would have expected a lot more occurrences of pi related values. Did your search allow for 1 (one) being scanned as l (ell), 0 (zero) as o or O (ohh)? I found that this made a big difference when matching hexadecimal literals

@Derek Jones

Trying to measure the impact of OCR errors in the Google Books data set is very much on my mind. But getting reliable answers is not easy. You might think we could just compare the frequency of 3.1416 with all the plausible variants: 3.l416, 3.i416, 3.i4l6, 3.!4ib, &.iAlh, and so on. There are at least two problems with this approach. The first is that entropy works against us: There’s only one correct reading, but there are exponentially many possible errors. Thus even if the total number of erroneous readings is large, any one variant is likely to be rare and therefore

unlikely to reach the 40-count threshold for inclusion in the data set.In this particular case, the situation is even worse. The OCR algorithm reads a string of decimal digits with a single noninitial period as a number, but in any other context a period is taken to be a word separator. Thus “3.1416″ –> “3.1416″ but “3.14i6″ –> “3″ “.” “14i6″.

With integers we have a better chance of learning something. I have not done any sort of comprehensive study, but here’s a single example, plucked more or less at random—the number of occurrences for a few likely misreadings of “1997″

That’s more than 200,000 errors (assuming they are all in fact OCR failures, which is probably not far from the truth). This sounds horrendous. On the other hand, “1997″ itself occurs 19,761,552 times, so we’re looking at errors at the 1 percent level. Is that something to worry about? I just don’t know for sure. It probably depends on what you want to do with the data.

Thanks very much for that pointer to your posting on hex strings in the Google Books data. We must try roman numerals!

How about fractional approximations: 22/7, 335/113 and possibly 314/100? In particular, I expected 22/7 to show up on the graph, but it seems indistinguishable from the rest of the distribution.

Another curiosity on the graph: numbers rounded to 1 decimal place are 10-20 times as common as those rounded to 2 decimal places (the pattern of the terminal digits shows that terminal zeros are dropped), but these are about 40 times as common as numbers rounded to 3 decimal places. Perhaps section numbers are distorting the results.

To verify the 707 digits are correct, you should also publish a checksum. That way when I do the calculation, I can compare. I’ve seen several pi approximations, what are the correct float, double, and long double values? The int one is 3.

Brian,

Characters such as & and ! are each treated as a single entity so the number of possible combinations is not that great.

What made you pick on the 40 per book threshold, it sounds a bit high? I suspect 3.14 is more likely to refer to a section number than a poor representation of pi and perhaps have a handful of occurrences.

I have found a worryingly wide number of different representation of pi in source code.

Roman numerals would be interesting; I would expect the usage to decrease over time. Another possibility is to search for incorrect values, such as the 3.1459265 I found in Mifit’s core/Helix.cpp

So, which book contained the best approximation of pi?

When talking about the number of digits after the decimal point so the “3.” does not get counted. The 706 digits which ended with “200” should have been left at “1995” without rounding up.

So, which book contained the best approximation of pi? The key board here is book no web location. I have heard of a book with 2,000,000 digits of pi. It was known as the most boring book in the world. I have printed the first 1,000,000 digits of pi for the fun of it.

If you want to find out more about the number William Shanks produced with the error that stood for 72 years, Google “William Shanks 707 digits” for my file. William Shanks calculated 709 and rounded down to 707 digits

Erwin