Given the present state of life in America, what we really need is an Approximation to Rationality Day, but that may have to wait for 20/1/21. In the meantime, let us merrily fiddle with numbers, searching for ratios of integers that brazenly invade the personal space of famous irrationals.
When I was a teenager, somebody told me about the number 355/113, which is an exceptionally good approximation to π. Its value, as stored in double-precision floating point, is
3.141592920353982520964564173482358455657958984375,
correct through the first six digits after the decimal point. In other words, it differs from the true value by less than one-millionth. I was intrigued, and so I set out to find an even better approximation. My search was necessarily a pencil-and-paper affair, since I had no access to any electronic or even mechanical aids to computation. The spiral-bound notebook in which I made my calculations has not survived, and I remember nothing about the outcome of the effort.
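The one-millionth claim is easy to check in a couple of lines of Julia. This is only a sketch; the names `approx` and `abs_error` are labels I made up for illustration.

```julia
# Compare 355/113 with π in ordinary double-precision arithmetic.
approx = 355 / 113
abs_error = abs(approx - π)      # absolute difference from π

println("355/113 = $approx, off by $abs_error")
println("within one-millionth? ", abs_error < 1e-6)
```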
A dozen years later I acquired some computing machinery: a Hewlett-Packard programmable calculator, called the HP-41C. Here is the main loop of an HP-41C program that searches for good rational approximations. Note the date at the top of the printout (written in middle-endian format). Apparently I was finishing up this program just before Approximation Day in 1981.
What’s that you say? You’re not fluent in the 30-year-old Hewlett-Packard dialect of reverse Polish notation? All right, here’s a program that does roughly the same thing, written in an oh-so-modern language, Julia.
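    function approximate(T, dmax)
        d = 1
        leastError = T
        # Try each denominator in turn; stop early if an exact match turns up.
        while d <= dmax && leastError > 0
            n = Int(round(d * T))            # nearest integer numerator for this d
            err = abs(T - n/d) / T           # relative error of n/d
            merit = 1 / ((n + d)^2 * err)    # the 1981 figure of merit
            if err < leastError              # report only new record holders
                println("$n/$d = $(n/d) error = $err merit = $merit")
                leastError = err
            end
            d += 1
        end
    end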
The algorithm is a naive, sequential search for fractions \(n/d\) that approximate the target number \(T\). For each value of \(d\), you need to consider only one value of \(n\), namely the integer nearest to \(d \times T\). (What happens if \(d \times T\) falls halfway between two integers? That can’t happen if \(T\) is irrational.) Thus you can begin with \(d = 1\) and continue up to a specified largest denominator \(d = dmax\). The accuracy of the approximation is measured by the error term \(|T - n/d| / T\). Whenever a value of \(n/d\) yields a new minimum error, the program prints a line of results. (This version of the algorithm works correctly only for \(T \gt 1\), but it can readily be adapted to \(T \lt 1\).)
The HP-41C has a numerical precision of 10 decimal digits, and so the closest possible approximation to π is 3.141592654. Back in 1981 I ran the program until it found a fraction equal to this value—a perfect approximation, from the program’s point of view. According to a note on the printout, that took 13 hours. The Julia program above, running on a laptop, completes the same computation in about three milliseconds. You’re welcome to take a scroll through the results, below. (The numbers are not digit-for-digit identical to those generated by the HP-41C because Julia calculates with higher precision, about 16 decimal digits.)
    3/1 = 3.0 error = 0.045070341573315915 merit = 1.3867212410256813
    13/4 = 3.25 error = 0.03450712996224109 merit = 0.10027514940370374
    16/5 = 3.2 error = 0.018591635655129744 merit = 0.12196741256165179
    19/6 = 3.1666666666666665 error = 0.007981306117055373 merit = 0.20046844169789904
    22/7 = 3.142857142857143 error = 0.0004024993041452083 merit = 2.9541930379680195
    179/57 = 3.1403508771929824 error = 0.00039526983405584675 merit = 0.04542368072920613
    201/64 = 3.140625 error = 0.0003080138345651019 merit = 0.04623150469956595
    223/71 = 3.140845070422535 error = 0.00023796324342470652 merit = 0.04861781754719378
    245/78 = 3.141025641025641 error = 0.0001804858353094197 merit = 0.053107007660473673
    267/85 = 3.1411764705882352 error = 0.00013247529441315622 merit = 0.060922789404334425
    289/92 = 3.141304347826087 error = 9.177070539240495e-5 merit = 0.07506646742266793
    311/99 = 3.1414141414141414 error = 5.6822320879624425e-5 merit = 0.10469195703580983
    333/106 = 3.141509433962264 error = 2.6489760736525772e-5 merit = 0.19588127575835135
    355/113 = 3.1415929203539825 error = 8.478310581938076e-8 merit = 53.85164473263654
    52518/16717 = 3.1415923909792425 error = 8.37221074104896e-8 merit = 0.00249177288308447
    52873/16830 = 3.141592394533571 error = 8.259072954625822e-8 merit = 0.0024921016732136797
    53228/16943 = 3.1415923980404887 error = 8.147444291923546e-8 merit = 0.0024926612882136163
    53583/17056 = 3.141592401500938 error = 8.03729477091334e-8 merit = 0.0024934520351304946
    53938/17169 = 3.1415924049158366 error = 7.928595172899531e-8 merit = 0.0024944743578840687
    54293/17282 = 3.141592408286078 error = 7.821317056655376e-8 merit = 0.0024957288257085445
    54648/17395 = 3.141592411612532 error = 7.715432730151448e-8 merit = 0.002497216134767719
    55003/17508 = 3.1415924148960475 error = 7.610915194012454e-8 merit = 0.0024989371196291283
    55358/17621 = 3.1415924181374497 error = 7.507738155653036e-8 merit = 0.0025008927426067996
    55713/17734 = 3.1415924213375437 error = 7.405876001006156e-8 merit = 0.0025030840968725283
    56068/17847 = 3.1415924244971145 error = 7.305303737979925e-8 merit = 0.002505512419906649
    56423/17960 = 3.1415924276169265 error = 7.20599703886498e-8 merit = 0.002508179074048983
    56778/18073 = 3.141592430697726 error = 7.107932141383905e-8 merit = 0.0025110855755419263
    57133/18186 = 3.14159243374024 error = 7.01108591937022e-8 merit = 0.002514233565685482
    57488/18299 = 3.1415924367451775 error = 6.915435783817789e-8 merit = 0.0025176248413626597
    57843/18412 = 3.1415924397132304 error = 6.820959725288218e-8 merit = 0.0025212613363967255
    58198/18525 = 3.141592442645074 error = 6.727636243231866e-8 merit = 0.002525145143834103
    58553/18638 = 3.141592445541367 error = 6.635444374259433e-8 merit = 0.0025292785028112976
    58908/18751 = 3.141592448402752 error = 6.544363663870371e-8 merit = 0.0025336638062423296
    59263/18864 = 3.141592451229856 error = 6.454374152317083e-8 merit = 0.002538303603848205
    59618/18977 = 3.1415924540232916 error = 6.365456332197522e-8 merit = 0.002543200616913158
    59973/19090 = 3.1415924567836564 error = 6.277591190862598e-8 merit = 0.002548357720152209
    60328/19203 = 3.1415924595115348 error = 6.190760125601375e-8 merit = 0.0025537779743748956
    60683/19316 = 3.1415924622074964 error = 6.10494500018427e-8 merit = 0.0025594646031786867
    61038/19429 = 3.1415924648720983 error = 6.020128088319864e-8 merit = 0.002565421015548036
    61393/19542 = 3.141592467505885 error = 5.936292059519092e-8 merit = 0.0025716508123781218
    61748/19655 = 3.141592470109387 error = 5.853420007366852e-8 merit = 0.0025781577749599853
    62103/19768 = 3.1415924726831244 error = 5.771495407114599e-8 merit = 0.002584945883912429
    62458/19881 = 3.141592475227604 error = 5.690502101544554e-8 merit = 0.002592019327133724
    62813/19994 = 3.141592477743323 error = 5.6104242868339024e-8 merit = 0.0025993825084809985
    63168/20107 = 3.1415924802307655 error = 5.531246526690591e-8 merit = 0.0026070400439016164
    63523/20220 = 3.1415924826904056 error = 5.4529537523533324e-8 merit = 0.0026149967637792084
    63878/20333 = 3.141592485122707 error = 5.375531191912607e-8 merit = 0.002623257749852838
    64233/20446 = 3.141592487528123 error = 5.2989644268538606e-8 merit = 0.0026318283126966317
    64588/20559 = 3.141592489907097 error = 5.22323933551431e-8 merit = 0.0026407140236596287
    64943/20672 = 3.1415924922600618 error = 5.148342135490336e-8 merit = 0.002649920699086574
    65298/20785 = 3.1415924945874427 error = 5.0742592988226976e-8 merit = 0.002659454449139831
    65653/20898 = 3.1415924968896545 error = 5.0009776226755164e-8 merit = 0.0026693216486930156
    66008/21011 = 3.141592499167103 error = 4.928484186928889e-8 merit = 0.002679528965991537
    66363/21124 = 3.1415925014201855 error = 4.8567663400430846e-8 merit = 0.0026900833784673454
    66718/21237 = 3.1415925036492913 error = 4.7858116990585446e-8 merit = 0.0027009921818650063
    67073/21350 = 3.141592505854801 error = 4.715608149595883e-8 merit = 0.0027122629998437182
    67428/21463 = 3.1415925080370872 error = 4.6461438175842924e-8 merit = 0.002723903810648984
    67783/21576 = 3.1415925101965145 error = 4.577407111668933e-8 merit = 0.002735922933992634
    68138/21689 = 3.1415925123334407 error = 4.5093866383961494e-8 merit = 0.0027483290931549346
    68493/21802 = 3.1415925144482157 error = 4.442071258756658e-8 merit = 0.002761131395876878
    68848/21915 = 3.141592516541182 error = 4.375450074049751e-8 merit = 0.002774339356802981
    69203/22028 = 3.1415925186126747 error = 4.309512411747499e-8 merit = 0.0027879629217230834
    69558/22141 = 3.1415925206630235 error = 4.244247783087354e-8 merit = 0.002802012512429091
    69913/22254 = 3.14159252269255 error = 4.179645953751142e-8 merit = 0.0028164989998024
    70268/22367 = 3.1415925247015695 error = 4.115696873186072e-8 merit = 0.0028314337694556623
    70623/22480 = 3.1415925266903915 error = 4.0523907028763286e-8 merit = 0.002846828724926181
    70978/22593 = 3.141592528659319 error = 3.989717788071482e-8 merit = 0.00286269633032941
    71333/22706 = 3.1415925306086496 error = 3.9276686719222797e-8 merit = 0.0028790496258831624
    71688/22819 = 3.1415925325386738 error = 3.86623409548065e-8 merit = 0.0028959022542887716
    72043/22932 = 3.141592534449677 error = 3.805404969428105e-8 merit = 0.0029132685103826087
    72398/23045 = 3.1415925363419395 error = 3.7451723882115376e-8 merit = 0.0029311633622333107
    72753/23158 = 3.1415925382157353 error = 3.685527615907423e-8 merit = 0.002949602495467867
    73108/23271 = 3.1415925400713336 error = 3.626462086221821e-8 merit = 0.002968602349703417
    73463/23384 = 3.1415925419089974 error = 3.567967430761971e-8 merit = 0.002988180133716996
    73818/23497 = 3.141592543728987 error = 3.510035365949903e-8 merit = 0.003008353961046636
    74173/23610 = 3.1415925455315543 error = 3.452657862652023e-8 merit = 0.003029142753805288
    74528/23723 = 3.1415925473169497 error = 3.395826962413729e-8 merit = 0.0030505664465106676
    74883/23836 = 3.141592549085417 error = 3.339534904681598e-8 merit = 0.0030726459300795604
    75238/23949 = 3.1415925508371956 error = 3.283774056124397e-8 merit = 0.003095403169820992
    75593/24062 = 3.141592552572521 error = 3.228536938904675e-8 merit = 0.0031188612412389144
    75948/24175 = 3.1415925542916234 error = 3.173816202407169e-8 merit = 0.0031430444223940115
    76303/24288 = 3.14159255599473 error = 3.1196046373746034e-8 merit = 0.0031679782521683033
    76658/24401 = 3.141592557682062 error = 3.065895190043484e-8 merit = 0.0031936895918127546
    77013/24514 = 3.1415925593538385 error = 3.01268089146511e-8 merit = 0.0032202067806171002
    77368/24627 = 3.141592561010273 error = 2.9599549423203633e-8 merit = 0.003247559639023363
    77723/24740 = 3.1415925626515766 error = 2.9077106281049175e-8 merit = 0.0032757796556622983
    78078/24853 = 3.1415925642779543 error = 2.8559414180798277e-8 merit = 0.0033048999843237645
    78433/24966 = 3.14159256588961 error = 2.804640823913544e-8 merit = 0.003334955716987436
    78788/25079 = 3.1415925674867418 error = 2.753802541039899e-8 merit = 0.0033659838476231357
    79143/25192 = 3.141592569069546 error = 2.703420321435919e-8 merit = 0.003398023556100075
    79498/25305 = 3.1415925706382137 error = 2.6534880725724155e-8 merit = 0.0034311162371627422
    79853/25418 = 3.141592572192934 error = 2.6039997867349902e-8 merit = 0.0034653057466538235
    80208/25531 = 3.141592573733892 error = 2.554949569295635e-8 merit = 0.0035006385417218717
    80563/25644 = 3.14159257526127 error = 2.5063316245769302e-8 merit = 0.0035371638899188347
    80918/25757 = 3.1415925767752455 error = 2.4581402841236452e-8 merit = 0.0035749340371894456
    81273/25870 = 3.1415925782759953 error = 2.410369936023742e-8 merit = 0.003614004535709633
    81628/25983 = 3.1415925797636914 error = 2.3630150955873712e-8 merit = 0.00365443439340209
    81983/26096 = 3.141592581238504 error = 2.3160703488036753e-8 merit = 0.00369628643041249
    82338/26209 = 3.141592582700599 error = 2.2695304230197833e-8 merit = 0.003739627468693587
    82693/26322 = 3.1415925841501404 error = 2.2233900879902193e-8 merit = 0.0037845288174018898
    83048/26435 = 3.1415925855872895 error = 2.1776442124200985e-8 merit = 0.0038310665494126084
    83403/26548 = 3.1415925870122043 error = 2.1322877639651253e-8 merit = 0.00387932189896066
    83758/26661 = 3.1415925884250404 error = 2.087315795095796e-8 merit = 0.003929381726572982
    84113/26774 = 3.1415925898259505 error = 2.0427234430973973e-8 merit = 0.003981339007706688
    84468/26887 = 3.1415925912150855 error = 1.9985059017984126e-8 merit = 0.004035293430477111
    84823/27000 = 3.1415925925925925 error = 1.9546584922495102e-8 merit = 0.004091351857390988
    85178/27113 = 3.1415925939586176 error = 1.9111765637729565e-8 merit = 0.004149629190123568
    85533/27226 = 3.1415925953133033 error = 1.868055578777407e-8 merit = 0.004210248941258058
    85888/27339 = 3.141592596656791 error = 1.825291042078912e-8 merit = 0.004273344214343279
    86243/27452 = 3.1415925979892174 error = 1.78287858571571e-8 merit = 0.004339058439193095
    86598/27565 = 3.1415925993107203 error = 1.7408138417260385e-8 merit = 0.004407546707464268
    86953/27678 = 3.1415926006214323 error = 1.6990925835061217e-8 merit = 0.004478976601684539
    87308/27791 = 3.1415926019214853 error = 1.6577106127237806e-8 merit = 0.004553529781140699
    87663/27904 = 3.1415926032110093 error = 1.6166637875900305e-8 merit = 0.004631403402447433
    88018/28017 = 3.141592604490131 error = 1.5759480794022753e-8 merit = 0.0047128116308472546
    88373/28130 = 3.141592605758976 error = 1.5355594877295166e-8 merit = 0.004797987771392931
    88728/28243 = 3.1415926070176683 error = 1.4954940686839493e-8 merit = 0.0048871863549194705
    89083/28356 = 3.141592608266328 error = 1.4557479914641577e-8 merit = 0.004980685405908598
    89438/28469 = 3.141592609505076 error = 1.4163174252687263e-8 merit = 0.005078789613658918
    89793/28582 = 3.1415926107340284 error = 1.3771986523826276e-8 merit = 0.005181833172630217
    90148/28695 = 3.141592611953302 error = 1.338387969226633e-8 merit = 0.005290183824183623
    90503/28808 = 3.1415926131630103 error = 1.2998817570363058e-8 merit = 0.005404246870669908
    90858/28921 = 3.1415926143632653 error = 1.2616764535904027e-8 merit = 0.005524470210563737
    91213/29034 = 3.141592615554178 error = 1.2237685249392783e-8 merit = 0.005651350205744754
    91568/29147 = 3.1415926167358563 error = 1.1861545360838771e-8 merit = 0.005785438063205309
    91923/29260 = 3.1415926179084073 error = 1.1488310802967408e-8 merit = 0.005927347979056494
    92278/29373 = 3.141592619071937 error = 1.111794779122008e-8 merit = 0.006077766389438445
    92633/29486 = 3.141592620226548 error = 1.0750423671902066e-8 merit = 0.006237462409303776
    92988/29599 = 3.1415926213723435 error = 1.0385705649960649e-8 merit = 0.006407301439430316
    93343/29712 = 3.1415926225094237 error = 1.0023761778491034e-8 merit = 0.00658826005755035
    93698/29825 = 3.1415926236378877 error = 9.664560534662385e-9 merit = 0.006781444748602359
    94053/29938 = 3.1415926247578327 error = 9.308070961075804e-9 merit = 0.006988114128701429
    94408/30051 = 3.1415926258693556 error = 8.954262241690382e-9 merit = 0.007209706348604964
    94763/30164 = 3.14159262697255 error = 8.603104549971112e-9 merit = 0.007447871540046976
    95118/30277 = 3.14159262806751 error = 8.254567918024995e-9 merit = 0.007704513406469473
    95473/30390 = 3.141592629154327 error = 7.90862336746494e-9 merit = 0.007981838717667477
    95828/30503 = 3.1415926302330917 error = 7.565241919903853e-9 merit = 0.008282421184374838
    96183/30616 = 3.1415926313038933 error = 7.224395162386583e-9 merit = 0.008609280341750632
    96538/30729 = 3.1415926323668195 error = 6.8860552473899216e-9 merit = 0.008965982432171553
    96893/30842 = 3.1415926334219573 error = 6.550194468748648e-9 merit = 0.009356770586561815
    97248/30955 = 3.141592634469391 error = 6.216785968445456e-9 merit = 0.009786732283709331
    97603/31068 = 3.1415926355092054 error = 5.885802747105052e-9 merit = 0.010262022067809991
    97958/31181 = 3.1415926365414837 error = 5.557218370784088e-9 merit = 0.010790155391967196
    98313/31294 = 3.1415926375663066 error = 5.231007112329143e-9 merit = 0.011380406450991833
    98668/31407 = 3.1415926385837554 error = 4.907143103228812e-9 merit = 0.012044356667029002
    99023/31520 = 3.1415926395939087 error = 4.585601323119603e-9 merit = 0.01279665696194468
    99378/31633 = 3.141592640596845 error = 4.266356751638026e-9 merit = 0.013656119502875172
    99733/31746 = 3.1415926415926414 error = 3.9493849338525334e-9 merit = 0.014647305857352692
    100088/31859 = 3.141592642581374 error = 3.6346615561895634e-9 merit = 0.015802906908552822
    100443/31972 = 3.1415926435631176 error = 3.322162870507497e-9 merit = 0.017167407267272748
    100798/32085 = 3.1415926445379463 error = 3.0118652700227016e-9 merit = 0.018802933529623964
    101153/32198 = 3.141592645505932 error = 2.703745854741474e-9 merit = 0.020798958087527405
    101508/32311 = 3.1415926464671475 error = 2.397781441954139e-9 merit = 0.02328921472604781
    101863/32424 = 3.141592647421663 error = 2.0939496970989362e-9 merit = 0.026482916558483883
    102218/32537 = 3.1415926483695484 error = 1.7922284269720909e-9 merit = 0.03072676661447583
    102573/32650 = 3.141592649310873 error = 1.492595579727815e-9 merit = 0.03664010548445531
    102928/32763 = 3.141592650245704 error = 1.195029527594277e-9 merit = 0.04544847105306477
    103283/32876 = 3.1415926511741086 error = 8.995092082315892e-10 merit = 0.05996553050516452
    103638/32989 = 3.1415926520961532 error = 6.060132765838922e-10 merit = 0.0883984797913258
    103993/33102 = 3.1415926530119025 error = 3.1452123574324146e-10 merit = 0.16916355170353897
    104348/33215 = 3.141592653921421 error = 2.5012447443706518e-11 merit = 2.1127131430431656
The error values in the middle column of the table above shrink steadily as you read from the top of the list to the bottom. Each successive approximation is more accurate than all those above it. Does that also mean each successive approximation is better than those above it? I would say no. Any reasonable notion of “better” in this context has to take into account the size of the numerator and the denominator.
If you want an approximation of \(\pi\) accurate to seven digits, I can give you one off the top of my head: \(3141593/1000000\). But the numbers making up that ratio are themselves seven digits long. What makes \(355/113\) impressive is that it achieves seven-digit accuracy with only three digits in the numerator and the denominator. Accordingly, I would argue that a “better” approximation is one that minimizes both error and size. The rightmost column of the table, filled with numbers labeled “merit,” is meant to quantify this intuition.
When I wrote that program in 1981, I chose a strange formula for merit, one that now baffles me:
\[\frac{1}{(n + d)^2 \times err}.\]
Adding the numerator and denominator and then squaring the sum is an operation that makes no sense, although the formula as a whole does have the correct qualitative behavior, favoring both smaller errors and smaller values of \(n\) and \(d\). In trying to reconstruct what I had in mind back in 1981, my best guess is that I was trying to capture a geometric insight, and I flubbed it when translating math into code. On this assumption, the correct figure of merit would be:
\[\frac{1}{\sqrt{n^2 + d^2} \times err}.\]
To see where this formula comes from, consider a two-dimensional lattice of integers, with a ray of slope \(\pi\) drawn from the origin and going on to infinite distance.
Because the line’s slope is irrational, it will never pass through any point of the integer lattice, but it will have many near misses. The near-miss points, with coordinates interpreted as numerator and denominator, are the accurate approximations to \(\pi\). The diagram suggests a measure of the merit based on distances. An approximation gets better when we minimize the distance of the lattice point from the origin as well as the vertical distance from the point to the \(\pi\) line. That’s the meaning of the formula with \(\sqrt{n^2 + d^2}\) in the denominator.
Another approach to defining merit simply counts digits. The merit is the ratio of the number of correctly predicted digits in the irrational target \(T\) to the number of digits in the denominator. A problem with this scheme is that it’s rather coarse. For example, \(13/4\) and \(16/5\) both have single-digit denominators and they each get one digit of \(\pi\) correct, but \(16/5\) actually has a smaller error.
To smooth out the digit-counting criterion, and distinguish between values that differ in magnitude but have the same number of digits, we can take logarithms of the numbers. Let merit equal: \(-\log(err) / \log(d)\). (The \(\log(err)\) term is negated because the error is always less than \(1\) and so its logarithm is negative.)
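The three figures of merit are simple enough to compute side by side. What follows is a minimal sketch, not the program that produced my tables: the function name `merits` and the field names `merit1981`, `distance` and `logmerit` are hypothetical labels, and the error is the same relative error used in the search loop.

```julia
# Three figures of merit for the approximation n/d to the target T.
# Function and field names are hypothetical, chosen for illustration.
function merits(n::Integer, d::Integer, T::Real)
    err = abs(T - n / d) / T                      # relative error, as in the search loop
    return (
        merit1981 = 1 / ((n + d)^2 * err),        # the 1981 formula
        distance  = 1 / (sqrt(n^2 + d^2) * err),  # lattice distance from the origin
        logmerit  = -log(err) / log(d),           # smoothed digit-counting ratio
    )
end

for (n, d) in ((22, 7), (355, 113), (3141593, 1000000))
    println("$n/$d  ", merits(n, d, π))
end
```

Note that the logarithmic merit is undefined in any useful sense when \(d = 1\), because \(\log(1) = 0\); the sketch simply returns `Inf` in that case.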
Here’s a comparison of the three merit criteria for some selected approximations to \(\pi\):
n/d | 1981 merit | distance merit | log merit |
---|---|---|---|
3/1 | 1.3867212448620723 | 7.016316181613145 | -- |
13/4 | 0.10027514901117529 | 2.1306165422053285 | 2.4284808488226544 |
16/5 | 0.12196741168912356 | 3.208700907602539 | 2.4760467349663537 |
19/6 | 0.20046843839209055 | 6.288264070960828 | 2.6960388788612515 |
22/7 | 2.954192079226498 | 107.61458138965322 | 4.017563128080901 |
179/57 | 0.04542369572848121 | 13.467303354323912 | 1.9381258641568968 |
201/64 | 0.04623152429195394 | 15.390920494844842 | 1.9441196398907357 |
223/71 | 0.04861784421796857 | 17.956388291625093 | 1.9573120958787444 |
245/78 | 0.05310704607396699 | 21.548988850935377 | 1.9785253787278367 |
267/85 | 0.06092284944437125 | 26.93965209372642 | 2.0098618723780515 |
289/92 | 0.07506657421887829 | 35.92841360228601 | 2.055872071177696 |
311/99 | 0.10469219759604646 | 53.921550739835986 | 2.1273838230139175 |
333/106 | 0.1958822412726219 | 108.02438852795403 | 2.259868093766371 |
355/113 | 53.76883630752973 | 31610.90993685001 | 3.444107245852723 |
52163/16604 | 0.002495514149618044 | 215.57611105028013 | 1.6757260012234105 |
• • • | • • • | • • • | • • • |
103993/33102 | 0.2892417579456485 | 49813.04849576935 | 2.1538978293241056 |
104348/33215 | 0.5006051667655171 | 86508.24042805366 | 2.2065386096084607 |
208341/66317 | 0.3403602724772912 | 117433.39822796892 | 2.1589243556399245 |
312689/99532 | 0.6343809166515098 | 328504.0552596196 | 2.207421489352196 |
All three measures agree that \(22/7\) and \(355/113\) are quite special. In other respects they give quite different views of the data. My weird 1981 formula compares \((n + d)^{-2}\) with \(err^{-1}\); the asymmetry in the exponents suggests the merit will tend to zero as \(n\) and \(d\) increase, at least in the average case. The maximum of the distance-based measure, on the other hand, appears to grow without bound. And the logarithmic merit function seems to be settling on a value near 2.0. This implies that we shouldn’t expect to see many \(n/d \) approximations where the number of correct digits is greater than twice the number of digits in \(d\). The late Tom Apostol and Mamikon A. Mnatsakanian proved a closely related proposition (“Surprisingly accurate rational approximations,” Mathematics Magazine, Vol. 75, No. 4 (Oct. 2002), pp. 307-310).
The final joke on my 1981 self is that all this searching for better approximants can be neatly sidestepped by a bit of algorithmic sophistication. The magic phrase is “continued fractions.” The continued fraction for \(\pi\) begins:
\[ \pi = 3+\cfrac{1}{7+\cfrac{1}{15+\cfrac{1}{1+\cfrac{1}{292 + \cdots}}}}\]
Evaluating the successive levels of this expression yields a sequence of “convergents” that should look familiar:
\[3/1, 22/7, 333/106, 355/113, 103993/33102, 104348/33215.\]
It is a series of “best” approximations to \(\pi\), generated without bothering with all the intervening non-“best” values. I produced this list in CoCalc (a.k.a. SageMathCloud), following the excellent tutorial in William Stein’s Elementary Number Theory. Even much larger approximants gush forth from the algorithm in milliseconds. Here’s the 100th element of the series:
\[\frac{4170888101980193551139105407396069754167439670144501}{1327634917026642108692848192776111345311909093498260}\]
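For readers without Sage at hand, here is a minimal Julia sketch of the same idea. It is not the CoCalc code I ran; the function name `convergents` and the choice of `BigFloat` arithmetic at Julia's default 256-bit precision are assumptions made for illustration. The procedure peels off the integer part, takes the reciprocal of what remains, and builds each convergent from the previous two with the standard recurrence.

```julia
# Convergents of the continued fraction of x (hypothetical helper).
# The working precision of x limits how many terms can be trusted.
function convergents(x::BigFloat, count::Integer)
    result = Rational{BigInt}[]
    h_prev, h = big(0), big(1)        # numerators   h(-2), h(-1)
    k_prev, k = big(1), big(0)        # denominators k(-2), k(-1)
    for _ in 1:count
        a = floor(BigInt, x)          # next partial quotient
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        push!(result, h // k)
        frac = x - a
        frac == 0 && break            # x was rational; expansion terminates
        x = 1 / frac
    end
    return result
end

for c in convergents(BigFloat(π), 6)
    println(numerator(c), "/", denominator(c))
end
```

With only 256 bits to work with, this sketch runs out of trustworthy digits long before the 100th convergent; reaching that far would need a higher precision setting.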
A question remains: In what sense are these approximations “best”? It’s guaranteed that every element of the series is more accurate than all those that came before, but it’s not clear to me that they also satisfy any sort of compactness criterion. But that’s a question to be taken up another day. Perhaps on Continued Fraction Day.
In a barn, 100 chicks sit peacefully in a circle. Suddenly, each chick randomly pecks the chick immediately to its left or right. What is the expected number of unpecked chicks?
Robitaille took less than a second to buzz in with the correct answer, according to the Times.
The next day, Jordan Ellenberg tweeted a followup problem:
Since I don’t have to squeeze this story into 140 characters, I’ll fill in some details of Ellenberg’s question, as I understand it. Where the original problem called for a single round of synchronized random pecking, we now have multiple rounds. During a round, each chick randomly turns either left or right and pecks one of its neighbors. However, once a chick has been pecked, it will never peck again, even if it continues to receive pecks. When two adjacent chicks peck each other in the same round, they both drop out of the pecking game for all future rounds. If an unpecked chick winds up sitting between two pecked neighbors, it can never be pecked and will therefore keep on pecking forever. The question is, what proportion of the flock will survive to become invulnerable peckers?
Spoilers below, so now’s the time to work out the answers for yourself. While you’re busy with that, I’m going to say a few words about chickens, and about the rhetoric and semiotics of mathematical “word problems.”
My only direct knowledge of poultry comes from boyhood visits to my Aunt Noretta’s farm in southern New Jersey. That’s not much of a claim to expertise, but for what it’s worth I never saw her chickens sit in a circle, and they didn’t peck randomly. (They had a pecking order!) Furthermore, nothing I observed in their social interactions resembled the turn-the-other-cheek behavior of the chickens described in this problem. Why does a pecked chick never peck again? This is a bigger riddle than the quantitative question we are asked to address. Has the chick suddenly discovered the wisdom and power of nonviolence? I can think of another explanation, but it’s not for the squeamish: Maybe pecked chicks don’t peck back because pecks are lethal.
I know it’s silly to demand narrative realism in a story like this one. Mathematical word problems belong to a genre where no one expects verisimilitude. They are set in a world where knaves always lie and knights always speak the truth, where shipwrecked sailors obsess about the divisibility properties of a pile of coconuts, where people don’t know the color of the hat on their own head. Even the laws of physics yield to mathematical necessity: A fly shuttling between oncoming locomotives instantaneously reverses direction. Those chicks sitting in a circle are not fluffy bundles of yellow plumage; they are mathematical abstractions. They have coordinates and state variables rather than feathers.
I’m okay with abstraction; by all means, let us strip away extraneous detail. Nevertheless, isn’t the point of word problems to connect the mathematics to some aspect of familiar experience? Consider the ancient and famous river-crossing problem, where the fox must not be left alone with the chicken, which must not be left alone with the bag of corn. These constraints are easy to understand when you know something about the dietary preferences of foxes and chickens. That kind of intuitive boost is not to be found in the pecking problem. On the contrary, a little knowledge of avian behavior actually makes the problem more perplexing.
But no matter. Onward! Have you come up with your answers?
The single-round problem from the Mathcounts Competition yields to the oldest trick in the probability book. A chick remains unpecked only if both of its neighbors turn away and peck in the other direction. On both the left and the right, the probability of escaping a peck is \(\frac{1}{2}\), and the two events are independent, so the probability of staying unpecked on both sides is \(\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}\). This argument applies identically to all the birds in the circle, so you can expect 25 percent of the chicks to come through unscathed.
Do you agree with this analysis? I came up with it pretty quickly when I read the Times article (though not nearly fast enough to beat Luke Robitaille to the buzzer). But then I began to have doubts. Is it strictly true that a chick’s left and right neighbors are totally independent? After all, they are connected by a chain of other chicks. Perhaps some influence can propagate around the circle, creating a correlation between left and right and altering the probability of survival.
Time for an experiment: Write the program, run the simulation. Set up a ring of 100 unpecked chickens and allow a single round of random simultaneous pecking. Repeat many times and calculate the mean number of unpecked birds remaining. (Some quick notation: Let \(N\) be the number of chicks in the ring and \(S\) be the number that survive unpecked. I’ll use \(\bar{S}\) for the mean value of \(S\) averaged over \(R\) repetitions of the experiment.) My results:
\(R\) | \(\bar{S}\) |
---|---|
100 | 24.79 |
10,000 | 24.9881 |
1,000,000 | 25.000274 |
100,000,000 | 24.99991518 |
As expected, the mean is quite close to 25 survivors. Furthermore, each time the sample size increases by a factor of 100, the accuracy of the approximation improves about tenfold. This pattern conforms to a statistical rule of thumb: the error in a sample mean shrinks in proportion to the square root of the sample size. Thus the slight departures from \(\bar{S} = 25\) appear to be innocent random noise, not some systematic bias.
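The experiment takes only a few lines. The following is a sketch of one way to write it, not necessarily the program behind the table; the function names `unpecked_after_round` and `mean_unpecked`, and the seed, are my own choices for illustration.

```julia
using Random

# One round of simultaneous pecking on a ring of N chicks.
# Returns the number of chicks left unpecked.
function unpecked_after_round(N::Integer, rng::AbstractRNG)
    pecked = falses(N)
    for i in 1:N
        step = rand(rng, Bool) ? 1 : -1     # peck right or left at random
        pecked[mod1(i + step, N)] = true    # mod1 wraps around the ring
    end
    return count(!, pecked)
end

# Mean number of unpecked chicks over R repetitions.
function mean_unpecked(N::Integer, R::Integer; rng = MersenneTwister(2017))
    total = 0
    for _ in 1:R
        total += unpecked_after_round(N, rng)
    end
    return total / R
end

println(mean_unpecked(100, 10_000))
```

Because every chick pecks before anyone is counted, the order of the loop over `i` does not matter; that is what makes the round simultaneous.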
So that settles it, right?
Well, the simulation looks pretty convincing for the specific case of \(N = 100\) chicks, but the result might differ for other values of N. In particular, perhaps there’s some finite-size effect that becomes apparent only when N is small. Consider a “circle” of just two chicks. In this situation the left neighbor and the right neighbor are one and the same chicken! No matter what random choices are made, the two chicks immediately peck each other, and the proportion of survivors is not 25 percent but zero.
The next-larger “circle” consists of three chicks arranged in a triangle. The two neighbors of a chick are distinct, but they are also neighbors of each other. What happens when the three chickens are set loose on one another? The system has \( 2^3 = 8\) possible pecking patterns, and we can easily examine all of them. In the diagram, the arrows indicate where the chicks choose to direct their pecking.
In two cases, where all the chicks peck left or all peck right, there are no survivors. In every other instance exactly one chick remains unpecked. Aggregating the eight patterns, we find six unpecked chicks out of 24 total chicks, for a proportion of \(\bar{S} = \frac{1}{4}\). Thus it appears the finite-size anomaly afflicts only the two-chick version of the problem.
But wait! There’s another possible confounding factor. Can we be sure of seeing the same outcome for both even and odd numbers of chicks? For any odd value of N there is just one way to annihilate all the chicks in a single round: They must all peck in the same direction. For even N, however, another pattern also leads to immediate extinction: Adjacent chicks can pair up, knocking each other out. Won’t this extra pathway slightly alter the overall probability of survival?
Let’s see what happens with N = 4. Now there are \(2^4 = 16\) possible outcomes:
As expected, four patterns leave no survivors at all. On the other hand, there are also four patterns that leave two chicks unpecked rather than just one. Miraculously, the extra losses and the extra gains balance exactly. In all we have 16 survivors out of 64 chicks, so the ratio is again \(\bar{S} = \frac{1}{4}\).
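For rings this small there is no need to trust a diagram or a random-number generator; all \(2^N\) patterns can be enumerated directly. Here is a sketch, with the hypothetical function name `enumerate_patterns`, in which bit \(i\) of an integer mask records which way chick \(i\) pecks.

```julia
# Exhaustively enumerate all 2^N pecking patterns on a ring of N chicks.
# Returns the total number of unpecked chicks summed over all patterns,
# the number of patterns with no survivors, and the number of patterns.
function enumerate_patterns(N::Integer)
    total_survivors = 0
    wipeouts = 0
    for mask in 0:(2^N - 1)
        pecked = falses(N)
        for i in 1:N
            # Bit i-1 of the mask decides whether chick i pecks right (+1) or left (-1).
            step = (mask >> (i - 1)) & 1 == 1 ? 1 : -1
            pecked[mod1(i + step, N)] = true
        end
        s = count(!, pecked)
        total_survivors += s
        wipeouts += (s == 0)
    end
    return (survivors = total_survivors, wipeouts = wipeouts, patterns = 2^N)
end

println(enumerate_patterns(3))   # (survivors = 6, wipeouts = 2, patterns = 8)
println(enumerate_patterns(4))   # (survivors = 16, wipeouts = 4, patterns = 16)
```

Dividing `survivors` by `N * patterns` gives the surviving proportion: \(6/24\) for three chicks and \(16/64\) for four.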
After that long and twisty detour through the combinatorics of chicken pecking, we are right back where we started. The probability of surviving unpecked after a single round of pecking is \(\frac{1}{4}\) for any \(N \gt 2\). All of my fretting about finite-size effects and odd-even disparities was a waste of time. So why have I inflicted it on you? Well, although those worries turn out to be unfounded, they are not farfetched. Making just a small change to the pecking protocol leads to a different outcome. Let the pecking be sequential rather than simultaneous. Some designated chick initiates the sequence of pecks, and then the birds take turns, proceeding clockwise around the circle. When a chick’s turn comes, if it has already been pecked, it does nothing. If it is unpecked, it pecks either its left or its right neighbor, choosing randomly. The round ends when every chick has had a turn.
For \(N = 2\) it’s easy to see that the first chick to peck always survives and the other chick always dies, for a survival rate of \(\frac{1}{2}\). With a little more pencil-and-paper chicken scratching, you can establish that the 50 percent survival rate also holds for \(N = 3\). Looking at very large values of \(N\), computer experiments indicate that the survival fraction again approaches \(\frac{1}{2}\) as N goes to infinity. Between these extremes, however, there’s some funny business:
At \(N = 4\) the survivor rate dips below 0.47. (The exact probability is \(\frac{30}{64} = 0.46875\).) This is a minimum. But as the rate recovers back toward 0.5, there is some telltale wiggling in the curve that reveals an odd-even bias: The survival probability is depressed further for even N than for odd N. This is just the kind of behavior I was looking for (but not finding) in the original Mathcounts version of the problem.
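The sequential protocol can also be worked out exactly for small rings. This is a sketch under my reading of the rules above: chick 0 goes first, turns proceed around the ring in index order, a pecked chick skips its turn, and an unpecked chick picks either neighbor with probability one-half. The name `sequential_survival` is hypothetical.

```python
from fractions import Fraction


def sequential_survival(n):
    """Exact survival fraction when chicks peck one at a time around the ring."""

    def expected_survivors(turn, pecked):
        if turn == n:
            return Fraction(n - len(pecked))
        if turn in pecked:
            # An already-pecked chick does nothing on its turn.
            return expected_survivors(turn + 1, pecked)
        left, right = (turn - 1) % n, (turn + 1) % n
        return (expected_survivors(turn + 1, pecked | {left})
                + expected_survivors(turn + 1, pecked | {right})) / 2

    return expected_survivors(0, frozenset()) / n


for n in range(2, 11):
    print(n, float(sequential_survival(n)))
```

The recursion branches at most twice per chick, so it is practical only for small \(N\); larger rings call for simulation.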
Let us now take up Ellenberg’s problem of iterated pecking (using the simultaneous rather than the sequential protocol). We already know that after the first round we can expect to find about one-fourth of the chicks still unpecked. Clearly, the unpecked fraction cannot increase after multiple rounds. Thus in the final state the expected surviving fraction \(\bar{S}\) must lie somewhere between zero and \(\frac{1}{4}\).
It’s helpful to look at a typical configuration of pecked (●) and unpecked (○) chicks after a single round of synchronized pecking:
●○●●●○○●●●●○●●○●●○○●●●○○●●●●●○●●●●●●●●●○○●●●●●●○○●●●○●●●●●●○●●●○○●●●●●
(You’ll have to use your imagination to connect the left and right ends of this array and thereby form a ring.) Notice that there are long strings of pecked chicks, but the unpecked chicks appear in only two configurations. They are either singletons (●○●) or pairs (●○○●). The cause of this pattern is not hard to understand. After a round of pecking, a group of three consecutive unpecked chicks (●○○○●) is impossible. The middle chick must have pecked either left or right, and so it cannot have two unpecked neighbors.
These constraints simplify the analysis of subsequent rounds. The singletons are essentially immortal and unchangeable: The unpecked chick in the middle can never be pecked, and the pecked neighbors can never be unpecked. For the pairs, there are four possible fates, corresponding to the four ways the two active chicks could choose to peck:
In any one round, all four of these events have the same probability, namely \(\frac{1}{4}\). The first three result states are terminal, in the sense that further rounds of pecking will leave them unchanged. In the fourth case we are left with an adjacent pair again, which will therefore face the same set of choices in the next round. Eventually, as the number of rounds goes to infinity, the fourth case must yield one of the other outcomes, and thus in the long run we can consider the fourth case to have probability zero and each of the other three cases to have probability \(\frac{1}{3}\).
And now it’s time to bring all these contingent events together and work out a chicken’s long-term probability of survival. The diagram below presents the scheme. In the first round of pecking, three-fourths of the chicks are eliminated immediately. Of the remaining one-fourth, half are singletons, which survive indefinitely. The other surviving chicks are members of pairs, with another pecking chick as either a right neighbor or a left neighbor.
The lower part of the diagram summarizes the effect of all subsequent rounds, which are assumed to continue until all pairs have been either annihilated or reduced to singletons. (I call this pecking to completion.) For each pathway that leads to a surviving singleton, the probability is the product of the individual probabilities encountered along that pathway. There are three such pathways, with probabilities \(\frac{1}{8}, \frac{1}{48}\), and \(\frac{1}{48}\), for a sum of \(\frac{1}{6}\).
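Written out as arithmetic, the bookkeeping runs as follows. This is a restatement of the argument above, not a new result: a pair that keeps reverting to a pair contributes a geometric series, and \(\frac{1}{16}\) is the fraction of all chicks that end the first round as the left (or right) member of a pair.

\[\frac{1}{4}\sum_{k=0}^{\infty}\left(\frac{1}{4}\right)^k = \frac{1/4}{1 - 1/4} = \frac{1}{3}, \qquad \bar{S} = \frac{1}{8} + \frac{1}{16}\cdot\frac{1}{3} + \frac{1}{16}\cdot\frac{1}{3} = \frac{1}{8} + \frac{1}{48} + \frac{1}{48} = \frac{1}{6}.\]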
I have to confess that I did not come up with this analysis—or with the correct answer—on my first try. I was able to work it out only after I had run a simulation and thus knew what I was looking for. Even then I had trouble with double counting.
Here are the simulation results:
\(R\) | \(\bar{S}\) (percent) |
---|---|
100 | 16.53 |
10,000 | 16.6835 |
1,000,000 | 16.664404 |
100,000,000 | 16.66701664 |
Again note that accuracy seems to improve as the square root of the sample size, although the variance here is larger than in the single-round experiment.
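For readers who want to reproduce numbers of this kind, here is a minimal simulation sketch. It is not the program that produced the table. It assumes that only unpecked chicks keep pecking, and that the process stops once no two unpecked chicks are adjacent. The function names, ring size, and seed are arbitrary choices.

```python
import random


def survivors_on_ring(n, rng, to_completion=True):
    """Count unpecked chicks on a ring of n after simultaneous pecking."""
    unpecked = [True] * n
    while True:
        # Every still-unpecked chick pecks a random neighbor, all at once.
        targets = {(i + rng.choice((-1, 1))) % n for i in range(n) if unpecked[i]}
        for t in targets:
            unpecked[t] = False
        # Nothing more can change once no two unpecked chicks are adjacent.
        adjacent = any(unpecked[i] and unpecked[(i + 1) % n] for i in range(n))
        if not to_completion or not adjacent:
            return sum(unpecked)


def mean_survival(n, repetitions, seed=1, to_completion=True):
    rng = random.Random(seed)
    total = sum(survivors_on_ring(n, rng, to_completion) for _ in range(repetitions))
    return total / (n * repetitions)


print(mean_survival(200, 50))
print(mean_survival(200, 50, to_completion=False))
```

Passing `to_completion=False` stops after the first round, so the same function covers both versions of the problem.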
What about finite-size effects? In circles with only two or three members, the fate of the chicks is fully decided after a single round of pecking: \(\bar{S}\) is 0 and \(\frac{1}{4}\) respectively. Thus these smallest rings escape the \(\frac{1}{6}\) rule, but it appears that circles of all larger sizes converge to \(\frac{1}{6}\). There’s no evidence of even-odd discrepancies.
Another approach to understanding the iterated chicken-pecking problem is through the theory of Markov chains. For a ring of \(N\) chicks we list all \(2^N\) states of the flock and assign a probability to each transition between states. Consider a ring of four chicks, which has 16 states. Symmetries allow us to consolidate some sets of states, and other states can be ignored because they are unreachable from the starting state of four unpecked chicks (○○○○).
Only the four states in the red box need to be retained in the model. The transitions between them are recorded in a directed graph, where each arrow is labeled with the corresponding probability. Note that the starting state has only outgoing arrows; there is no way to re-enter the state once you leave. The one-survivor and no-survivor states are absorbing: The only outgoing arrow leads directly back to the same state; thus, once you reach one of those states, you never escape it.
The essential information from the directed graph can be captured in a \(4 \times 4\) matrix, where the rows and columns are labeled with the four states, and the matrix entries represent the probability of a transition from the row state to the column state. The entries in each row sum to 1, as they must if they are to represent probabilities.
The pattern of zero entries in the transition matrix implies that certain states can’t be reached from other states, even by an indirect route. For this reason the Markov model is said to be irregular. That’s a bit awkward, because regular Markov models are easier to analyze and understand. In a regular model, when you take successive powers of the transition matrix, it converges to a steady state, where all the rows are identical and every column consists of a single, repeated value. This fixed point reveals the system’s long-term probability distribution. An irregular Markov model may not even have a stable limiting distribution, but this one does, and it seems to offer some insight. Every ring of four chickens must wind up in one of the two absorbing states. With probability two-thirds that terminal state will be the one with a single survivor, and with probability one-third the one with no survivors. This result is consistent with the finding that one-sixth of the chickens survive unpecked.
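The limiting behavior can be checked by raising a transition matrix to a high power. The matrix below is a reconstruction, not a copy of the one in the figure: the first row uses the counts from the 16-pattern enumeration (4 patterns leave a pair, 8 leave one survivor, 4 leave none), and the second row uses the four equally likely fates of a pair. The state ordering is also an assumption.

```python
from fractions import Fraction as F

# States, in order: start (four unpecked), adjacent pair, one survivor, no survivors.
P = [
    [F(0), F(1, 4), F(1, 2), F(1, 4)],  # from the start state
    [F(0), F(1, 4), F(1, 2), F(1, 4)],  # from an adjacent unpecked pair
    [F(0), F(0), F(1), F(0)],           # one survivor: absorbing
    [F(0), F(0), F(0), F(1)],           # no survivors: absorbing
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matrix_power(m, exponent):
    result = [[F(int(i == j)) for j in range(len(m))] for i in range(len(m))]
    for _ in range(exponent):
        result = matmul(result, m)
    return result


limit = matrix_power(P, 40)
print([float(x) for x in limit[0]])
```

The first row of the result gives the long-run distribution from the starting state. Its expected survivor count, two-thirds of a chick out of four, is the one-sixth fraction again.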
So, finally, that wraps it up, right? Both the contest problem and Ellenberg’s iterative extension asked for the expected number of surviving chickens, and we have supplied the answers: for a circle of \(N\) chickens, the expected number of survivors \(\bar{S}\) is \(\frac{N}{4}\) after a single round of pecking and \(\frac{N}{6}\) upon pecking to completion. Ironically, though, the expected value of a probabilistic process doesn’t necessarily tell you what to expect. Consider a simpler problem: When you flip a fair coin 100 times, how many heads do you expect to see? The obvious answer is 50, and it’s correct in the sense that no other number has a higher likelihood of correctly predicting the outcome of the experiment. However, the probability of seeing exactly 50 heads is only about 0.08, and thus some other number will turn up more than 90 percent of the time.
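The coin-flip figure is quick to confirm from the binomial coefficient. A small sketch; the helper name `prob_exact_heads` is made up for this example.

```python
from math import comb


def prob_exact_heads(flips, heads):
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    return comb(flips, heads) / 2 ** flips


print(round(prob_exact_heads(100, 50), 4))
```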
Instead of looking only at the expected value, let’s examine the range of possible \(S\) values in the pecking game. We’ve already established that zero survivors is a possible outcome, so that forms a lower bound. What is the upper bound—the maximum number of survivors? In the single-round process, every chick pecks, and so after that round every chick must have at least one pecked neighbor. On the basis of this fact I claim that the surviving population can never be greater than \(\frac{N}{2}\). (Do you agree? It took me a while to persuade myself it’s true.)
If \(S\) can never be greater than \(\frac{N}{2}\), the next question is whether it can ever attain that bound. And if we can have equal numbers of pecked (●) and unpecked (○) chicks, how are they arranged in the ring? It’s tempting to propose the following configuration:
●○●○●○●○●○●○
This is a stable state: The unpecked chicks can never be pecked, so no further changes are possible. And the fraction of survivors is \(\frac{1}{2}\). But there’s a problem with this pattern: It cannot be reached from the starting state. Look at any of the black pecked chicks and ask yourself: Which of its neighbors did it peck? Neither of them, evidently, since they are both unpecked. But that’s not possible, given that every chicken must peck in the first round.
Although the alternating black and white arrangement is ruled out, we’re on the right track. There’s another configuration that also leaves one-half of the chicks unpecked after a single round, and that pattern is achievable from the starting state:
●●○○●●○○●●○○
When you join the ends to form a ring, every chick, whether pecked or not, has one pecked neighbor. It turns out this is the only way, after allowing for some obvious symmetries, to reach 50 percent survivorship. (Strictly speaking, 50 percent is attainable only when \(N\) is divisible by 4, but the maximum of \(S\) is never less than \(\frac{N-2}{2}\).)
When the pecking continues to completion, the upper bound of \(S = \frac{N}{2}\) is no longer reachable. Suppose we tried to maintain \(\frac{N}{2}\) over multiple rounds of pecking. Clearly we would have to start in the first round with the maximal-survivor state ●●○○●●○○●●○○. However, at least half of the unpecked chicks in this configuration must succumb in subsequent rounds, leaving no more than \(\frac{N}{4}\) survivors.
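For small rings the one-round bound can be tested by brute force. The sketch below tries every left/right pattern and records the largest number of unpecked chicks; `max_one_round_survivors` is a name chosen here, and the search is exponential in \(N\), so it is only a sanity check.

```python
from itertools import product


def max_one_round_survivors(n):
    """Largest possible number of unpecked chicks after one simultaneous round."""
    best = 0
    for choice in product((-1, 1), repeat=n):
        pecked = {(i + step) % n for i, step in enumerate(choice)}
        best = max(best, n - len(pecked))
    return best


for n in range(3, 13):
    print(n, max_one_round_survivors(n))
```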
Does this argument mean that \(S = \frac{N}{4}\) is the greatest possible after pecking to completion? No, it doesn’t. There’s another pattern where one of every three chicks survives:
●●○●●○●●○●●○
This configuration is reachable in a single round and stable indefinitely, since none of the pecking chicks has any pecking neighbors. No other arrangement has a higher density of survivors once the pecking process goes to completion.
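The one-in-three ceiling can likewise be checked exhaustively for small rings by exploring every state reachable from the all-unpecked start. This is a sketch assuming that only unpecked chicks peck and that a state is final when no two unpecked chicks are adjacent; the function name is hypothetical.

```python
from itertools import product


def max_final_survivors(n):
    """Largest unpecked count among terminal states reachable from all-unpecked."""
    start = (True,) * n
    seen, frontier, best = {start}, [start], 0
    while frontier:
        state = frontier.pop()
        active = [i for i in range(n) if state[i]]
        if not any(state[i] and state[(i + 1) % n] for i in range(n)):
            best = max(best, len(active))  # terminal: no adjacent unpecked pair
            continue
        # Try every combination of left/right choices by the unpecked chicks.
        for choice in product((-1, 1), repeat=len(active)):
            pecked = {(i + step) % n for i, step in zip(active, choice)}
            nxt = tuple(alive and i not in pecked for i, alive in enumerate(state))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return best


for n in (6, 9):
    print(n, max_final_survivors(n))
```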
To summarize: After one round of pecking the number of surviving chicks must lie somewhere between zero and \(\frac{N}{2}\), and the expected number \(\bar{S}\) is right in the middle at \(\frac{N}{4}\). After all further rounds of pecking are completed, the count of unpecked chicks is between zero and \(\frac{N}{3}\), with the expected value again in the middle, at \(\bar{S} = \frac{N}{6}\).
“How many chickens survive?” is a question that seems to call for a numeric answer, but in truth the most informative response is not a number at all; it is a distribution:
Each curve records the results of a million experiments with a ring of 100 chicks, giving the frequency of each possible value of \(S\). As expected, the one-round distribution has a peak at 25 survivors, and the iterated curve peaks at 17 (the closest integer to \(\frac{100}{6}\)). Note that the red curve is not only shifted to the left but is also slightly taller and narrower.
To get a better view of the details, let’s zoom in. For the sake of smoother curves, I’m going to switch to experiments with \(N = 10{,}000\) chickens. First the green single-round curve, then the red one for the iterated pecking experiment:
With the larger value of \(N\), the curves now peak at 2500 and at 1666.67—exactly the positions expected for \(\frac{N}{4}\) and \(\frac{N}{6}\). Finding the peaks at these positions is no surprise, but what governs the width and the overall shape of the curve? In other words, what is the mathematical nature of the distributions?
One guess that’s always worth a try is the normal (or Gaussian) distribution. For the pecking problem, a normal distribution defines \(P(S)\), the probability of observing \(S\) survivors, as follows:
\[P(S) = \frac{1}{\sigma\sqrt{2 \pi}} \exp\left[-\frac{1}{2}\left(\frac{S - \mu}{\sigma}\right)^2\right].\]
That’s a pretty messy equation for such a familiar concept, but it’s possible to tease out the basic meaning. The equation defines a symmetric curve with a peak where \(S\) is equal to \(\mu\), the mean of the distribution. The width of the peak depends on \(\sigma\), the standard deviation. Because the area under the curve is a constant, \(\sigma\) also effectively determines the height: A narrower peak has to be taller.
We can fit a normal distribution to the pecking data using a procedure that finds the optimal values of \(\mu\) and \(\sigma\)—those that minimize the discrepancy between the data points and the mathematical model. In the two graphs below the fitted models are superimposed on the two data plots, first for one round of pecking and then for pecking to completion:
The fits appear to be quite close indeed, with the theoretical curves splitting the experimental ones from end to end. In some sense this result has to be counted a success, and yet I don’t find this approach to the problem fully satisfying. The normal curve provides a very good descriptive model of the pecking process, but not a predictive or explanatory one. Remember, the curve is fitted to the data, not the other way around. I see no obvious way to construct a specific normal distribution from what I know about the underlying interactions of pecking chickens. In particular, where do the values of \(\sigma\) in the two models come from? Why is \(\sigma \approx 25\) in the one-round model and \(\sigma \approx 23.6\) in the iterated model? These values look like free parameters, which we have to tune to suit the data. Moreover, they will differ for every value of \(N\). Another issue: the normal curve is a continuous distribution, defined over the entire real number line. The pecking function is discrete; it makes sense only for integer numbers of chickens.
Let’s set aside the normal curve and consider another plausible model: the binomial distribution, which is discrete, and which turns up in many probabilistic contexts. Suppose you roll 10,000 dice and count how many of them come to rest with a 1 showing on the upper face. When you repeat this experiment many times, the expected number of 1s is one-sixth of 10,000, the same as the expected number of survivors in the iterated chicken-pecking experiment. With dice, there’s a well-known mathematical expression that defines not just the expected value but also the form of the entire distribution. Assume that every die has probability \(p\) of showing a \(1\). We are going to roll \(N\) dice and we want to know the probability of seeing \(k\) \(1\)s for any \(k\) between \(0\) and \(N\). The formula that supplies this information is:
\[P(k) = {N \choose k} p^k (1 - p)^{N - k}.\]
Here \(p^k (1 - p)^{N - k}\) gives the probability of any specific arrangement of \(k\) \(1\)s among \(N\) dice. The binomial coefficient \(N \choose k\), equal to \(\frac{N!}{k!\,(N-k)!}\), counts the number of such arrangements.
With \(N = 10000\) and \(p = \frac{1}{6}\) we get a curve showing the outcome of the dice-rolling experiment mentioned above. Perhaps the same curve also describes what happens to the iterated pecking model, which has the same expected value? Alas no.
The binomial curve is wider and flatter than the distribution of iterated pecking survivors. What has gone wrong? When I first saw the graph, I had an inkling. As noted above, the binomial coefficient \(N \choose k\) counts all the ways of choosing \(k\) items from a set of size \(N\). This is appropriate for an experiment with dice, since all possible arrangements of \(k\) successes among \(N\) trials are equally likely. In particular, when you roll \(10{,}000\) dice, you could conceivably see no \(1\)s at all, or all \(10{,}000\) dice could land with a \(1\) showing face up; the entire range of outcomes has probability greater than zero.
The pecking problem is different. It’s not possible for 100 percent of the chickens to remain unpecked. Thus only a subset of the \(N \choose k\) arrangements are attainable. If the binomial distribution is going to work in this context, we need to adjust it somehow to include only the feasible outcomes.
With the thought that it’s easier to solve a problem if you already know the answer, I tried fiddling with the parameters of the distribution to see how the graph responded. My goal was to squeeze the curve into a narrower and taller profile while keeping it centered at the same mean. The mean is equal to \(Np\), so if we decrease \(N\) we have to increase \(p\) by the same factor. Here are the results of some experiments:
The dark green curve is the one we’ve already seen, for a binomial distribution with \(N = 10000\) and \(p = \frac{1}{6}\). Going to \(N = 5000\) and \(p = \frac{1}{3}\) appears to be a step in the right direction, and \(N = 3333\) and \(p = \frac{1}{2}\) is even better. Then, with \(N = 2500\) and \(p = \frac{2}{3} \ldots\) Bingo! The yellow curve is an excellent match to the pecking data. Thus it appears we can predict the survivorship of an \(N\)-member pecking ring by constructing a binomial distribution with parameters \(N^\prime = \frac{N}{4}\) and \(p^\prime = 4p\).
I can pull the same trick to find a binomial distribution that matches the single-round pecking data. This time the magic numbers that bend the curve to the correct trajectory are \(N^\prime = \frac{N}{3} = 3333\) and \(p^\prime = 3p = \frac{3}{4}\).
Unlike the normal distribution, the binomial model is constructive, or predictive. From the two parameters \(N^\prime\) and \(p^\prime\) we can calculate both the mean of the distribution and the standard deviation. The mean is simply \(N^\prime p^\prime\); the standard deviation is \(\sqrt{N^\prime p^\prime (1 - p^\prime)}\). For the example of the \(10{,}000\) chickens pecking to completion, the mean \(\mu\) works out to \(1{,}666.666 \dots\) (as expected), and the standard deviation \(\sigma\) is \(23.570226\). (The fitted normal distribution had \(\sigma = 23.567337\).) For the single-round case, \(\mu\) is exactly \(2500\) and \(\sigma\) is \(25\). (To avoid roundoff errors, I am taking \(N\) to be \(9999\) instead of \(10{,}000\).)
Hooray, eh? At last we have a formula for calculating the shape and location of the chicken-pecking distribution, based on a few simple parameters, \(N^\prime\) and \(p^\prime\). But I’m still grumpy, indeed more perplexed and frustrated than ever. Maybe the model explains the data, but what explains the model? With \(10{,}000\) chickens and a first-round survivor probability of \(\frac{1}{4}\), why does the formula call for \(N^\prime = 3333\) and \(p^\prime = \frac{3}{4}\)? Where do those numbers come from? And why \(N^\prime = 2500\) and \(p^\prime = \frac{2}{3}\) for the iterated case?
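The mean and standard deviation formulas are simple enough to evaluate directly. A short sketch comparing the plain dice model with the two rescaled models; the helper name `binomial_mean_sd` is introduced here for illustration.

```python
from math import sqrt


def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial distribution."""
    return n * p, sqrt(n * p * (1 - p))


print(binomial_mean_sd(10000, 1 / 6))  # dice model: same mean, much wider
print(binomial_mean_sd(2500, 2 / 3))   # rescaled model, pecking to completion
print(binomial_mean_sd(3333, 3 / 4))   # rescaled model, single round
```

The first line shows why the unadjusted dice model fails: its standard deviation is roughly 37, well above the 23.6 seen in the pecking data.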
I am embarrassed to admit how long I have spent helplessly flailing and thrashing in the bogs of probability theory, trying to solve these mysteries. (I even turned to a recent book called The Probability Lifesaver, which I highly recommend—but it didn’t save my life.) In the search for answers I have investigated the multinomial extensions of binomials. I have looked into convolutions of distributions and computed contingent probabilities. I have filled whole pads of scratch-paper with soldierly rows of ●s and ○s, searching for patterns that would explain those enigmatic fractions \(\frac{N}{3}\) paired with \(\frac{3}{4}\), and \(\frac{N}{4}\) with \(\frac{2}{3}\). Night after night I’ve gone to bed with a promising idea, only to awaken and recognize a fatal flaw.
Now I believe I do have a correct explanation. It has passed the overnight test several nights in a row. I’m going to reveal it, but not until the end of this essay. Perhaps you’ll figure it out on your own before then. In the meantime, I’m going to widen the horizons of the chicken problem.
Our cozy circle of chickens is a one-dimensional structure. You can go clockwise or counterclockwise around the ring; there are no other meaningful directions in this little universe. Now suppose that instead of getting all our chickens in a row, we arrange them in a grid, an array of columns and rows, covering a region of a two-dimensional surface. To avoid leaving a subset of chickens on the exposed edges of a rectangular array, we can mate the left edge with the right edge and the top edge with the bottom edge. (Topologically, this turns the rectangle into a torus.) Getting real chickens to cooperate in this experiment would be even harder than in the one-dimensional version, but no matter; we’ve long since lost all touch with barnyard reality.
The most important fact about the two-dimensional flock is that each chicken has four neighbors instead of two. With twice as many hostile neighbors, one might well guess that a chicken would be more vulnerable to a pecking attack. On the other hand, each of those neighbors spreads its pecking over twice as many potential targets. How do these competing effects balance out?
For a single round of pecking, we can calculate the survival probability in the same way we did for the one-dimensional system. A chick remains unpecked only if all of its neighbors turn elsewhere to peck. Each neighbor does so with probability \(\frac{3}{4}\), and so the probability that all of them turn away is \(\left(\frac{3}{4}\right)^4\). Numerically, this works out to about 0.3164, compared with 0.25 in the circle. Thus the fraction surviving is greater in two dimensions than in one; the distraction of having more targets outweighs the danger of having more attackers. The distribution observed in computer experiments confirms this finding.
Here’s what a \(40 \times 40\) lattice of chicks looks like after a single round of pecking.
There are 1,600 chicks in the two-dimensional array. If you count the unpecked ○s, you’ll find there are 501, for a survival fraction of 0.3131, close to the theoretical value of 0.3164. Simulations confirm the expected survival rate of \(\left(\frac{3}{4}\right)^4\) for \(N \times N\) lattices with any value of \(N\) greater than \(2\). (For the \(2 \times 2\) grid, the survival rate is \(\frac{1}{4}\), as in the one-dimensional system. There’s a reason why!)
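A single round on the torus is straightforward to simulate. This sketch is not the program behind the figure; it assumes each chick pecks one of its four lattice neighbors uniformly at random, with wraparound at the edges. The seed and the number of trials are arbitrary.

```python
import random


def torus_one_round_fraction(size, rng):
    """Fraction of unpecked chicks on a size x size torus after one round."""
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))
    pecked = set()
    for row in range(size):
        for col in range(size):
            dr, dc = rng.choice(steps)
            pecked.add(((row + dr) % size, (col + dc) % size))
    return 1 - len(pecked) / size ** 2


rng = random.Random(2)
trials = [torus_one_round_fraction(40, rng) for _ in range(50)]
mean_fraction = sum(trials) / len(trials)
print(mean_fraction, (3 / 4) ** 4)
```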
When I stare at the pattern above, I notice a certain stringy or loopy texture, with chains of ○s separating blobs of ●s. This might be a trick of the eye and mind, but I think not. In two dimensions the no-three-in-a-row restriction is lifted; the array includes rows and columns with as many as six consecutive unpecked chicks, as well as diagonal lines. But you will not see a solid \(3 \times 3\) block of unpecked chicks, or even a \(3 \times 3\) cross. Such patterns cannot exist because the chick in the middle of the block must have pecked one of its four neighbors. More generally, the system is still bound by the rule that every chick, whether pecked or unpecked, must have at least one pecked neighbor.
Since more chicks survive the first round of pecking in a two-dimensional world, it seems plausible there might also be a greater proportion of survivors when the pecking continues to completion. Let’s try the experiment:
In this \(40 \times 40\) array there are 238 survivors out of 1,600 chicks, which is less than the one-sixth survival rate seen in one dimension. In a sample of a million such pecking grids, I found that the mean survival rate \(\bar{S}\) is about 0.1533. Compare the distributions for one- and two-dimensional systems:
In going from 1D to 2D the peak shifts to the left, with the mean moving from 0.1667 to 0.1533. The 2D hump is also a little taller and skinnier, thus showing reduced variance.
Why stop at two dimensions? Let us ask our ever-accommodating chickens to roost in a three-dimensional lattice, again with opposite boundaries joined to create the 3D equivalent of a toroidal surface. It’s not hard to guess how this experiment is going to turn out. Back in one dimension, where every chick had two neighbors, the fraction of survivors after a single round of pecking was \(\left(\frac{1}{2}\right)^2 = \frac{1}{4}\). In two dimensions, with four neighbors, the corresponding number was \(\left(\frac{3}{4}\right)^4 = \frac{81}{256}\). In the three-dimensional pecking party each chick has six neighbors, so the obvious extrapolation is \(\left(\frac{5}{6}\right)^6 = \frac{15625}{46656}\), with a value of \(\approx 0.3349\). Running the simulation supports this surmise, and shows a clear trend when we construct chicken lattices with still higher numbers of dimensions.
From this series of results we can boldly generalize: When every chick has \(n\) neighbors, the fraction expected to survive a single round of pecking is:
\[\left(\frac{n - 1}{n}\right)^n.\]
As \(n\) increases, this expression converges on a value of approximately \(0.36787944\). Does that number look familiar? It is \(\frac{1}{e}\). (Changing the minus sign to a plus generates \(e\) itself, \(2.71828\).) When I stumbled upon this formula, the sudden appearance of \(\frac{1}{e}\) took me by surprise, but it shouldn’t have. The constant turns up in the same way in a model of rumor spreading that I wrote about some years ago.
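The convergence is easy to watch numerically. A brief sketch, assuming a \(D\)-dimensional lattice gives each chick \(n = 2D\) neighbors, as in the examples above; the function name is made up.

```python
from math import e


def one_round_survival(neighbors):
    """Chance that none of a chick's neighbors pecks it in a single round."""
    return ((neighbors - 1) / neighbors) ** neighbors


for dims in (1, 2, 3, 4, 7, 50, 5000):
    n = 2 * dims
    print(dims, n, round(one_round_survival(n), 6))
print("1/e =", 1 / e)
```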
What about the iterated pecking process in higher dimensions? The fraction of survivors shows a steady decline as the number of dimensions increases:
The proportion of chicks that never get pecked falls from 16.7 percent in one dimension to about half that when we embed our intrepid chickens in seven-dimensional space. In other words, a higher-dimensional space raises the initial survival rate (after one round of pecking), but depresses long-term survival (after pecking to completion). Here’s another way of showing the effect of dimension—tracking the mean number of survivors remaining after each round of pecking in one dimension through seven dimensions.
I can offer a rough, hand-wavy rationale for this trend. If you are a chick in a one-dimensional ring, your chance of surviving the first round of pecking is only \(\frac{1}{4}\), but if you make it through that round, your chance of avoiding a peck in the second round is at least \(\frac{1}{2}\). Why the improvement? It’s because of your own actions: Your pecking in the first round eliminated the threat from one of your two neighbors. Your odds continue improving in subsequent rounds: The longer you last, the greater the chance that you will hang on until all your neighbors are pacified.
The same trend holds in higher dimensions, but the magnitude of the effect tapers off. In four dimensions, for example, you have eight neighbors, and your chance of surviving the first round is \(\left(\frac{7}{8}\right)^8\), or about 0.34. Because you peck one of those neighbors, your probability of making it through the second round is better, but only slightly so: \(\left(\frac{7}{8}\right)^7\), or 0.39.
Looking at the graphs above, one might surmise that as the dimension \(D\) goes to infinity, the number of survivors (after pecking to completion) will drop to zero. To explore this idea, we don’t actually need infinite-dimensional space. What matters most is not the geometric arrangement of the chickens but the number of neighbors, and we can approximate an infinite-dimensional lattice just by declaring that all chickens are nearest neighbors. In other words, the who-pecks-whom graph becomes complete, with an arc from every chick to every other chick. This does seem to be a recipe for annihilation; you can’t be safe as long as even one other chicken continues to peck. But the details of the end game allow a little room for variation. Will there be one survivor or none?
Peter Winkler discusses a similar problem, “Group Russian Roulette,” in Mathematical Puzzles: A Connoisseur’s Collection (p. 33). The actors in his version are not chickens but “armed and angry people,” who engage in rounds of simultaneously shooting random neighbors. Winkler observes that the probability of a survivor does not approach a limit as \(N\) increases. I don’t see this effect in the chicken problem: There is almost always a last chicken standing. What makes the difference, I believe, is that Winkler’s roulette players don’t waste their ammunition on players who have already been shot, whereas the chickens continue to peck at neighbors who don’t peck back.
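The all-are-neighbors version is simple to simulate. The sketch below rests on one stated assumption: each unpecked chick pecks a target chosen uniformly from all the other chicks, whether or not that target has already been pecked. The function name, flock size, and seed are arbitrary.

```python
import random


def complete_graph_survivors(n, rng):
    """Peck to completion when every chick neighbors every other; returns 0 or 1."""
    unpecked = set(range(n))
    while len(unpecked) > 1:
        targets = set()
        for chick in unpecked:
            # Choose uniformly among the other n - 1 chicks, pecked or not.
            other = rng.randrange(n - 1)
            targets.add(other if other < chick else other + 1)
        unpecked -= targets
    return len(unpecked)


rng = random.Random(4)
results = [complete_graph_survivors(50, rng) for _ in range(200)]
print(sum(results) / len(results))
```

Because the last two chicks rarely hit each other in the same round, nearly every run ends with a single survivor.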
Finally, I return to the narrow confines of one dimension and to the mysterious binomial distributions that seem to predict the statistics of chicken pecking in this system. To review: If you roll 10,000 dice and count those that show a \(1\), you can expect to find about 1667. If you put 10,000 chicks in a circle and wait until all the pecking is done, you can expect about 1667 unpecked survivors. The dice experiment is described by a binomial distribution with parameters \(N = 10000\) and \(p = \frac{1}{6}\). The same model doesn’t work for the chickens: The predicted distribution is much broader than the observed one. But that’s not the weird part. The real puzzler is why a different binomial model, with parameters \(N^\prime = 2500\) and \(p^\prime = \frac{2}{3}\), does seem to match the experimental results.
The dice model’s failure to work for chicken pecking is not really a surprise. A key assumption underlying the binomial distribution is that the events or objects being counted are independent. That’s true for dice; one die doesn’t care what the others do. But the circle of pecking chickens is all about interactions between neighbors. If you have already been pecked, that alters the odds that your neighbors will eventually be pecked. Independence enters the binomial distribution through the coefficient \(N \choose k\). Given \(N\) dice with \(k\) of them showing \(1\)s, all possible interleavings of the \(1\)s among the other dice are equally likely; the binomial coefficient counts those arrangements. But given \(N\) chicks with \(k\) of them unpecked, it’s not true that all arrangements are equally likely. Indeed, many patterns, such as ○○○, are impossible.
If neighbor interactions spoil the binomial model with \(N = 10{,}000\) and \(p = \frac{1}{6}\), how are those interactions overcome in the model with \(N^\prime = 2500\) and \(p^\prime = \frac{2}{3}\)? For the longest time I was beguiled by the observation that 2500 is the expected number of survivors after a single round of pecking, and two-thirds of those individuals can be expected to survive all subsequent rounds. Surely, having those two numbers turn up in the binomial distribution cannot be a meaningless coincidence. Maybe not, but I was able to make sense of the situation only when I gave up on that line of inquiry.
What’s needed is a model in which we count the arrangements of 2500 objects, where two-thirds of the objects can be considered successes or survivors. I have found such a model. The objects are not individual chickens. They are groups of four chickens. Consider this set of 4-tuples:
a = ○●●●
b = ●○●●
c = ●●●●
If you select elements from this set at random and string them together, any sequence you create could be an output of the iterated pecking process. A typical result looks like this:
●●●●○●●●●○●●○●●●●●●●○●●●●○●●●●●●●○●●○●●●●○●●●●●●○●●●●○●●●○●●○●●●●●●●○●●
Note that this sequence satisfies all the rules for a flock of chickens that has pecked to completion. All unpecked ○s are singletons, surrounded by pecked neighbors. At least two ●s separate every pair of ○s, and this ensures that every element of the sequence has at least one ● neighbor. There is no way of concatenating any selection of a, b, and c elements that violates these rules. Furthermore, if a, b, and c are chosen with equal probability, the expected proportion of ○s in the sequence is \(\frac{1}{6}\).
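The tuple recipe is easy to try out. Here is a minimal sketch, assuming the three 4-tuples listed above are drawn uniformly at random; the names `TUPLES`, `random_flock`, `valid_flock`, and `survivor_fraction` are invented for the example, and the check treats the sequence as a line rather than a closed circle.

```julia
using Random

# The three 4-tuples from the text: '○' is an unpecked survivor, '●' a pecked chick.
const TUPLES = ["○●●●", "●○●●", "●●●●"]

# Concatenate k tuples chosen uniformly at random.
random_flock(k) = join(rand(TUPLES, k))

# End-state rule: any two survivors are separated by at least two pecked chicks,
# so consecutive '○' positions must differ by 3 or more.
function valid_flock(s::AbstractString)
    pos = findall(==('○'), collect(s))
    return all(pos[i + 1] - pos[i] >= 3 for i in 1:length(pos) - 1)
end

# Proportion of survivors in the sequence.
survivor_fraction(s) = count(==('○'), s) / length(s)

s = random_flock(2_500)            # 2,500 tuples make a flock of 10,000
println(valid_flock(s), "  ", survivor_fraction(s))
```

Every run should report a valid flock, with a survivor fraction that hovers near \(\frac{1}{6}\).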
I am deeply ambivalent about this discovery. On the one hand, it’s always a relief to get to the bottom of a problem that has stumped you. On the other hand, what we have here is a recipe for creating a sequence with the same structure and statistics as the product of the pecking process, but it offers no insight into the nature of that process. There’s no connection with the behavior of the chickens. Worse, it’s not even a true or exact model. Although the curve appears to coincide with the data, it’s only an approximation. The proof of this fact is simple. The binomial distribution with \(N’ = 2500\) and \(p’ = \frac{2}{3}\) has an absolute cutoff at \(2500\). For any number of survivors greater than \(2500\), the model assigns a probability of zero. Yet the flock of \(10{,}000\) pecking chickens can in fact leave up to \(3333\) survivors.
The defect becomes visible in a smaller model, such as this one with \(N = 24\):
The predicted and observed curves exhibit slight mismatches everywhere, but pay particular attention to the right tail of the distribution, where the binomial curve (purple) dives to zero for all survivor numbers greater than six, whereas the experimental data (red) include 6718 instances with seven survivors and 49 instances with eight survivors.
A similar model for the one-round pecking process uses a set of four 3-tuples:
a = ○●●
b = ○●●
c = ●●○
d = ●●●
Again it generates a sequence that looks very much like the outcome of a pecking experiment, but fails to reproduce the tail of the distribution. In the model the highest possible density of survivors is \(\frac{1}{3}\) whereas it should be \(\frac{1}{2}\).
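To see where the density of \(\frac{1}{2}\) comes from, here is a small sketch of a single round. It assumes the one-round rule is that every chick on the ring pecks either its left or its right neighbor, and that a chick survives when neither neighbor pecks it; the function name `unpecked` and the Boolean encoding of the pecks are choices made for this example.

```julia
# One round of pecking on a ring of N chicks (assumed rule: every chick pecks
# its left or right neighbor). right[i] is true when chick i pecks chick i + 1.
# Returns a Bool vector marking the chicks that nobody pecked.
function unpecked(right::AbstractVector{Bool})
    N = length(right)
    map(1:N) do i
        left_nbr  = mod1(i - 1, N)
        right_nbr = mod1(i + 1, N)
        # i survives if its left neighbor pecked left and its right neighbor pecked right
        !right[left_nbr] && right[right_nbr]
    end
end

# Random round: about one chick in four should survive.
N = 10_000
println(count(unpecked(rand(Bool, N))) / N)

# Densest case: the peck pattern left, right, right, left (repeated)
# leaves half the ring unpecked.
dense = repeat([false, true, true, false], 6)
println(count(unpecked(dense)))   # 12 of 24
```

In the densest arrangement the survivors stand in adjacent pairs, a pattern that no concatenation of the 3-tuples can pack as tightly.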
Perhaps you’re thinking that a cute high school problem about chicks pecking their neighbors doesn’t really merit an 8,000-word screed on Markov chains and probability distributions, with tables and equations and 25 graphs and diagrams. That thought has crossed my mind, too. However, I want to add just a few more words to argue that the exercise is not totally frivolous.
First, mathematics does not owe us a tidy, closed-form, one-line solution to every problem, but we’d be foolish to give up the quest too easily. In this case, computer simulations are easy and productive. By running a program for five minutes I can get answers to a multitude of detailed questions, and I don’t have serious doubts about the correctness of those answers. But they don’t help me make the connection between the microscopic mechanisms (a chicken pecks left or right at random) and macroscopic observations (for \(N = 10{,}000\) the survivor count has \(\mu \approx N/6\) and \(\sigma \approx 23.56\)). Richard Hamming’s old chestnut says the purpose of computing is insight, not numbers, but insight is just what I’m missing.
Second, this is not really a problem about chickens, whether real or abstract. It is a gateway to a collection of other many-body problems in statistical physics and dynamical systems and cellular automata.
Finally, I’ve had fun, and what’s the harm in that? Maybe the fun’s not over. What about zombie chickens, whose pecks bring other chickens back to life?
Update 2017-07-11: Carl Witty has worked out the correct probability distribution for the single-round case. See his comment below.
With pencil and paper it’s easy to show that \(6!\) doesn’t work. The factorial of \(6\) is \(1 \times 2 \times 3 \times 4 \times 5 \times 6 = 720\); adding \(1\) brings us to \(721\), which is not a square. (It factors as \(7 \times 103\).) On the other hand, \(7!\) is \(5040\), and adding \(1\) yields \(5041\), which is equal to \(71^2\). This makes for a very cute equation:
\[7! + 1 = 71^2.\]
Continuing on, you can establish that \(8! + 1\), \(9! +1\) and \(10! + 1\) are not square numbers. But to extend the search much further, we need mechanized assistance. Here’s a Julia function that does the obvious thing, generating successive factorials and checking each one to see if it is \(1\) less than a perfect square:
function search_fac_sqr(maxn)
    fac = big(1)                   # bigints needed for n > 20
    for n in 1:maxn
        fac *= n                   # incremental factorial
        r = isqrt(fac + 1)         # floor of sqrt
        if r * r == fac + 1
            println(n, "! + 1 = ", r, "^2 = ", r^2)
        end
    end
    println("That's all folks!")
end
With this tool in hand, let’s check out \(n! + 1\) for all \(n\) between \(1\) and \(100\). Here’s what the program reports:
search_fac_sqr(100)
4! + 1 = 5^2 = 25
5! + 1 = 11^2 = 121
7! + 1 = 71^2 = 5041
That's all folks!
Those are the three cases we’ve already discovered with pencil and paper, and no more are listed. In other words, among all values of \(n! + 1\) up to \(n = 100\), only \(n = 4\), \(n = 5\), and \(n = 7\) yield squares. When I continued the search up to \(n = 1{,}000\), I got exactly the same result: no more squares. Likewise \(n = 10{,}000\) and \(n = 100{,}000\). Allow me to mention that the factorial of \(100{,}000\) is a rather large number, with \(456{,}574\) decimal digits. At this point in the search, I began to grow weary; furthermore, I began to lose hope. When \(99{,}993\) successive values of \(n\) fail to produce a single square, it’s hard to sustain faith that success might be just around the corner. Nevertheless, I persisted. I got as far as \(n = 500{,}000\), whose factorial has \(2{,}632{,}342\) decimal digits. Not one more perfect square in the whole lot.
What can we learn from this evidence, or lack of evidence? Are 4, 5, and 7 the only values of \(n\) for which \(n!\) lies \(1\) short of a perfect square? Or are there more such cases somewhere out there along the number line, maybe just beyond my reach, waiting to be found? Could there be infinitely many? If so, where are they? If not, why not?
To my taste, the most satisfying way to resolve these questions would be to find some number-theoretical principle ensuring that \(n! + 1 \ne m^2\) for \(n \gt 7\). I have not discovered any such principle, but in a dreamy sort of way I can imagine what a proof might look like. Suppose we eliminate the “\(+1\)” part of the formula, and search for integers such that \(n! = m^2\). It turns out there is just one solution to this equation, with \(n = m = 1\). You needn’t bother lathering up your laptop in the quest for larger examples; there’s a simple proof they don’t exist. In any square number, all the prime factors must be present an even number of times, as in \(36 = 2 \times 2 \times 3\times 3\). In a factorial, at least one prime factor—the largest one—always appears just once. (If you’re not sure why, check out Bertrand’s postulate/Chebyshev’s theorem.)
Of course when we put the “\(+1\)” back into the formula, this whole line of reasoning falls to pieces. In general, the factorizations of \(n!\) and of \(n! + 1\) are totally different. But maybe there’s some other property of \(n! + 1\) that conflicts with squareness. It might have something to do with congruence classes, or quadratic residues. From the definition of a factorial, we know that \(n!\) is divisible by all positive integers less than or equal to \(n\), which means that \(n! + 1\) cannot be divisible by any of those numbers (except \(1\)). This observation rules out certain kinds of squares, namely those that have small primes in their factorization. But for all \(n \gt 4\) the square root of \(n!\) greatly exceeds \(n\), so there’s plenty of room for larger factors, as in the case of \(7! + 1 = 71^2\).
Here’s another avenue that might be worth exploring. The decimal representation of any large factorial ends with a string of \(0\)s, formed as the products of \(5\)s and \(2\)s among the factors of the number. Thus \(n! + 1\) must look like
\[XXXXX \ldots XXXXX00000 \ldots 00001,\]
where \(X\) represents any decimal digit, and the trailing sequence of \(0\)s now ends with a single terminal \(1\). Can we figure out a way to prove that a number of this form is never a square? Well, if the final digit were anything other than \(0, 1, 4, 5, 6,\) or \(9\), the proof would be easy, but lots of squares end in \(\ldots 01\), such as \(10{,}201 = 101^2\) and \(62{,}001 = 249^2\). If there’s some algebraic argument along these lines showing that \(n! + 1\) can’t be a square, it will have to be something subtler.
All of the above is make-believe mathematics. I have stirred up some ingredients that look like they might make a tasty confection, but I have no idea how to bake the cake. Perhaps someone else will supply the recipe. In the meantime, I want to entertain an alternative hypothesis: that nothing prevents \(n! + 1\) from being a square except improbability.
The pattern observed in the \(n! + 1 = m^2\) problem—a few matches among the smallest elements of the sequences, and then nothing more for many thousands of terms—is not unique to factorials and squares. Other pairs of sequences exhibit similar behavior. For example, I have tried matching factorials with triangular numbers. The triangulars, beginning \(1, 3, 6, 10, 15, 21, \ldots\), are defined by the formula \(T(m) = m(m + 1)/2\). If we look for factorials that are also triangular, we get \(1! = T(1) = 1\), then \(3! = T(3) = 6\), and finally \(5! = T(15) = 120\). No more examples appear through \(n = 100{,}000\).
What about factorials that are \(1\) less than a triangular, satisfying the equation \(n! + 1 = T(m)\)? I know of only one case: \(2! + 1 = 3\). Broadening the search a little, I found that \(n! + 4\) is triangular for \(n \in \{2, 3, 4\}\), again with no more hits up to \(100{,}000\).
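The triangular searches need only a small change to the factorial-and-square program. This is a sketch rather than the program actually used: it relies on the identity \((2m + 1)^2 = 8\,T(m) + 1\), and the function name `fac_triangular` and its keyword `k` are invented for the illustration.

```julia
# n! + k is triangular exactly when 8(n! + k) + 1 is a perfect square,
# because T(m) = m(m + 1)/2 rearranges to (2m + 1)^2 = 8 T(m) + 1.
function fac_triangular(maxn; k = 0)
    hits = Int[]
    fac = big(1)
    for n in 1:maxn
        fac *= n                   # incremental factorial
        t = 8 * (fac + k) + 1
        r = isqrt(t)
        r * r == t && push!(hits, n)
    end
    return hits
end

println(fac_triangular(100))          # factorials that are triangular
println(fac_triangular(100; k = 1))   # n! + 1 triangular
println(fac_triangular(100; k = 4))   # n! + 4 triangular
```

Turning the triangularity test into a squareness test means `isqrt` can do all the work, just as in the original search.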
For another experiment we can bring back the square numbers and swap out the factorials, replacing them with the ever-popular Fibonacci sequence, \(1, 1, 2, 3, 5, 8, 13, \ldots\), defined by the recurrence \(F(n) = F(n - 1) + F(n - 2)\), with \(F(1) = F(2) = 1\). It’s been known since the 1960s that \(1\) and \(144\) are the only positive integers that are both Fibonacci numbers and perfect squares. Looking for Fibonacci numbers that are \(1\) less than a square, I found that \(F(4) + 1 = 4\) and \(F(6) + 1 = 9\), with no other instances up to \(F(500{,}000)\).
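The Fibonacci version of the search looks much the same. Again this is only a sketch, with the hypothetical name `fib_square` and an offset `k` added to the Fibonacci number before the squareness test.

```julia
# Indices n <= maxn for which F(n) + k is a perfect square, with F(1) = F(2) = 1.
function fib_square(maxn; k = 0)
    hits = Int[]
    a, b = big(1), big(1)          # a = F(n), b = F(n + 1)
    for n in 1:maxn
        r = isqrt(a + k)
        r * r == a + k && push!(hits, n)
        a, b = b, a + b            # step the recurrence
    end
    return hits
end

println(fib_square(1000))          # Fibonacci numbers that are squares
println(fib_square(1000; k = 1))   # Fibonacci numbers 1 less than a square
```

With `k = 0` the hits are the indices of \(1\) (which appears twice) and \(144\); with `k = 1` they are the indices \(4\) and \(6\) mentioned above.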
We can do the same sort of thing with the Catalan numbers, \(1, 1, 2, 5, 14, 42, 132 \ldots\), another sequence with a huge fan club. I find no squares other than \(1\) among the Catalan numbers up to \(n = 100{,}000\); I don’t know if anyone has proved that none exist. A search for cases where \(C(n) + 1 = m^2\) also comes up empty, but there are a few low-lying matches for \(C(n) + k = m^2\) for \(k \in \{2, 3, 4\}\).
Finding similar behavior in all of these diverse sequences changes the complexion of the problem, in my view. If we discover some obscure, special property of \(n! + 1\) that explains why it never lands on a square (for large values of \(n\)), do we then have to invent another mechanism for Fibonacci numbers and still another for Catalan numbers? Isn’t it more plausible that some single, generic cause lies behind all the observations?
But the cause can’t be too generic. It’s not the case that you can take any two numeric sequences and expect to see the same kind of pattern in their intersections. Consider the factorials and the prime numbers. By the very nature of a factorial, none of them except 2! = 2 can possibly be prime, but there’s no obvious reason that \(n! + 1\) can’t be a prime. And, indeed, for \(n \le 100\) nine values of \(n! + 1\) are prime. Extending the search to \(n \le 1000\) turns up another seven. Here is the full set of known numbers for which \(n! + 1\) is prime:
\[1, 2, 3, 11, 27, 37, 41, 73, 77, 116, 154, 320, 340, 399, 427, 872, 1477, \\ 6380, 26951, 110059, 150209\]
They get rare as \(n\) increases, but there’s no hint of a sharp cutoff, as there is in the other cases explored above. Does the sequence continue indefinitely? That seems a reasonable conjecture. (For more on this sequence, including references, see Chris K. Caldwell’s factorial prime page.)
My question is this: Can we understand these curious patterns in terms of mere chance coincidence? The values of \(n! + 1\) form an infinite sequence of integers spread over the number line, dense near the origin but becoming extremely sparse as \(n\) increases. The values of \(m^2\) form another infinite sequence, again with diminishing density, although the dropoff is not as steep. Maybe factorials bump into squares among the smallest integers because there just aren’t enough of those integers to go around, and some of them have to do double duty. But in the vast open spaces out in the farther reaches of the number line, a factorial can wander around for years—maybe forever—and not meet a square.
Let me try to state this idea more precisely. Since \(n!\) cannot be a square, we know that it must lie somewhere between two square numbers; the arrangement on the number line is \((m - 1)^2 \lt n! \lt m^2\). The distance between the end points of this interval is \(m^2 - (m - 1)^2 = 2m - 1\). Now choose a number \(k\) at random from the interval, and ask whether \(n! + k = m^2\). Exactly one value of \(k\) must satisfy this condition, and so the probability of success is \(1/(2m - 1)\), or roughly \(1 / (2 \sqrt{n!})\). Because \(\sqrt{n!}\) increases very rapidly, this probability takes a nosedive toward zero as \(n\) increases. It is represented by the red curve in the graph below. Note that by \(n = 100\) the red curve has already reached \(10^{-80}\).
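The quantities in this argument are easy to compute exactly for small \(n\). A minimal sketch, with the made-up name `square_gap`:

```julia
# For a given n, find the square m^2 just above n! and the width 2m - 1 of the
# interval between (m - 1)^2 and m^2; one k in that interval gives n! + k = m^2.
function square_gap(n)
    fac = factorial(big(n))
    m = isqrt(fac) + 1             # smallest m with m^2 > n!
    return (m = m, k = m^2 - fac, width = 2m - 1)
end

g = square_gap(7)
println(g)                         # m = 71, k = 1, width = 141
println(1 / Float64(g.width))      # chance that a random k hits the square
```

For \(n = 7\) the winning value happens to be \(k = 1\), which is the cute equation \(7! + 1 = 71^2\) again.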
The green curve gives the probability of a collision between Fibonacci numbers and squares; the shape is similar, though it dives off the precipice a little later. The Fibonacci-square curve approximates a negative exponential: Because \(F(n)\) grows like \(\phi^n\), where \(\phi = (\sqrt{5} + 1) / 2 \approx 1.618\), the probability \(1 / (2 \sqrt{F(n)})\) is proportional to \(\phi^{-n/2}\). The factorial-square curve is even steeper because the factorial function is superexponential: \(n!\) grows faster than \(c^n\) for any fixed \(c\).
The blue curve, recording the probability of coincidences between factorials and primes, has a very different shape. In the neighborhood of \(n!\) the average distance between consecutive primes is approximately \(\log n!\), which grows just a little faster than \(n\) itself and very much slower than \(n!\). The probability of collision between factorials and primes is roughly \(1 / \log n!\). The continuous blue curve corresponds to this smooth approximation. The blue dots sprinkled near that line give the probability based on actual distances between consecutive primes.
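The smooth versions of the three curves can be tabulated without ever forming a giant factorial. The sketch below works with logarithms; the function names are invented, and the Fibonacci estimate assumes the standard approximation \(F(n) \approx \phi^n / \sqrt{5}\), which is not stated in the text.

```julia
# Rough collision probabilities, computed with logarithms so that
# huge values of n! never have to be formed.
log_factorial(n) = sum(log, 1:n)                        # natural log of n!

p_fac_square(n) = 0.5 * exp(-0.5 * log_factorial(n))    # 1 / (2 sqrt(n!))
p_fac_prime(n)  = 1 / log_factorial(n)                  # 1 / log n!, needs n >= 2

# Assumption: F(n) is replaced by phi^n / sqrt(5), so 1 / (2 sqrt(F(n)))
# becomes 0.5 * 5^(1/4) * phi^(-n/2).
const PHI = (sqrt(5) + 1) / 2
p_fib_square(n) = 0.5 * sqrt(sqrt(5)) * PHI^(-n / 2)

for n in (10, 50, 100)
    println(n, "  ", p_fac_square(n), "  ", p_fib_square(n), "  ", p_fac_prime(n))
end
```

The first column collapses toward zero, the second falls exponentially, and the third barely moves, which is the contrast the three curves are meant to show.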
What to make of those curves? Is it legitimate to apply probability theory to these totally deterministic sequences of numbers? I’m not quite sure. Before confronting the question directly, I’d like to retreat a few steps and look at a simpler model where probability is clearly entitled to a seat at the table.
Let us borrow one of Jacob Bernoulli’s famous urns, which have room to hold an infinite number of ping pong balls. Start with one black ball and one white ball in the urn, then reach in and take a ball at random. Clearly, the probability of choosing black is \(1/2\). Put the chosen ball back in the urn, and also add another white ball. Now there are three balls and only one is black, so the probability of drawing black is \(1/3\). Add a fourth ball, and the probability of black falls to \(1/4\). Continuing in this way, the probability of black on the \(n\)th draw must be \(\frac{1}{n + 1}\).
If we go on with this protocol forever—always choosing a ball at random, putting it back, and adding an extra white ball—what is the probability of eventually seeing the black ball at least once? It’s easier to answer the complement of this question, calculating the probability of never seeing the black ball. This is the infinite product \(\frac{1}{2} \times \frac{2}{3} \times\frac{3}{4} \times\frac{4}{5} \ldots\), or:
\[P(\textrm{never black}) = \prod_{n = 1}^{\infty} \left(1 - \frac{1}{n+1}\right)\]
The product goes to zero as \(n\) goes to infinity. In other words, in an endless series of trials, the probability of never drawing black is \(0\), which means the probability of seeing black at least once must be \(1\). (“Probability \(1\)” is not exactly the same thing as “certain,” but it’s mighty close.)
Now let’s try a different experiment. Again start with one black ball and one white ball, but after the first draw-and-replace cycle add two white balls, then four white balls, and so on, so that the total number of balls in the urn at stage \(n\) is \(2^n\); throughout the process all of the balls but one are white. Now the probability of never seeing the black ball is \(\frac{1}{2} \times \frac{3}{4} \times\frac{7}{8} \times\frac{15}{16} \ldots\), or:
\[P(\textrm{never black}) = \prod_{n = 1}^{\infty} \left(1 - \frac{1}{2^n}\right)\]
This product does not go to zero, no matter how large \(n\) becomes. Neither does it go to \(1\). The product converges to a constant with the approximate value \(0.288788095\). Strange, isn’t it? Even in an infinite series of draws from the urn, you can’t be sure whether the black ball will turn up or not.
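Partial products make the contrast between the two urns concrete. In this sketch the helper `never_black` (a name made up here) takes a function giving the number of balls in the urn at each round, exactly one of them black.

```julia
# Probability of never drawing the black ball in the first N rounds, when the urn
# holds balls(n) balls (exactly one of them black) at round n.
never_black(balls, N) = prod(1 - 1 / balls(n) for n in 1:N)

println(never_black(n -> n + 1, 1000))      # first urn: telescopes to 1/1001
println(never_black(n -> 2.0^n, 60))        # second urn: levels off near 0.2888
```

Raising the number of rounds keeps shrinking the first result, while the second stops changing after a few dozen rounds.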
These two urn experiments do not correspond directly to any of the sequence coincidence problems described above; they simply illustrate a range of possible outcomes. But we can rig up an urn process that mimics the probabilistic treatment of the factorials-and-squares problem. At the \(n\)th stage, the urn holds \(1 + 2 \sqrt{n!}\) balls, only one of which is black. The probability of never seeing the black ball, even in an infinite series of trials, is
\[\prod_{n = 1}^{\infty} \left(1 - \frac{1}{1 + 2 \sqrt{n!}}\right).\]
This expression converges to a value of approximately \(0.2921426977\). It follows that the probability of seeing black at least once is \(1 - 0.2921426977\), or \(0.7078573023\). (No, that number is not \(1/\sqrt{2}\), although it’s close.)
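A numerical check of the constant depends on how the ball count is rounded. The sketch below makes one specific assumption, flagged in the code: the urn at stage \(n\) holds the integer gap \(2m - 1\) between the squares that bracket \(n!\), rather than the smooth estimate \(1 + 2\sqrt{n!}\). The function name `never_black_factorial` is invented for the example.

```julia
# Assumption: at stage n the urn holds 2*isqrt(n!) + 1 balls, which is the
# distance 2m - 1 between the squares that bracket n!, with m = isqrt(n!) + 1.
function never_black_factorial(N)
    p = 1.0
    fac = big(1)
    for n in 1:N
        fac *= n                           # incremental factorial
        balls = 2 * isqrt(fac) + 1
        p *= 1 - 1 / Float64(balls)
    end
    return p
end

println(never_black_factorial(60))
```

With integer gaps the product appears to settle close to the value quoted above. The smooth estimate alters the first few factors, so it need not agree to many digits; the limit is sensitive to that choice.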
An urn process resembling the factorials-and-primes problem gives a somewhat different result. Here the number of balls in the urn at stage \(n\) is \(\log n!\), again with just one black ball. The infinite product governing the cumulative probability is
\[\prod_{n = 2}^{\infty} \left(1 - \frac{1}{\log n!}\right).\]
On numerical evidence this expression seems to dwindle away to zero as \(n\) goes to infinity (although I’m not \(100\) percent sure of that). If it does go to \(0\), then the complementary probability that the black ball will eventually appear must be \(1\).
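Here is a sketch of why the product should indeed vanish. It assumes only the bound \(\log n! \le n \log n\) and the inequality \(1 - x \le e^{-x}\), and it starts at \(n = 3\), where every factor lies strictly between \(0\) and \(1\) (the single factor at \(n = 2\) is a finite constant and cannot rescue the product):

\[\prod_{n=3}^{N} \left(1 - \frac{1}{\log n!}\right) \le \exp\left(-\sum_{n=3}^{N} \frac{1}{\log n!}\right) \le \exp\left(-\sum_{n=3}^{N} \frac{1}{n \log n}\right) \le \exp\left(-\int_{3}^{N+1} \frac{dx}{x \log x}\right) = \frac{\log 3}{\log (N+1)}.\]

The right-hand side tends to \(0\) as \(N\) grows, but only logarithmically, which would explain why the numerical decline looks so sluggish.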
Some of these results leave me feeling befuddled, and even a little grumpy. Call me old-fashioned, but I always thought that rolling the dice infinitely many times ought to be enough to settle beyond doubt whether a pattern appears or not. In the harsh light of eternity, I would have said, everything is either forbidden or mandatory; as \(n\) goes to infinity, probability goes to \(0\) or it goes to \(1\). But apparently that’s not so. In the factorial urn model the probability of never seeing a black ball is neither \(0\) nor \(1\) but lies somewhere in the neighborhood of \(0.2921426977\). What does that mean, exactly? How am I supposed to verify the number, or even check its first few digits? Running an infinite series of trials is not enough; you need to collect a statistically significant sample of infinite experiments. For an exact result, try an infinite series of infinite experiments. Sigh.
The urn model corresponds in a natural way to the randomized version of the factorial-square problem, where we look at \(n! + k = m^2\) and choose \(k\) at random from an appropriate range of values. But what about the original problem of \(n! + 1 = m^2\)? In this case there’s no random variable, and hence there’s no point in running multiple trials for each value of \(n\). The system is deterministic. For each \(n\) the factorial of \(n\) has a definite value, and either it is or it isn’t adjacent to a perfect square. There’s no maybe.
Nevertheless, there might be a way to sneak probabilities in through the back door. To do so we have to assume that factorials and squares form a kind of ergodic system, where observing one chain of events for a long period is equivalent to watching many shorter chains. Suppose that factorials and squares are uncorrelated in their positions on the number line—that when a factorial lands between two squares, its distance from the larger square can be treated as a random variable, with every possible distance being equally likely. If this assumption holds, then instead of looking at one value of \(n!\) and trying many random values of \(k\), we can adopt a single value of \(k\) (namely \(k = 1\)) and look at \(n!\) for many values of \(n\).
Is the ergodic assumption defensible? Not entirely. Some distances between \(n!\) and \(m^2\) are known to be more likely than others, and indeed some distances are impossible. However, the empirical evidence suggests that the deviations must be slight. The histogram below shows the distribution of distances between a factorial and the next larger square for the first \(100{,}000\) values of \(n!\). The distances have all been normalized to the range \((0, 1)\) and classified in \(100\) bins. There is no obvious sign of bias. Calculating the mean and standard deviation of the same \(100{,}000\) relative distances yields values within \(1\) percent of those expected for a uniform random distribution. (The expected values are \(\mu = 1/2\) and \(\sigma = 1/\sqrt{12} \approx 0.289\); the variance is \(1/12\).)
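A scaled-down version of this experiment fits in a few lines. The sketch below is not the program behind the histogram; it normalizes each distance by the gap \(2m - 1\) between the bracketing squares, which is one plausible reading of the normalization, and the name `relative_distance` is made up.

```julia
# Relative position of n! in the gap below the next larger square:
# (m^2 - n!) / (2m - 1), a number in (0, 1].
function relative_distance(n)
    fac = factorial(big(n))
    m = isqrt(fac) + 1             # smallest m with m^2 > n!
    return Float64((m^2 - fac) // (2m - 1))
end

d = [relative_distance(n) for n in 2:500]
mean_d = sum(d) / length(d)
sd_d = sqrt(sum(x -> (x - mean_d)^2, d) / length(d))
println(mean_d, "  ", sd_d)       # uniform values would give 0.5 and about 0.289
```

A sample of a few hundred factorials is far too small to settle anything, but it is enough to see whether the numbers are in the right neighborhood.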
If this probabilistic approach can be taken seriously, I can make some quantitative statements about the prospects for ever finding a large factorial adjacent to a perfect square. As mentioned above, the overall probability that one or more values of \(n! + 1\) are equal to squares is about \(0.7078573023\). Thus we should not be too surprised that three such cases are already known, namely the examples with \(n = 4, 5,\) and \(7\). Now we can apply the same method to calculate the probability of finding at least one more case with \(n \gt 7\). Let’s make the question more general: “Whether or not I have seen any squares among the first \(C\) values of \(n! + 1\), what are the chances I’ll see any thereafter?” To answer this question, we can just remove the first \(C\) elements from the infinite product and subtract what remains from \(1\):
\[\prod_{n = C+1}^{\infty} \left(1 - \frac{1}{1 + 2\sqrt{n!}}\right).\]
For \(C = 7\), the answer is about \(0.0037\). For \(C = 100\), it’s about \(5.7 \times 10^{-81}\). We are sliding down the steep slope of the red curve.
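Numbers this small disappear if you compute \(1\) minus a product in ordinary floating point, so a numerical check has to work with logarithms. The following sketch uses the smooth estimate \(1 / (1 + 2\sqrt{n!})\) for each stage; the function name `chance_after` and the cutoff `terms` are choices made for the example.

```julia
# Chance of at least one more "black ball" after stage C (C >= 1), using the
# smooth estimate 1 / (1 + 2 sqrt(n!)) for each stage. log1p and expm1 keep the
# tiny probabilities from being rounded away.
function chance_after(C; terms = 200)
    logfac = sum(log, 1:C)                 # natural log of C!
    s = 0.0                                # running sum of log(1 - p_n)
    for n in C+1:C+terms
        logfac += log(n)
        p = 1 / (1 + 2 * exp(0.5 * logfac))
        s += log1p(-p)
    end
    return -expm1(s)                       # 1 minus the truncated product
end

println(chance_after(7))
```

The terms shrink so quickly that a couple of hundred of them are far more than enough; the first term after the cutoff \(C\) dominates the answer.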
As a practical matter, further searching for another factorial-square couple does not look like a promising way to spend time and CPU cycles. The probability of success soon falls into the realm of ridiculously small numbers like \(10^{-1{,}000{,}000}\). And yet, from the mathematical point of view, the probability never vanishes. Removing a finite number of terms from the front of an infinite product cannot change its convergence properties. If the original product converged to a nonzero value, then so will the truncated version. Thus we have wandered into the canyon of maximal frustration, where there’s no realistic hope of finding the prize, but the probabilities tell us it still might exist.
I am going to close this shambling essay by considering one more example—a cautionary one. Suppose we apply probabilistic reasoning to the search for a cube that is \(1\) less than a square. If we were looking for exact matches between cubes and squares, we’d find plenty of them: They are the sixth powers: \(1, 64, 729, \ldots\). But integer solutions to the equation \(n^3 + 1 = m^2\) are not so abundant. One low-lying example is easy to find: \(2^3 + 1 = 3^2\), but after 8 and 9 where can we expect to see the next consecutive cube and square?
The probabilistic approach suggests there might be reason for optimism. Compared with factorials and Fibonaccis, cubes grow quite slowly; the rate is polynomial rather than exponential or superexponential. As a result, the probability of finding a cube at a given distance from a square falls off much less steeply than it does for \(n!\) or \(F(n)\). In the graph below, \(P(n^3 + k = m^2)\) is the orange curve.
Note that the orange curve lies just below the blue one, which represents the probability that \(n!\) lies near a prime. The proximity of the two curves suggests that the two problems—factorials adjacent to primes, cubes adjacent to squares—might belong to the same class. We already know that factorial primes do seem to go on and on, perhaps endlessly. The analogy leads to a surmise: Maybe cube-square coincidences are also unbounded. If we keep looking, we’ll find lots more besides \(8\) and \(9\).
The surmise is utterly wrong. The problem has a long history. In 1844 Eugène Catalan conjectured that \(8\) and \(9\) are the only consecutive perfect powers among the integers; the conjecture was finally proved in 2004 by Preda Mihăilescu. For the special case of squares and cubes, Euler had already settled the matter in the 18th century. Thus, probabilities are beside the point.
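A brute-force search agrees with Euler, for whatever a finite search is worth. This is a minimal sketch with the invented name `cube_plus_one_squares`; ordinary 64-bit integers suffice as long as \(n^3\) stays below about \(9 \times 10^{18}\).

```julia
# Brute-force search for positive n with n^3 + 1 a perfect square.
function cube_plus_one_squares(maxn)
    hits = Int[]
    for n in 1:maxn
        t = n^3 + 1
        r = isqrt(t)               # floor of sqrt
        r * r == t && push!(hits, n)
    end
    return hits
end

println(cube_plus_one_squares(1_000_000))   # only n = 2, as Euler's result requires
```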
All of the questions considered here belong to the category of Diophantine analysis—the study of equations whose solutions are required to be integers. It is a field notorious for problems that are easy to state but hard to solve. Catalan’s conjecture is one of the most famous examples, along with Fermat’s Last Theorem. When Diophantine problems are ultimately resolved, the proofs tend to be non-elementary, drawing on sophisticated tools from distant realms of mathematics—algebraic geometry in the proof of Fermat’s Last Theorem by Andrew Wiles and Richard Taylor, cyclotomic fields in Mihăilescu’s proof of the Catalan conjecture. As far as I know, probability theory has not played a central role in any such proof.
When I started wrestling with these questions a few weeks ago, I did not expect to discover a definitive solution. I’ve certainly fulfilled my expectations! As a matter of fact, in my own head the situation is more muddled now than it was at the outset. The realization that even an infinite series of experiments would not necessarily resolve some of the questions is deeply unsettling, and makes me wonder how much I really understand about probability theory. But that’s hardly unprecedented in mathematics. I suppose I’ll just have to get used to it.
Update: Thanks to a further tip from Tanton, I have learned that the problem has an extensive history, and also a name: Brocard’s problem, after Henri Brocard, who published on it in 1876 and 1885. Ramanujan mentioned it in 1913. Erdős conjectured there are no more solutions. Marius Overholt connected it with the abc conjecture. Bruce C. Berndt and William F. Galway established that there are no more solutions up to \(10^9\). All this comes from the Wikipedia entry on Brocard’s problem. That article also mentions (but does not explain) that the solutions are called Brown numbers.
I have some more reading to do.
Place numbers in the grid so that each outlined region contains the numbers 1 to n, where n is the number of squares in the region. The same number can never touch itself, not even diagonally.
Here is a partially completed example:
The black, pre-printed numbers are the “givens,” supplied by the puzzle creator. I filled in the pencil-written numbers in a sequence of “forced” moves dictated by two simple rules:
At this point in the solution process, with the grid in the state shown above, I was unable to find any other blank squares whose contents could be decided by following these two rules and no others. But I did spot a move based on a different kind of reasoning. Consider the two pairs of open squares marked in color:
The salmon-pink squares must hold the numbers 2 and 5, but it’s not immediately clear which number goes in which square. Likewise the lime-green squares must hold 2 and 4, in one order or the other. I submit that the numbers must have the following arrangement:
How do I justify that choice? Suppose the green 2 and 4 were transposed:
Then the pink 2 and 5 could be placed in either permutation, and no later moves elsewhere in the puzzle would ever resolve the ambiguity. This outcome is not acceptable if we assume the puzzle must have a unique solution. The uniqueness constraint might be expressed as a third rule:
I have vague qualms about this mode of puzzle-solving. It’s surely not cheating, but the third rule has a different character from the others. It exploits an assumed global property of the solution, rather than relying on local interactions. We are not making a choice because it is forced on us; we are choosing a configuration that will force a choice elsewhere.
In this particular puzzle it’s not actually necessary to apply the uniqueness constraint. There is at least one other pathway to a solution—which I’ll leave to you to find. Can we devise a puzzle that requires rule 3? I’m not quite sure the question is even well-formed. All constraint-satisfaction problems can be solved by a mindless brute-force algorithm: Just write in some numbers at random until you reach a contradiction, then backtrack. So if we want to force the solver to use a specific tool, we somehow have to outlaw that universal jackhammer.
The uniqueness constraint is not unique to the Capsules puzzle. I’ve encountered it often in kenkens, and occasionally in sudokus. I even have a sense of deja lu as I write this. I feel sure I’ve read a discussion of this very issue, somewhere in recent years, but I haven’t been able to lay hands on it. Pointers to precedents are welcome.
Addendum 2017-03-19: Jim Propp reminds me of his marvelous Self Referential Aptitude Test. The instructions begin:
The solution to the following puzzle is unique; in some cases the
knowledge that the solution is unique may actually give you a short-cut
to finding the answer to a particular question.
I completed the 20-question puzzle when SRAT first went public some years ago. This morning I found I was able to do it again with no diminution in enjoyment—or effort. I remembered none of the answers or the sequence of deductions needed to find them.
Highly recommended. And while you’re at it, check out Propp’s Mathematical Enchantments blog and his Twitter feed: @JimPropp.
The answer to Tanton’s question is surely No: The series will never again land on an integer. I leaped to that conclusion immediately after reading the definition of the series and glancing at the first few terms. But what makes me so sure? Can I prove it?
I wrote a quick program to generate more terms:
1 2 5/2 17/6 37/12 197/60 69/20 503/140 1041/280 9649/2520 9901/2520 111431/27720 113741/27720 1506353/360360 1532093/360360 1556117/360360 3157279/720720 54394463/12252240 18358381/4084080 352893319/77597520
Overall, the trend visible in these results seemed to confirm my initial intuition. When the fractions are expressed in lowest terms, the denominator generally grows larger with each successive term. Looking at the terms more closely, it turns out that the denominators tend to be products of many small primes, whereas the numerators are either primes or products of a few comparatively large primes. For example:
\[\frac{9649}{2520} = \frac{9649}{2^3 \cdot 3^2 \cdot 5 \cdot 7} \qquad \textrm{and} \qquad \frac{18358381}{4084080} = \frac{59 \cdot 379 \cdot 821}{2^4 \cdot 3 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17}.\]
To produce an integer, we need to cancel all the primes in the factorization of the denominator by matching primes in the numerator; given the pattern of these numbers, that looks like an unlikely coincidence.
But there is reason for caution. Note the seventh term in the sequence, where the denominator has decreased from \(60\) to \(20\). To understand how that happens, we can run through the calculation of the term, which starts by summing the six previous terms.
\[\frac{60}{60} + \frac{120}{60} + \frac{150}{60} + \frac{170}{60} + \frac{185}{60} + \frac{197}{60} = \frac{882}{60}.\]
Then we calculate the mean, and add 1 to get the seventh term:
\[\require{cancel}\frac{882}{60} \cdot \frac{1}{6} = \frac{882}{360} = \frac{\cancel{2} \cdot \cancel{3} \cdot \cancel{3} \cdot 7 \cdot 7}{\cancel{2} \cdot 2 \cdot 2 \cdot \cancel{3} \cdot \cancel{3} \cdot 5} = \frac{49}{20}, \qquad \frac{49}{20} + 1 = \frac{69}{20}.\]
Cancelations reduce the numerator and denominator of the mean by a factor of 18. It seems possible that somewhere farther out in the sequence there might be a term where all the factors in the denominator cancel, leaving an integer.
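The arithmetic for the seventh term is easy to double-check with exact fractions. The following is only a small sketch; the variable names are invented for illustration, and the six input values are simply the first six terms listed earlier.

```python
from fractions import Fraction

# First six terms of the sequence, copied from the list of terms above.
first_six = [Fraction(1), Fraction(2), Fraction(5, 2),
             Fraction(17, 6), Fraction(37, 12), Fraction(197, 60)]

total = sum(first_six)        # equals 882/60 (Fraction stores it reduced, as 147/10)
mean_of_six = total / 6       # 882/360 reduces to 49/20
seventh = mean_of_six + 1     # 69/20

print(total, mean_of_six, seventh)   # prints: 147/10 49/20 69/20
```

Note that Python's `Fraction` always reduces to lowest terms, so the intermediate sum shows up as 147/10 rather than 882/60.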
Another point to keep in mind: For large \(n\), the value of the Tanton function grows very slowly. Thus if integer values are not absent but merely rare, we might have to compute a huge number of terms to get to the next one. Reaching the neighborhood of 100 would take more than \(10^{40}\) terms.
So what do you think? Can we prove that no further integers appear in Tanton’s sequence? Or, on the contrary, might my instant conviction that no such integers exist turn out to be an alternative fact?
I’ve had my fun with this problem. I know the answer now, but I’m not going to reveal it yet. Others also deserve a chance to be distracted, or anaesthetized. I’ll be back in a few days to follow up—unless commenters explain what’s going on so thoroughly there’s nothing left for me to say.
Update 2017-01-30: Okay, pencils down. Not that anyone needs more time. As usual, my readers are way ahead of me. (See comments below, if you haven’t read them already.)
My own slow and roundabout voyage of discovery went like this. I had written a little piece of code for printing out n terms of the series, directly implementing the definition given in James Tanton’s tweet:
from fractions import Fraction as F
from statistics import mean
def tanton(n):
    seq = [F(1)]                      # the sequence starts at 1
    for i in range(n):
        print(seq[i])
        seq.append(mean(seq) + 1)     # next term: 1 + mean of all terms so far
But this is criminally inefficient. On every pass through the loop we calculate the mean of the entire sequence, then throw that work away and do it all again the next time. Once you have the mean of \(n-1\) terms, isn’t there some way of updating it to incorporate the nth term? Well, yes, of course there is. You just have to appropriately weight the new term, dividing by n, before adding it to the mean. Here’s the improved code:
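from fractions import Fraction as F
def faster_tanton(n):
    m = F(1)                          # first term of the sequence
    for i in range(1, n + 1):         # n passes, so n terms are printed, as in tanton(n)
        print(m)
        m += F(1, i)                  # next term = current term + 1/i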
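To see why the update is just a reciprocal, write \(a_k\) for the \(k\)th term and \(m_k\) for the mean of the first \(k\) terms (notation introduced here for illustration only). The definition says \(a_{k+1} = m_k + 1\), so
\[m_{k+1} = \frac{k\,m_k + a_{k+1}}{k+1} = \frac{(k+1)\,m_k + 1}{k+1} = m_k + \frac{1}{k+1},\]
and therefore
\[a_{k+2} = m_{k+1} + 1 = a_{k+1} + \frac{1}{k+1}.\]
Together with \(a_2 = a_1 + 1\), this gives \(a_{i+1} = a_i + 1/i\) for every \(i \ge 1\), which is the increment the loop applies.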
Tracing the execution of this function, we start out with 1, then add 1, then add 1/2, then 1/3, then 1/4, and so on. This is 1 plus the harmonic series. That series is defined as:
\[H_{n} = \sum_{i=1}^{n} \frac{1}{i} = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\]
The first 10 partial sums are:
1 3/2 11/6 25/12 137/60 49/20 363/140 761/280 7129/2520 7381/2520
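As a cross-check, the terms produced by the original definition can be compared with 1 plus the harmonic partial sums. This is a sketch with made-up helper names (`tanton_terms`, `harmonic_numbers`); it uses a plain sum-and-divide for the mean so that it depends on nothing beyond `fractions`.

```python
from fractions import Fraction

def tanton_terms(n):
    """First n terms by the definition: each new term is 1 + the mean of all earlier terms."""
    seq = [Fraction(1)]
    while len(seq) < n:
        seq.append(sum(seq) / len(seq) + 1)
    return seq

def harmonic_numbers(n):
    """H_1 .. H_n as exact fractions."""
    sums, h = [], Fraction(0)
    for i in range(1, n + 1):
        h += Fraction(1, i)
        sums.append(h)
    return sums

terms = tanton_terms(11)
H = harmonic_numbers(10)

# Term k+2 of the sequence should equal 1 + H_(k+1) (lists are 0-indexed).
matches = all(terms[k + 1] == 1 + H[k] for k in range(10))
print(matches)   # prints: True
```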
One fact about the harmonic series is very widely known: It diverges. Although \(H_{n}\) grows very slowly, that growth continues without bound as \(n\) goes to infinity. Another fact, not quite as well known but of prime importance here, is that no term of the series after the first is an integer. The simplest proof shows that when you factor the numerator and the denominator, the denominator always has more \(2\)s than the numerator; thus when the fraction is expressed in lowest terms, the numerator is odd and the denominator even. This proof can be found in various places on the internet, such as StackExchange. There’s also a good explanation in Julian Havil’s book Gamma: Exploring Euler’s Constant.
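The parity argument can be spot-checked numerically, though of course a finite check proves nothing. The sketch below (the function name is hypothetical) tests a slightly sharper form of the statement: for \(2^k \le n \lt 2^{k+1}\), the reduced denominator of \(H_n\) is divisible by \(2^k\) and the numerator is odd.

```python
from fractions import Fraction

def check_harmonic_parity(limit):
    """True if H_n has an odd numerator and a denominator divisible by 2**k
    (with 2**k <= n < 2**(k+1)) for every n from 2 through limit."""
    h = Fraction(1)                      # H_1
    for n in range(2, limit + 1):
        h += Fraction(1, n)              # Fraction keeps h in lowest terms
        k = n.bit_length() - 1           # largest k with 2**k <= n
        if h.numerator % 2 == 0 or h.denominator % 2**k != 0:
            return False
    return True

print(check_harmonic_parity(500))   # prints: True
```

Since an odd numerator over an even denominator can never be an integer, none of these partial sums is one.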
Neither of those sources mentions anything about the origin or author of the proof. When I scouted around for more information, I found more than a dozen sources that attribute the proof to “Taeisinger 1915,” but with no reference to an original publication. For example, a recent paper by Carlo Sanna (Journal of Number Theory, Vol. 166, September 2016, pp. 41–46) mentions Taeisinger and cites Eric Weisstein’s Concise Encyclopedia of Mathematics. The online version of that work does indeed credit Taeisinger with the theorem, but its only reference is to another secondary source, Paul Hoffman’s biography of Erdős, The Man Who Loved Only Numbers. There, on page 157, Hoffman writes, “In 1915, a man named Taeisinger proved. . .” and gives no reference or further identification. So who was this mysterious and oddly named Taeisinger? I have never heard of him, and neither has MathSciNet or the Zentralblatt or the MacTutor math biography pages. In Number Theory: A Historical Approach John J. Watkins gives a slender further clue: The first initial “L.”
After some further rummaging through bookshelves and online material, I finally stumbled on a reference to a 1915 publication I could actually track down. In the Comptes Rendus Mathematique (Vol. 349, February 2011, pp. 115–117) Rachid Aït Amrane and Hacène Belbachir include this item in their list of references:
L. Taeisinger, Bemerkung über die harmonische Reihe, Monatsch. Math. Phys. 26 (1915) 132–134.
When I got ahold of that paper, here’s what I found:
Not Taeisinger but Theisinger!
I still don’t know much of anything about Theisinger. His first name was Leopold; he came from Stockerau, a small town in Austria that doesn’t seem to have a university; he wrote on geometry as well as number theory.
What I do know is that a lot of authors have been copying each other’s references, going back more than 20 years, without ever bothering to look at the original publication.
Spoiler alert: Everybody dies.
The setting is Melbourne, Australia, the southernmost major city on the planet. The entire population of the Northern Hemisphere was wiped out in the war, and airborne radioactivity is slowly creeping across the Equator. Darwin and Cairns, on Australia’s north coast, are already ghost towns, and the people of Melbourne are told they have less than a year to go.
A U.S. submarine takes refuge in Melbourne’s harbor. Over a period of weeks the captain of that vessel, Dwight, forms an attachment to a young woman named Moira. There’s affection on both sides, and maybe passion, but Dwight is determined to remain faithful to his wife and children back in Connecticut. He buys them presents: a diamond bracelet, a fishing rod, a pogo stick. He speaks of them in the present tense. Is Dwight delusional? Not exactly. He knows perfectly well that his family are all dead, and that he’ll never rejoin them except in the sense that he too will soon be dead. But those deaths are abstractions.
He had seen nothing of the destruction of the war . . . ; in thinking of his wife and of his home it was impossible for him to visualize them in any other circumstances than those in which he had left them. He had little imagination, and that formed a solid core for his contentment in Australia.
It’s not just Dwight who lacks imagination—or chooses to ignore the truths it reveals. Moira studies shorthand and typing for a future job that will never exist. Her father harrows fields for crops that will never grow. Another couple plant hundreds of daffodils whose blooms they will never see, and they invest in a lawn mower for grass they’ll never cut.
The author himself seems to share this selective connection to reality. Everyone in his doomed society is unfailingly polite, and usually cheerful. Civilization may be ending, but not civility. There’s not a single act of violence or malice or even selfishness in the entire story. Shute mentions no hoarding or profiteering, much less rape and pillage. No marauding bandits or desperate refugees from the contaminated north descend on this last haven. On the Beach is the antithesis of that other Australian vision of apocalypse: the Mad Max movies. (The first of the series, in 1979, was filmed near Melbourne.)
It’s also worth noting that no one in Shute’s world takes any steps to prolong life. The government is not hollowing out mountains to keep the human germ line going until the atmosphere clears. Families are not digging fallout shelters in the back yard. These last few representatives of Homo sapiens may indulge in a variety of follies, but hope isn’t one of them.
Why am I writing about this sad book just now? Well, obviously, it’s inauguration day. Which feels more like termination day.
The threat of nuclear disaster has continued to shadow us through all the years since Shute wrote his novel. The danger of a planet-scouring war seemed particularly urgent when I was 13 and reading On the Beach for the first time. I stood up in front of my eighth-grade English class to give an oral report on the book. My performance was not interrupted by a duck-and-cover drill, but it could have been.
Now we have handed control of 4,700 nuclear warheads to a petulant brat, and the danger seems greater than ever.
Revisiting that sense of menace is why I picked up the book, but it’s not what has made the strongest impression on me this second time around. I am both drawn to and appalled by the stoic acceptance of Shute’s fictional Melbournites. Given their circumstances, their reaction is not inappropriate. The worst has already happened, there’s nothing they can do to change it, they may as well make the best of it. In the face of certain extinction, what can you do but shrug your shoulders? Maybe the best way of muddling through is just to plant some daffodils.
Given the current mood of the nation and the world, I suddenly find it easier to understand Dwight’s behavior. The urge to pretend is powerful. I too want to believe that life can go on as normal, that I can continue to enjoy the private pleasures of family and friends, that I can retreat to a cozy office or library and lose myself in the world of ideas, in the “less fretful cosmos” of mathematics and science, or art and literature for that matter.
But we are not yet huddled on the beach, the last of the doomed. It’s late, but not yet too late. This is not the moment for resignation and acquiescence. Tomorrow we march!
Carey’s Equality. Has everyone but me known all about this for ages and ages?
In a stationary population, where births equal deaths, the number of individuals who have lived \(a\) years is the same as the number who still have \(a\) years left to live. Here’s a more precise statement from James W. Vaupel of the Max-Planck-Institute for Demographic Research:
If an individual is chosen at random from a stationary population with a positive force of mortality at all ages, then the probability the individual is one who has lived a years equals the probability the individual is one who has that number of years left to live. For example, it is as likely the individual is age 80 as it is the individual has 80 years to live—not 80 years of remaining life expectancy but a remaining lifetime of precisely 80 years.
Is this fact obvious, a trivial consequence of symmetry? Or is it deep and mysterious? Apparently it was not clearly recognized until about 10 years ago, when it was noticed by James R. Carey, a biological demographer at UC Davis and UC Berkeley who was studying the age structure of fruit fly populations. The equality was proved in 2009 by Vaupel. A more general statement of the theorem and a more mathematically oriented proof were published in 2014 by Carey and Arni S. R. Srinivasa Rao of Augusta University.
I learned all this from a wide-ranging talk by Rao: “From Fibonacci to Alfred Lotka and beyond: Modeling the dynamics of population and age-structures.”
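A toy discrete version makes the symmetry behind Carey’s Equality visible. The sketch below rests on assumptions that are not in the sources cited above: a yearly census, one unit of births per year, and a made-up lifespan distribution with a five-year maximum. Everything in it, including the names `lifespan_prob` and `expected_counts`, is hypothetical.

```python
from fractions import Fraction

# Hypothetical lifespan distribution: P(an individual is counted in exactly L censuses).
lifespan_prob = {1: Fraction(1, 10), 2: Fraction(2, 10), 3: Fraction(3, 10),
                 4: Fraction(3, 10), 5: Fraction(1, 10)}
max_life = max(lifespan_prob)

def expected_counts(births_per_year=1):
    """Expected census counts, tallied by age and by years left, in a stationary population."""
    by_age = [Fraction(0)] * max_life
    by_years_left = [Fraction(0)] * max_life
    for life, p in lifespan_prob.items():
        # Someone with this lifespan is counted at ages 0 .. life-1,
        # and at age `age` has life-1-age censuses still to come.
        for age in range(life):
            by_age[age] += births_per_year * p
            by_years_left[life - 1 - age] += births_per_year * p
    return by_age, by_years_left

by_age, by_years_left = expected_counts()
print(by_age == by_years_left)   # prints: True
```

For each lifespan, the ages run from 0 up to the maximum while the years remaining run over the same values in reverse, so the two tallies must agree. That is the symmetry argument in miniature.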
Go with the Green. Every weekday you walk from your home at the corner of 1st Avenue and 1st Street to your office at 9th Avenue and 9th Street. Since your city is laid out with a perfectly rectilinear grid, you have to go eight blocks east and eight blocks north. Assuming you never waste steps by turning south or west, or by straying outside the bounding rectangle, how many routes can you choose from?
It would be quite a chore to count the paths one by one, but combinatorics comes to the rescue. The answer is \(\binom{16}{8}\), the number of ways of choosing eight items (such as eastbound or northbound blocks) from a set with 16 members:
\[\binom{16}{8} = \frac{16!}{(8!)(8!)} = 12{,}870{.}\]
You could walk to work for 50 years without ever taking the same route twice. Which of those 12,870 paths is the shortest? That’s the beauty of the Manhattan metric: They all are. Every such path is exactly 16 blocks long.
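For anyone who would rather count than trust the formula, here is a short sketch that gets the number two ways: from the binomial coefficient, and by dynamic programming over the grid. The function name `count_paths` is invented for this example.

```python
from math import comb

def count_paths(east, north):
    """Count routes that only ever step east or north, by dynamic programming."""
    # ways[i][j] = number of routes to the intersection i blocks east, j blocks north
    ways = [[1] * (north + 1) for _ in range(east + 1)]
    for i in range(1, east + 1):
        for j in range(1, north + 1):
            ways[i][j] = ways[i - 1][j] + ways[i][j - 1]
    return ways[east][north]

print(comb(16, 8), count_paths(8, 8))   # prints: 12870 12870
```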
But just because the routes are equally long doesn’t mean they are equally fast. Suppose there’s a traffic light at every intersection. Depending on the state of the signal, you can proceed either north or east without interruption, but you’ll have to wait for the light to change if you want to cross the other way. A sensible strategy, it seems, is always to go with the green if you can. Following this rule, you will never have to wait for a light unless you are on the north or the east boundary edge of the square.
The street grid with traffic lights came up in a talk by Ivan Corwin of Columbia University, titled “A Drunk Walk in a Drunk World.” The more conventional term for this subject is “random walks in random environments.” In an ordinary random walk (with a nonrandom environment), the walker chooses a direction at each step according to a fixed probability distribution—the same at all sites and at all times. With a random environment, the probabilities vary both with position and with time. In a brief aside, Corwin offered the street grid with traffic lights as an example of a random environment. If the lights are uncorrelated on the time scale of a pedestrian’s progress through the grid, the favored direction at any intersection is an independent random variable. Then the following question arises: If the walker always takes the green-light direction when that’s possible, which paths are the most heavily traveled?
Corwin’s answer is that the walker will likely follow a stairstep path, never venturing very far from the diagonal drawn between home and office. Thus even though the distance metric says all routes are equal, the walker winds up approximating the Euclidean shortest path.
Corwin gave no proof of his assertion, although he did show the result of a computer simulation. After ruminating on the problem for a while, I think I understand what’s going on. One way of thinking about it is to break the 16-block walk into two eight-block segments, then consider the single vertex that the two segments have in common. Suppose the common point is the central intersection at 5th Avenue and 5th Street. There are 70 ways of getting from home to this point, and for each of those paths there are another 70 ways of continuing on to the office. Thus 4,900 paths pass through the center of the grid. In contrast, only one path goes through the corner of 9th Avenue and 1st Street. The same kind of analysis can be applied recursively to show that the initial eight-block segment of the walk is more likely to pass through 3rd Avenue and 3rd Street than through 5th Avenue and 1st Street.
Another way to look at it is that it’s all about the binomial theorem and Pascal’s triangle. The binomial coefficient \(\binom{n}{m}\) is largest when \(m = n/2\), making the “middle-way” paths the likeliest.
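To put numbers on this, suppose (as an assumption, not something taken from Corwin’s talk) that each light independently favors east or north with probability 1/2. During the first eight blocks the walker cannot yet be pinned against a boundary, so the number of eastbound blocks is binomially distributed. A sketch:

```python
from fractions import Fraction
from math import comb

steps = 8
# Probability of having gone k blocks east (and 8-k north) after eight blocks,
# assuming an independent fair coin at every intersection.
dist = {k: Fraction(comb(steps, k), 2**steps) for k in range(steps + 1)}

for k, p in dist.items():
    # Home is 1st Avenue and 1st Street; each eastbound block raises the avenue number.
    print(f"{1 + k}th Ave & {9 - k}th St: {float(p):.4f}")

paths_through_center = comb(8, 4) ** 2
print(paths_through_center)   # prints: 4900
```

Under that assumption the walker is at 5th Avenue and 5th Street with probability 70/256, about 0.27, and at 9th Avenue and 1st Street with probability 1/256. (The ordinal suffixes in the printout are crude; it prints “1th” and “3th.”)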
This argument says that always going with the green will give you the fastest route across town (at least in terms of expectation value), and the route you follow is likely to lie near the diagonal. What the argument doesn’t say is that deliberately biasing your choices so that you stay near the diagonal will get you to work sooner; that’s clearly not true.
When I mentioned Corwin’s example to my friends Dan Silver and Susan Williams, Susan immediately pointed out that the model fails to capture some important features of walking in an urban environment. Streets have two sides, and generally two sidewalks. To get from the southwest corner of an intersection to the northeast corner, you need two green lights. I’m not sure whether the conclusions hold up when these complications are taken into account.
I should add that solving this citified problem was not the main point of Corwin’s talk. Instead, he was addressing the problem of a bartender who wants to build a tavern in rough and ever-changing terrain near the rim of the Grand Canyon. The bartender needs to know how close he can come to the edge without endangering inebriated customers who might wander over the cliff.
TASEP. I’m a sucker for simple models of complex behavior. This week I learned of a new one—new to me, anyway. Jinho Baik of the University of Michigan talked about TASEP, a “totally asymmetric simple exclusion process” (admittedly not the most vividly descriptive name). Here’s what little I understand of the model so far.
The setting is a one-dimensional lattice, which could be either an infinite line or a closed loop of finite size. Some lattice sites are vacant and some are occupied by a particle. (No site can ever host multiple particles.) At random intervals, drawn from an exponential distribution, a particle “wakes up” and tries to move one space to the right (on a line) or one space clockwise (on a loop). The move succeeds if the adjacent site is vacant; otherwise the particle goes back to sleep until the next time the exponential alarm clock rings. Given some initial distribution of the particles, how does that distribution evolve over time?
When I see a model like this one, my impulse is to write some code and see what it looks like in action. I haven’t yet done that, but this is my current understanding of what I should expect to see. If you start with the smoothest possible particle distribution (alternating occupied and vacant sites), the particles will tend to clump together. If you start with a maximally clumpy state (one area solidly filled, another empty), the particles will tend to spread out. Baik and his colleagues seek a more precise description of how the density fluctuations evolve over time. And they have found one! Unfortunately, I’m not yet prepared to explain it, even in my hand-waviest way. The best I can do is refer you to the most recent paper by Baik and Zhipeng Liu.
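For what it’s worth, a first stab at such a simulation might look like the sketch below. It is a minimal version on a closed loop, written under my reading of the rules above rather than taken from Baik’s talk or paper. It relies on one standard fact: if every particle carries an independent exponential clock with the same rate, the next clock to ring belongs to a uniformly chosen particle. Elapsed time is not tracked, and the name `simulate_tasep` and its parameters are invented here.

```python
import random

def simulate_tasep(sites, events, seed=0):
    """Run TASEP on a ring for a given number of clock rings.

    `sites` is a list of 0/1 occupancies; a new list is returned."""
    rng = random.Random(seed)
    sites = list(sites)
    n = len(sites)
    particles = [i for i, s in enumerate(sites) if s]
    if not particles:
        return sites                      # nothing to move
    for _ in range(events):
        k = rng.randrange(len(particles)) # whose alarm clock rings next
        here = particles[k]
        there = (here + 1) % n            # one step clockwise
        if sites[there] == 0:             # exclusion rule: move only into a vacancy
            sites[here], sites[there] = 0, 1
            particles[k] = there
    return sites

clumpy = [1] * 20 + [0] * 20              # maximally clumpy start
result = simulate_tasep(clumpy, 2000, seed=1)
print("".join("#" if s else "." for s in result))
```

Running it from the clumpy start and printing the lattice at intervals should show the block of particles eroding from its leading edge.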
Debunking Guy. If you ever have an opportunity to hear Doron Zeilberger speak, don’t pass it up. At this meeting he gave a spirited and inspiring defense of experimental mathematics, under the title “Debunking Richard Guy’s Law of Small Numbers.” Sitting in the front row was 100-year-old Richard Guy. Neither one of them was in any way daunted by this confrontation. In any case, Doron’s talk was more homage than attack. Later, I had a chance to ask Guy what he thought of it. “His heart is in the right place,” he said.
Guy’s Strong Law of Small Numbers says:
There aren’t enough small numbers to meet the many demands made of them.
As a consequence, if you discover that \(f(n)\) yields the same value as \(g(n)\) for several small values of \(n\), it’s not always safe to assume that \(f(n) = g(n)\) for all \(n\). Euler discovered a cautionary example that’s now well known: The polynomial \(n^2 + n + 41\) evaluates to a prime for all \(n\) from \(-40\) to \(+39\), but the streak ends there: \(n = 40\) gives \(41^2\).
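Euler’s example takes only a few lines to verify. This sketch uses naive trial division, which is plenty for numbers this small; the helper names are made up.

```python
def is_prime(m):
    """Trial division; adequate for the small values tested here."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def euler(n):
    return n * n + n + 41

all_prime_in_range = all(is_prime(euler(n)) for n in range(-40, 40))
print(all_prime_in_range, euler(40), euler(-41))   # prints: True 1681 1681
```

Both of the first values outside the range come out to 1681, which is 41 squared.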
Zeilberger doesn’t deny the risk of mistaking such accidents for mathematical truths. As a matter of fact, he discusses some of the most dramatic examples: the Pisot numbers, some of which produce coincidences that persist for thousands of terms, and yet ultimately break down. But such pathologies are not a sign that “empirical” mathematics is useless, he says; rather, they suggest the need to refine our proof techniques to distinguish true identities from false coincidences. In the case of the Pisot numbers, he offers just such a mechanism.
A paper by Zeilberger, Neil J. A. Sloane, and Shalosh B. Ekhad (Zeilberger’s computer/collaborator) outlines the main ideas of the JMM talk, though sadly it cannot capture the theatrics.
Soundararajan on Tao on Erdős. Take a sequence of +1s and –1s, and add them up. Can you design the sequence so that the absolute value of the sum is never greater than 1? That’s easy: Just write down the alternating sequence, +1, –1, +1, –1, +1, –1, . . . . But what if, after you’ve selected your sequence, an adversary applies a rule that selects some subset of the entries? Can you still count on keeping the absolute value of the sum below a specified bound? This is a version of the Erdős discrepancy problem, which Paul Erdős first formulated in the 1930s.
The question was finally given a definitive answer in 2015 by Terry Tao of UCLA. In the “Current Events” session of the JMM, Kannan Soundararajan of Stanford gave a lucid account of the proof. You can read it for yourself, along with three other Current Events talks, by downloading the Bulletin.
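In the usual statement of the problem, the adversary’s rule is to pick a step size \(d\) and keep only the entries at positions \(d, 2d, 3d, \ldots\). The sketch below (with an invented function name) shows how badly the alternating sequence fares against that adversary.

```python
def discrepancy(seq, max_step):
    """Largest |x_d + x_2d + ... + x_kd| over all steps d <= max_step and all k."""
    worst = 0
    n = len(seq)
    for d in range(1, max_step + 1):
        partial = 0
        for pos in range(d, n + 1, d):    # 1-based positions d, 2d, 3d, ...
            partial += seq[pos - 1]
            worst = max(worst, abs(partial))
    return worst

# The alternating sequence +1, -1, +1, -1, ... of length 100.
alternating = [1 if i % 2 == 1 else -1 for i in range(1, 101)]
print(discrepancy(alternating, 1), discrepancy(alternating, 2))   # prints: 1 50
```

With \(d = 1\) the sums never exceed 1 in absolute value, but with \(d = 2\) the adversary collects nothing but –1s, and the sum grows without limit as the sequence gets longer.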
Proust’s Powdered-Wig Party. Finally, a personal note. In the closing pages of Marcel Proust’s immense novel A la Recherche du Temps Perdu, the narrator attends a party where he runs into many old friends from Parisian high and not-so-high society. He is annoyed that no one told him the party was a costume ball: All of the guests are wearing white powdered wigs, as if they were gathering at the court of Louis XIV. Then the narrator catches sight of himself in a mirror and realizes that he too is coiffed in white.
At these annual math gatherings I run into people I have known for 30 years or more. For some time I’ve been aware that the members of this cohort, including me, are no longer in the first blush of youth. This year, however, the powdered wigs have seemed particularly conspicuous. Everyone I talk to, it seems, is planning for imminent retirement.
But of course this geriatric impression owes more to selection effects than to the aging of the mathematical population overall. Indeed, the corridors here are full of youngsters attending their first or third or fifth JMM. Which brings us back to Carey’s Equality. If we can safely assume that the population of meeting attendees is stationary, then the proportion of people who have been coming to these affairs for 30 years should be equal to the proportion who will attend 30 more meetings.
What is truth? said jesting Pilate, and did not stay for an answer.
Lately there’s been a lot of news about fake news (some of it, for all I know, fake). Critics are urging Facebook, Google, and Twitter to filter out the fraudulent nonsense. This seems like a fine idea, but it presupposes that the employees—or algorithms—doing the filtering can reliably distinguish fact from fiction. Even if they can tell the difference, can we count on the companies to stand up to the prevaricators? Sure, Facebook can block traffic from a clickbait website run by a teenager in Macedonia. But what if the lies were to come from an account registered in the .gov domain?
When misinformation is stamped with the imprimatur of the president or other high government officials, there’s not much hope of shutting it down at the source or breaking the chain of transmission. This problem was not created by the new communication technologies of the internet age, and it is not unique to the incoming Trump administration. I have probably been lied to by every president who has served during my lifetime, and I could name seven of those presidents whose fibs are well documented. But Trump is different. He is not a devious liar, careful not to be caught in a contradiction. He is simply indifferent to truth. When challenged to support a dubious claim, he shrugs or changes the subject. The question of veracity seems not to interest him. And his election suggests that some part of the voting public feels the same way.
What to do? The only practical remedy I can suggest is to work diligently to uncover the truth, to publish it widely, and to help the public reach sound judgments about what to believe. All three of these tasks are difficult, but the last one, in my view, is the real stumper. As the signal-to-noise ratio in public discourse dives toward zero, we would all do well to sharpen our powers of discrimination. But I worry most about that subpopulation for whom strict factual accuracy is not the primary criterion when they choose stories to pass on to their friends and to embrace as the basis of important decisions. I don’t know how to change this, but I feel it’s important to try.
I’d like to begin with a more personal and less political anecdote. Some years ago, when the internet was young, a friend began sending me emails with subject lines like “Save 7 y.o. Jessica Mydek from cancer” or “Fw: Fw: FW: Fw: Bill No. 602P 5-cent tax on every email.” I would reply with a link to the debunking report at Snopes. My friend would thank me and sheepishly apologize, then the next month she would forward a message warning me not to blink my high beams if I saw a car with the headlights off—I’d be attacked by gang members conducting a rite of initiation. Email exchanges like these continued for a year or so, then they tapered off. Had my friend developed a measure of skepticism? Yes, but not in the way I had hoped. She had become skeptical of snopes.com. After all, it’s a website with a funny name, run by smug, self-appointed know-it-alls who make fun of gullible people. Why should she trust them?
Instead of an ad hoc watchdog like Snopes, maybe we should have an official arbiter of factuality, a certified and sanctified public agency. Call it the Ministry of Truth. And let’s give it enforcement powers: No social network or news outlet is allowed to publish anything unless the ministry attests to its accuracy.
Okay, that’s not such a hot idea after all.
In any case, no amount of scrupulous fact-checking would have cured my friend’s addiction to hoax email. There was something in those messages she wanted to believe. Even if 7 y.o. Jessica Mydek doesn’t exist, a world where chain letters can cure cancer is more appealing and empowering than the Snopesian world of grim facts, where you can only watch helplessly while a child dies. When you see a car driving without headlights, it’s more exciting to imagine a murderer at the wheel than a forgetful old fool. I’m sure my friend had her own doubts about some of these breathless pleas and warnings, but she was willing to overlook dodgy evidence or flawed logic for the sake of a good story.
As far as I know, my friend’s lax attitude toward factuality never caused grievous harm to herself or anyone else. But sometimes credulity can be disastrous.
Those who can make you believe absurdities can make you commit atrocities.
—Voltaire (paraphrase)
Let’s talk about Edgar Maddison Welch, the young man who showed up at the Comet Ping Pong pizzeria with a rifle and a handgun. By his own account, he sincerely believed he was going to rescue children being held captive in a basement room and subjected to unspeakable acts by Hillary Clinton and her associates. Where did that idea come from? Apparently it began with leaked emails from the hacked account of John Podesta, Clinton’s campaign chairman. According to an outline in the New York Times, eager sleuths on Reddit and 4chan discovered the phrase “cheese pizza” in the email texts, and recognized it as a code word for “child pornography.” Connecting the rest of the dots was easy and obvious: Podesta had corresponded with the owner of Comet Ping Pong, and Barack Obama had been photographed playing ping pong with a small boy, and so the basement of the restaurant must be where the Democrats slaughter their child sex slaves. However, the would-be rescuer with the AR-15 found no basement kill room—in fact, no basement at all. “The intel on this wasn’t 100 percent,” he told a Times reporter.
In case there’s even the slightest doubt, let me say plainly that I don’t believe a word of that grotesque tale about child abuse in the pizza parlor. Indeed, I can make sense of it only as a stupid joke, a parody, a deliberately preposterous confection. If I were fabricating such a malicious fiction, and if I wanted people to believe it, I would come up with something that’s not such a total affront to plausibility. Yet at least one reader of these fantasies took them in deadly earnest. We’ll never know how many more believe there might be a “grain of truth” in the story, even if specific details are wrong. And the purveyors of the myth are not backing down. In an AP story that ran in the Times on December 9 they propose that the Comet Ping Pong event was a “false flag,” yet another twist in the larger plot:
James Fetzer, a longtime conspiracy theorist who also believes the Sandy Hook school shooting was a hoax, told The Associated Press that Welch’s visit to the pizzeria was staged to distract the public from the truth of the “pizzagate” allegations. . . .
Fetzer and other conspiracy theorists seized on the fact that Welch had dabbled in movie acting as a giveaway that his visit to the restaurant was staged. . . . Blogger Joachim Hagopian, a false-flag proponent, told the AP that conspirators look for “a patsy or stooge” to pose as a lone gunman with an assault rifle. Welch, he said, “fits the pattern” with his acting background.
“He’s got an IMDB (Internet Movie Database) profile,” Hagopian said.
It’s easy to heap ridicule on these ideas. Indeed, by quoting them at length that’s exactly what I’m doing. How could anyone possibly believe in such contrived and convoluted schemes, such teetering towers of improbabilities? But it’s useful to keep in mind that the incredulity goes both ways. The conspiracy theorists would snigger at my naiveté for believing what I read in the New York Times. Anyone who’s paying attention knows that all the big papers and TV networks are parties to the conspiracy. (Snopes is surely in on it too.)
Mathematics alone proves, and its proofs are held to be of universal and absolute validity, independent of position, temperature or pressure. You may be a Communist or a Whig or a lapsed Muggletonian, but if you are also a mathematician, you will recognize a correct proof when you see one.
—Philip J. Davis, American Mathematical Monthly, 79(3):254 (March 1972)
A high-stakes presidential election and accusations of child rape and murder certainly add force and immediacy to a discourse on the nature of truth, but they also distract. I would like to retreat from these incendiary themes, at least for a few paragraphs, and look at the calmer universe of mathematics, where we have well-developed mechanisms for distinguishing between truth and falsehood.
Take the case of angle trisectors—people who claim they can divide an arbitrary angle into equal thirds with the standard Euclidean toolkit of straightedge and compass. In some respects, trisectors are like peddlers of pizza parlor pedophilia, but when a trisector comes before you, you can give a stronger response than: “What you claim is contrary to common sense.” You can offer an absolute refutation: “What you claim is impossible. Pierre Laurent Wantzel proved it 180 years ago.” But I wouldn’t count on the trisector meekly accepting this answer and going away.
A few years ago, writing in American Scientist, I made an earnest effort to explain the Wantzel proof in some detail and in plain words, and I provided an English translation of Wantzel’s own paper from 1837. Soon after the article appeared, I began receiving letters festooned with elaborate geometric diagrams, some of them quite pretty, which the authors presented as proper straightedge-and-compass trisections. I wasn’t surprised at this development, but I was at a loss for how to respond. If a mathematical proof fails to persuade the reader of the truth of a mathematical proposition, what other kind of argument could possibly be more effective?
In the past few weeks I’ve given this incident further thought, and I’ve come to see it in a different light. The task of “persuading the reader,” even in mathematics, is not just about truth; it’s also about trust, or rapport, or social solidarity. The quip by Philip Davis that I reproduce above has long been a favorite of mine, but at this point I am tempted to turn it inside out. What I would say is not “If you’re a mathematician, you’ll recognize a proof” but “If you recognize a proof, you’re a mathematician.” The ability and willingness to engage in a certain style of reasoning, and to accept the consequences of that mental process no matter what the outcome, marks you as a member of the mathematical tribe. And, conversely, if you respond to a proof by saying “It may be impossible but I can do it anyway,” then you are not a member of this particular affinity group.
I am not arguing here that mathematical truth is some kind of socially determined quantity, and no more fundamental than religious or political doctrines. Quite the contrary, I am one of those stubborn prepostmodernists who believes in a reality that’s not just my private daydream. I’m convinced we all share one universe, where certain things are true and others aren’t, where certain events happened and others didn’t. The interior angles of a plane triangle will always sum to 180 degrees no matter what I say. Nevertheless, the process by which we recognize such truths and reach consensus about them is a social one, and it’s not infallible.
The same essay in which I discussed Wantzel’s proof also mentioned the infamous Monty Hall problem.
In 1990 Marilyn vos Savant, a columnist for Parade magazine, discussed a hypothetical situation on the television game show “Let’s Make a Deal,” hosted by Monty Hall. A prize is hidden behind one of three doors. When a contestant chooses door 1, Hall opens door 3, showing that the prize is not there, and offers the player the option of switching to door 2. Vos Savant argued . . . that switching improves the odds from 1/3 to 2/3.
In the following weeks thousands of letter writers berated vos Savant for her blatant error, insisting that the two remaining closed doors were equally likely to conceal the prize. Quite a few of those critics identified themselves as mathematicians or mathematics teachers. Even Paul Erdős took this side of the controversy (although he didn’t write a letter to Parade). But of course vos Savant was right all along.
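For readers who would rather count than argue, here is a minimal simulation sketch in Python. It assumes the usual reading of the puzzle, which the column leaves partly implicit: the host always opens a door that hides no prize, and picks at random when two such doors are available. The function names and the fixed random seed are my own choices for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: the host opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is still closed and was not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=1):
    """Fraction of wins over many independent rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Under those assumptions the two rates should land close to 1/3 and 2/3. Change the host's behavior (say, let him open a door at random, prize or not) and the numbers change too, which is part of why the argument was so hard to settle by intuition alone.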
This story was already well known when I told it in my American Scientist essay, but I have a reason for retelling it yet again now. Along with the mail from angle trisectors I also received irate messages from Monty Hall deniers, who insisted that the probabilities really are 1/2 and 1/2. But this time it wasn’t professional mathematicians who raised objections; they had long since resolved their differences and settled on the correct answer. Now it was outsiders, dissidents, who attacked what they perceived to be an ignorant, entrenched orthodoxy enforced by the professoriat. In other words, the same two factions continued to fight over the same question, but they had switched positions.
The point I’m making here is the unsurprising one that social factors influence judgment. We are all predisposed to go along with the views of those we know and trust, and we are skeptical, at least initially, of ideas that come from outsiders. We listen more attentively and sympathetically when the speaker is a trusted colleague. The scrawled manuscript from an unknown author claiming a simple proof of the Riemann hypothesis gets a cursory reading or none at all. There’s nothing wrong with making such distinctions. The alternative—equal treatment for the competent and the crackpot—would certainly not help advance the cause of truth. But it has to be acknowledged that these practices further alienate outsiders. By pushing them away and closing off the channel of communication—treating them as irredeemables and deplorables—we diminish the chance that they will ever find a path into the community.
Why do the nations so furiously rage together, and why do the people imagine a vain thing?
—Psalms, 2:1, via George Frideric Handel
Do these skirmishes over minor mathematical questions have anything to do with “Fakebook” news that might have turned the tide of a presidential election? I submit there is a connection. In both cases the nub of the problem is not discovering the truth but persuading people to recognize and own it. The mathematical examples show that even the most irrefutable kind of evidence—a deductive proof—is not always enough to win over skeptics or opponents.
Proof is said to “compel belief”: You embrace the result even against your will. Once you grant the premises, and you work through the chain of implications, accepting the validity of each step in turn, you have no choice but to accept the ultimate conclusion. Or so one might think. But this view of proof as an irresistible engine of reason underestimates the flexibility and creativity of the human mind. In fact we are all capable of believing impossible things before breakfast, and denying certainties after dinner, if we choose to. Mathematicians—members of the tribe—promise not to do so, but that pledge is not binding on anyone else.
When I look back over my various encounters with angle trisectors and other mathematical mavericks, I can’t recall a single instance where I successfully persuaded someone to give up an erroneous belief and accept the truth. Not one soul saved. This record of failure does not give me great confidence when I think of venturing forth to combat fake political news, where we don’t even have the secret weapon of deductive proof.
I’m left with the thought that compelling people to acknowledge a truth may be the wrong approach, the wrong attitude. Voltaire was a great hero of free-thinking, but his motto “Écrasez l’infâme!” is a bit too militaristic for my taste. However you choose to translate that phrase, he meant it as a call to arms. Let us crush superstition, wipe out error and ignorance, put an end to fanaticism and irrationality. I’m for all that, but I don’t want to be bludgeoning people into accepting the truth. It doesn’t really change their minds, and at some point they bludgeon you back.
Rather than force the people to give up their false notions and vain things, I would let the truth seduce them. Let them fall in love with it. Doesn’t that sound grand? If only I had the slightest clue about how to make it happen.
I like mathematics largely because it is not human and has nothing particular to do with this planet or with the whole accidental universe—because, like Spinoza’s God, it won’t love us in return.
At this point my only consolation is a cold and severe one. Trump may be indifferent to truth, but the universe, in the long run, is utterly indifferent to him and his foibles. Our new president can declare that climate change is a hoax, and purge government agencies of all those who disagree, but those acts will not lower the concentration of carbon dioxide in the atmosphere.
Mathematical truths are even more aloof from human interference. In Orwell’s 1984 the Thought Police boast of making citizens believe that two plus two equals five. But all the sophistry of the Ministry of Truth and all the torture chambers of the Ministry of Love cannot alter the equation itself. They cannot make two and two equal five.
These are very small islands of certainty in a vast maelstrom of confusion, but they offer refuge, and maybe a place to build from.
I’m glad to be here all the same. The ABC conjecture is one of those beguiling artifacts in number theory that seem utterly simple one moment and utterly baffling the next. As for Shinichi Mochizuki’s 500-page treatise on the conjecture, that’s baffling from start to finish, and not just for me. Four years after the manuscript was released, it remains a proof-on-probation because even the non-non-experts have yet to fully digest it. The workshop here in Burlington is part of the community’s digestive effort. I’m grateful for an opportunity to see the process at work, and naturally I’m curious about the eventual outcome. Will we finally have a proof, or will a gap be discovered? Or, as has happened in other sad cases, will the question remain unresolved? For me the best result will be not just a proof certified by a committee of experts but a proof I can understand, at least in outline, if I make the effort—a proof I might be able to explain to a broader audience.
The drive up to Burlington gave me a few hours of solitude to think about the conjecture and how it fits in with more familiar ideas in number theory. One notable connection is summed up as “ABC implies FLT.” Proving the ABC conjecture will bring us a new proof of Fermat’s Last Theorem, independent of the celebrated Andrew Wiles–Richard Taylor proof published 20 years ago. Interesting. But how are the two problems linked? As I cruised through the chlorophyll-soaked hills of the Green Mountain state, I noodled away at this question.
[Correction: ABC actually implies only that FLT has no more than a finite number of counterexamples, and only for exponent \(n \ge 4.\)]
I have written about the ABC conjecture twice before (2007 and 2012). Here’s a third attempt to explain what it’s all about.
The basic ingredients are three distinct positive integers, \(a\), \(b\), and \(c\), that satisfy the equation \(a + b = c\). Given this statement alone, the problem is so simple it’s silly. Even I can solve that equation. Pick any \(a\) and \(b\) you please, and I’ll give you the value of \(c\).
To make the exercise worth bothering with, we need to put some constraints on the values of \(a\), \(b\), and \(c\). One such constraint is that the three integers should have no factors in common. In other words they are relatively prime, or in still other words their greatest common divisor is 1. Excluding common factors doesn’t really make it any harder to find solutions to the equation, but it eliminates redundant solutions. Suppose that \(a\), \(b\), and \(c\) are all multiples of 7; then dividing out this common factor yields a set of smaller integers that also satisfy the equation: \(a\,/\,7 + b\,/\,7 = c\,/\,7\).
The ABC conjecture imposes a further constraint, and this is where the arithmetic finally gets interesting. We are to restrict our attention to triples of distinct positive integers that pass a certain test. First we find the prime factors of \(a\), \(b\), and \(c\), and cast out any duplicates. For example, given the triple \(a = 4, b = 45, c = 49\), the prime factors are \(2, 2, 3, 3, 5, 7, 7\); eliminating duplicates leaves us with the set \(\{2, 3, 5, 7\}\). Now we multiply all the distinct primes in the set, and call the product the radical, \(R\), of \(abc\). Here’s the punchline: The solution is admissible—it is an “\(abc\)-hit”—only if \(R \lt c\). For the example of \(a = 4, b = 45, c = 49\), this condition is not met: \(2 \times 3 \times 5 \times 7 = 210\), which of course is not less than 49.
The ABC conjecture holds that \(abc\)-hits are rare, in some special sense. Hits do exist; try working out the radical of \(a = 5, b = 27, c = 32\) to see an example. In fact, there are infinitely many \(abc\)-hits, with constructive algorithms for generating endless sequences of them. Yet, it’s part of the maddening charm of modern mathematics that objects can be both infinitely abundant and vanishingly rare at the same time. The particular kind of rareness at issue here says that \(R\) can be less than \(c\), but seldom by very much. As a measure of how much, define the power \(P\) of an \(abc\)-hit as \(\log(c) \,/\, \log(R)\). Then one version of the ABC conjecture states that there are only finitely many \(abc\)-hits with \(P \gt (1 + \epsilon)\) for any \(\epsilon \gt 0\).
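The recipe is easy to try out. The following Python sketch computes the radical by plain trial division and then applies the \(R \lt c\) test and the power formula from the text; the function names are my own, and trial division is only one (slow but transparent) way to factor.

```python
import math

def distinct_primes(n):
    """Return the set of distinct prime factors of n, found by trial division."""
    primes = set()
    p = 2
    while p * p <= n:
        while n % p == 0:
            primes.add(p)
            n //= p
        p += 1
    if n > 1:          # whatever is left over is itself prime
        primes.add(n)
    return primes

def radical(a, b, c):
    """Product of the distinct primes dividing a*b*c."""
    return math.prod(distinct_primes(a * b * c))

def is_hit(a, b, c):
    """True if a + b = c, a and b share no factor, and the radical is below c."""
    return a + b == c and math.gcd(a, b) == 1 and radical(a, b, c) < c

def power(a, b, c):
    """The power P = log(c) / log(R) of the triple."""
    return math.log(c) / math.log(radical(a, b, c))

if __name__ == "__main__":
    for triple in [(4, 45, 49), (5, 27, 32)]:
        print(triple, radical(*triple), is_hit(*triple), round(power(*triple), 3))
```

The first triple has radical 210 and fails the test; the second has radical 30, which squeaks in under 32.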
On first acquaintance, all this rigmarole about radicals seems arbitrary and baroque. Who came up with that, and why?
As I tooled along the Interstate, I tried to answer this question to my own satisfaction. I made a little progress by thinking about what kinds of numbers we might expect to produce \(abc\)-hits. Are they big or small, nearly equal or of very different magnitudes? Are they primes or composites? Do we find squares or other perfect powers among them?
Primes are always a good place to start. Can we have an \(abc\)-hit with \(a\), \(b\), and \(c\) all prime? One complication is that either \(a\) or \(b\) will have to be equal to 2, because 2 is the only even prime, and we can’t have \(a + b = c\) with all three numbers odd. But that’s all right; there are still lots of triples (and conjecturally infinitely many) of the form \(2 + b = c\) with \(b\) and \(c\) prime; they are called twin primes. However, none of them are \(abc\)-hits. It’s easy to see why: If \(a\), \(b\), and \(c\) are all prime, then their radical is simply the product \(abc\), which for numbers larger than 1 is always going to be greater than \(a + b\).
We can extend this reasoning from the primes to all squarefree numbers, that is, numbers that have no repeated prime factors. (They are called squarefree because they are not divisible by any square.) For example, \(10, 21,\) and \(31\) form a squarefree \(abc\) triple, with prime factorizations \(2 \times 5\), \(3 \times 7\), and \(31\). But they do not produce an \(abc\)-hit, because their radical \(2 \times 3 \times 5 \times 7 \times 31 = 6510\) is clearly larger than \(a + b = c = 31\). And the same argument that rules out all-prime \(abc\)-hits applies here to exclude all-squarefree hits.
These results suggest that we look in the opposite direction, at squarefull numbers—and even cubefull numbers, and so on. We want lots of repeated factors. This strategy immediately pays off in the search for \(abc\)-hits. It maximizes the sum \(a + b = c\) without overly enlarging the radical—the product of the distinct primes. The very first of all \(abc\)-hits (in any reasonable ordering) offers an example. It is \(1 + 8 = 9\), or in factored form \(1 + 2^3 = 3^2\). This is a high-power hit, with \(\log(9) \,/\, \log(6) = 1.23\). The triple with highest known power is \(2 + (3^{10} \times 109) = 23^5\), yielding \(\log(6436343) \,/\, \log(15042) = 1.63.\)
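A brute-force search makes the scarcity of hits concrete. This sketch walks through every coprime pair \(a \lt b\) with \(a + b = c\) up to a chosen limit and keeps the triples whose radical is smaller than \(c\). Ordering by \(c\) and then by \(a\) is my own choice of "reasonable ordering," and the function names are illustrative.

```python
import math

def rad(n):
    """Product of the distinct primes dividing n (rad(1) is 1)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            result *= p
            while n % p == 0:
                n //= p
        p += 1
    return result * n if n > 1 else result

def abc_hits(c_max):
    """List (a, b, c, power) for every hit with a < b and c <= c_max."""
    hits = []
    for c in range(3, c_max + 1):
        for a in range(1, (c + 1) // 2):      # guarantees a < b
            b = c - a
            if math.gcd(a, b) != 1:
                continue
            # a, b, c are pairwise coprime here, so the radical of abc
            # is the product of the three separate radicals.
            r = rad(a) * rad(b) * rad(c)
            if r < c:
                hits.append((a, b, c, math.log(c) / math.log(r)))
    return hits

if __name__ == "__main__":
    for a, b, c, p in abc_hits(99):
        print(f"{a} + {b} = {c}   power = {p:.3f}")
```

Run with a limit of 99, the search turns up only a handful of hits, beginning with \(1 + 8 = 9\). The same `rad` function applied to \(2\), \(3^{10} \times 109\), and \(23^5\) gives a radical of 15042 for the record-holding triple.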
Let’s look more closely at that first \(abc\)-hit, \(1 + 8 = 9\). Note that 8 and 9 are not just squarefull numbers; they are perfect powers. This triple is the subject of another famous conjecture. Eugène Catalan asked if \(8\) and \(9\) are the only consecutive perfect powers. Preda Mihailescu answered affirmatively in 2002. Thus we know that the equation \(1 + x^m = y^n\) has only this single solution. However, if we relax the rules just a little bit, we can find solutions to \(a + x^m = y^n\) where \(a\) has some value greater than 1. For example, there’s \(3 + 125 = 128\) (or \(3 + 5^3 = 2^7\)), which is another high-power \(abc\)-hit.
Suppose we tighten the rules instead of relaxing them and ask for solutions to \(x^n + y^n = z^n\), where the three members of the triple are all \(n\)th powers of integers. If we could find solutions of this equation with large values of \(n\), they would surely be a rich ore for high-power \(abc\)-hits. But alas, that’s the equation that Fermat’s Last Theorem tells us has no solutions in integers for \(n \gt 2\). The ABC conjecture turns this implication on its head. It says (if it can be proved) that the rarity of \(abc\)-hits implies there are no solutions to the Fermat equation.
That’s as far as I was able to get while musing behind the wheel—a vague intuition about the balance between addition and multiplication, a tradeoff between increasing the sum and reducing the radical, a hint of a connection between ABC and FLT. Not much, but a better sense of why it’s worth focusing some attention on this particular relation among numbers.
Now morning breaks over Burlington. Time to go learn something from those who are less non-expert.
Last year, a proposal to raise the limit to 10,000 characters was shouted down in a storm of very terse but intense tweets.
The 140-character limit is enforced by the Twitter software. When you compose a tweet, a counter starts at 140 and is decremented with each character you type; if the number goes negative, the Tweet button is disabled (as in the screen capture above). Based on this observation, I had long believed that every tweet was indeed a little snippet of pure text composed of no more than 140 characters. Was I naïve, or what?
My belated enlightenment began earlier this week, when I started having trouble with links embedded in tweets. Clicking on a link opened a new browser tab, but the requested page failed to load. The process got stuck waiting to connect to a URL such as https://t.co/E0R99xtQng. The “t.co” domain gave me a clue to the source of the problem. A long URL (http://bit-player.org/2016/bertrand-russell-donald-trump-and-archimedes, for example) can use up your 140-character quota in a hurry, and so twitterers long ago turned to URL-shortening services such as bit.ly and TinyURL, which allow you to substitute an abbreviated URL for the original web address. The shortening services work by redirection. When your browser issues the request “GET http://bit.ly/xyz123”, what comes back is not the web page you’re seeking but a message such as “REDIRECT http://ultimate.destination.page.com”. The browser then automatically issues a second GET request to the provided destination address.
In 2011 Twitter introduced its own shortening service, t.co. Use of this service is automatic and inescapable. That is, any link included in a tweet will be converted into a 23-character t.co URL, whether you want it to be or not, and even if it’s already shorter than 23 characters. The displayed link may appear to refer to the original URL, but when you click on it, the browser will go first to a t.co address and only afterwards to the true target. Embedded images also have t.co URLs.
A drawback of all redirection services is that they become a bottleneck and a potential point of failure for the sites that depend on them. If t.co goes down, every link posted on Twitter becomes unreachable, and every image disappears. Is that what happened earlier this week when I was having trouble following Twitter links? Probably not; a disruption of that scale would have been widely noted. Indeed, I soon discovered that the problem was quite localized: It plagued all browsers on my computer, but other machines in the household were unaffected.
When I did a web search for “t.co broken links,” I quickly discovered a long discussion of the issue in the Twitter developer forum, with 87 messages going back to 2012. Grouchy complaints are interspersed with a welter of conflicting diagnoses and inconsistent remedies. Much attention focused on Apple hardware and software (which I use). A number of contributors argued that the problem is not in the browser but somewhere upstream—in the operating system, the router, the cable interface, or even the internet service provider.
After a day or two, my problem with Twitter links went away, and I never learned the exact cause. I hate it when that happens, although I hate it more when the problem doesn’t go away. However, that’s not why I’m writing this. What I want to talk about is something I stumbled upon in the course of my troubleshooting. I found a plugin for the Google Chrome browser, Goodbye t.co, that promised to bypass t.co and thereby fix the problem. How could it do that? If t.co is not responding, or if the response is not getting through to the browser, how can code running in the browser make any difference? It seems like tinkering with your television set when the broadcaster is off the air.
The source code for Goodbye t.co is on GitHub, so I took a look. The program is just a couple dozen lines of JavaScript. What I saw there sent me running back to my Twitter feed, to examine the web page using the browser’s developer tools.
Here’s a tweet I posted a few days ago, as it is displayed by the Twitter web site. Note the link to an arXiv paper:
And here’s the HTML that encodes that tweet in the web page:
The text of the tweet (“A problem in coding theory that comes from a Samuel Beckett play: ”) amounts to 66 characters, plus 25 more for the link (“arxiv.org/abs/1608.06001 ”). But that’s not all that Twitter is sending out to my followers. Far from it. The block of HTML shown above is 751 characters, and the complete markup for this one tweet comes to just under 7,000 characters, or 50 times the nominal limit.
Take a closer look at the anchor tag in that HTML block:
The href attribute of the anchor tag is a t.co URL; that’s where the browser will go when you click the link. But, reading on, we come to a data-expanded-url that gives the final destination link in full. And then that same final destination URL appears again in the title attribute. This explains immediately how Goodbye t.co can “bypass” the t.co service. It simply retrieves the data-expanded-url and sends the browser there, without making the detour through t.co.
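The trick is simple enough to imitate outside the browser. What follows is not the plugin's code (that is JavaScript running inside the page); it is a Python sketch of the same idea, using the standard library's HTML parser to pull the data-expanded-url out of an anchor tag. The sample markup is a hypothetical reconstruction: the attribute names come from the description above, but the t.co address is just a placeholder and the real tag carries more attributes than this.

```python
from html.parser import HTMLParser

class ExpandedLinkParser(HTMLParser):
    """Collect (href, data-expanded-url) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            expanded = attributes.get("data-expanded-url")
            if expanded:
                self.links.append((attributes.get("href"), expanded))

# Hypothetical markup, for illustration only.
SAMPLE = (
    '<a href="https://t.co/E0R99xtQng" '
    'data-expanded-url="https://arxiv.org/abs/1608.06001" '
    'title="https://arxiv.org/abs/1608.06001">arxiv.org/abs/1608.06001</a>'
)

def expanded_links(html):
    """Return the list of (shortened, expanded) URL pairs found in html."""
    parser = ExpandedLinkParser()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    for short, full in expanded_links(SAMPLE):
        print(short, "->", full)
```

Nothing here ever contacts t.co; the final destination is read straight out of the page, which is why a browser plugin can work even when the redirection service is unreachable.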
I have two questions. First, if you’re going to use a shortened, redirected URL, why also include the full-length URL in the page markup? The apparent answer is: So that the web browser can show the user the true destination. This is clearly the point of the title attribute. When you hover on a link, the content of the title attribute is displayed in a “tooltip.” I’m not so sure about the purpose of the data-expanded-url attribute. It’s surely not there to help the author of Goodbye t.co. Twitter presumably has some JavaScript of its own that accesses that field.
The second question is the inverse of the first: If you’re going to include the full-length URL, why bother with the shortening-and-redirecting rigmarole? Twitter could shut down the t.co servers and doubtless save a pile of money. Those servers have to deal with all the links and images in some 200 billion tweets per year. The use of redirection doubles the number of requests and responses—that’s a lot of internet bandwidth—and introduces delays of a few hundred milliseconds (even when the service works correctly). Note that Twitter could still display a shortened URL within the text of the tweet, without requiring redirection.
Twitter’s own developer documents offer an answer to the second question:
Tens of millions of links are tweeted on Twitter each day. Wrapping these shared links helps Twitter protect users from malicious content while offering useful insights on engagement.
The promise to “protect users from malicious content” presumably means that if I link to a sufficiently sleazy site, Twitter will refuse to redirect readers there, or perhaps will just warn them of the danger. (I don’t know which because I’ve never encountered this behavior.) As for “offering useful insights on engagement,” I believe that phrase could be translated as “helping us target advertising and collect data with potential market value.” In other words, t.co is not just a cost center but also a revenue center. Every time you click on a link within a tweet, Twitter knows exactly where you’re going and can add that information to your profile.
A few months ago, Twitter announced a slight change to the 140-character rule. @handles included in the text will no longer count toward the character total, and neither will images or other media attachments. Some press reports suggested that links would also be excluded from the count, but the official announcement made no mention of links. And t.co redirection is clearly here to stay.
I can suggest two takeaway messages from this little episode in my life as an internaut.
If you want to limit the “insights on engagement” that Twitter accumulates about your activities, you might consider installing a plugin to bypass t.co redirection. There’s an ongoing argument about the wisdom and morality of such actions, focused in particular on ad-blocking software. I have my own views on this issue, but I’m not going to air them here and now.
The other small lesson I’ve learned is that using alternative URL-shortening services with Twitter is worse than pointless. Pre-shrinking the URL has no effect on the character count. It also obscures the true destination from the reader (since the title attribute is “bit.ly/whatever”). Most important, it interposes two layers of redirection, with two delays, two potential points of failure, and two opportunities to collect saleable data. Yet I still see lots of bit.ly and goo.gl links in tweets. Am I missing or misunderstanding something?
A habit of finding pleasure in thought rather than in action is a safeguard against unwisdom and excessive love of power, a means of preserving serenity in misfortune and peace of mind among worries. A life confined to what is personal is likely, sooner or later, to become unbearably painful; it is only by windows into a larger and less fretful cosmos that the more tragic parts of life become endurable.
—Bertrand Russell, “Useless” Knowledge, 1935
For Russell, mathematics was one of those windows opening on a calmer universe. So it is for me too, and for many others. When you are absorbed in solving a problem, understanding a theorem, or writing a computer program, the world’s noisy bickering is magically muted. For a little while, at least, you can hold back life’s conflicts, heartaches, and disappointments.
But Russell was no self-absorbed savant, standing aloof from the issues of his time. On the contrary, he was deeply engaged in public discourse. During World War I he took a pacifist position (and went to prison for it), and he continued to speak his peace into the Vietnam era.
Bit-player.org is meant to be a little corner of Russell’s less fretful cosmos, both for me and, I hope, for my readers. In this space I would prefer to shut out the clamor of the hustings and the marketplace. And yet there comes a time to look up from bit-playing and listen to what’s going on outside the window.
A candidate for the U.S. presidency is goading his followers to murder his opponent. Here are his words (in the New York Times transcription):
Hillary wants to abolish—essentially abolish—the Second Amendment. By the way, and if she gets to pick—if she gets to pick her judges, nothing you can do folks. Although the Second Amendment people—maybe there is, I don’t know.
A day later, Donald Trump said he was merely suggesting that gun owners might be roused to come out and vote, not that they might assassinate a president. Yeah, sure. And when Henry II of England mused, “Who will rid me of this troublesome priest?” he was just asking an idle question. But soon enough Thomas Becket was hacked to death on the floor of Canterbury Cathedral.
This is not the first time Trump has strayed beyond tasteless buffoonery into reckless incitement. But this instance is so egregious I just cannot keep quiet. His words are vile and dangerous. I have to speak out against them. We face a threat to the survival of democracy and civil society.
Mathematics, after all, is one of those luxuries we can afford only so long as the thugs do not come crashing through the door. Another great mathematician offers a lesson here, in a tale told by Plutarch. During the sack of Syracuse, according to one version of the legend, Archimedes was puzzling out a mathematical problem. He was staring at a diagram sketched in the sand when Roman soldiers came upon him. Deep in thought, he refused to turn away from his work until he had finished the proof of his theorem. One of the soldiers drew a sword and ran him through.
Why do the numbers run backwards? Could there be a connection with shotguns, whose sizes also seem to go the wrong way? A 20-gauge shotgun has a smaller bore than a 12-gauge, which in turn is smaller than a 10-gauge gun. Mere coincidence?
Answers are not hard to find. The Wikipedia article on American Wire Gauge (AWG) is a good place to start. And there’s a surprising bit of mathematical fun along the way. It turns out that American wire sizes make essential use of the 39th root of 92, a somewhat frillier number than I would have expected to find in this workaday, blue-collar context.
Wire is made by pulling a metal rod through a die—a block of hard material with a hole in it. In cross section, the hole is shaped something like a rocket nozzle, with conical walls that taper down to a narrow throat. As the rod passes through the die, the metal deforms plastically, reducing the diameter while increasing the length. But there’s a limit to this squeezing and stretching; you can’t transform a short, fat rod into a long, thin wire all in one go. On each pass through a die, the diameter is only slightly reduced—maybe by 10 percent or so. To make a fine wire, you need to shrink the thickness in stages, drawing the wire through several dies in succession. And therein lies the key to wire gauge numbers: The gauge of a wire is the number of dies it must pass through to reach its final diameter. Zero-gauge is the thickness of the original rod, without any drawing operations. Fourteen-gauge wire has been pulled through 14 dies in series.
Or at least that was how it worked back when wire-drawing was a hand craft, and nobody worried too much about exact specifications. If two wires had both been pulled through 14 dies, they would both be labeled 14-gauge, but they might well have different diameters if the dies were not identical. By the middle of the 19th century this sort of variation was becoming troublesome; it was time to adopt some standards.
The AWG standard keeps the traditional sequence of gauge numbers but changes their meaning. The gauge is no longer a count of drawing operations; instead each gauge number corresponds to a specific wire diameter. Even so, there’s an effort to keep the new standardized sizes reasonably close to what they were under the old die-counting system.
The wire-drawing process itself suggests how to do this. Each pass through a die reduces the wire diameter to some fraction of its former size, but the value of the fraction might vary a little from one die to the next. The standard simply decrees that the fraction is exactly the same in all cases. In other words, for every pair of adjacent gauge numbers, the corresponding wire diameters have the same ratio, \(R\).
What remains is to work out the value of \(R\). If we start with \(d_{36} = 0.005\) and multiply by \(R\), we’ll get \(d_{35}\); then, multiplying \(d_{35}\) by \(R\) yields \(d_{34}\), and so on. Continuing in this way, after multiplying by \(R\) \(39\) times, we should arrive at \(d_{-3} = 0.46.\) This iterative process can be summarized as:
\[\frac{d_{-3}}{d_{36}} = R^{39}.\]
Filling in the numeric values, we get:
\[\frac{0.46}{0.005} = 92 = R^{39}, \quad \textrm{and thus}\quad R = \sqrt[39]{92}.\]
And there the number lies before us, the \(39\)th root of \(92\). The numerical value is about \(1.122932\), with \(1/R \approx 0.890526\).
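A few lines of Python confirm the arithmetic. The sketch computes \(R\) directly and then repeats the 39 multiplications described above, starting from \(d_{36} = 0.005\); the function name is my own.

```python
R = 92 ** (1 / 39)   # ratio of diameters for adjacent gauge numbers

def climb_from_36(steps=39, d36=0.005):
    """Multiply the 36-gauge diameter by R the given number of times."""
    d = d36
    for _ in range(steps):
        d *= R
    return d

if __name__ == "__main__":
    print(f"R   = {R:.6f}")
    print(f"1/R = {1 / R:.6f}")
    print(f"d after 39 steps = {climb_from_36():.6f}")
```

Up to floating-point rounding, the 39 steps land on 0.46, as they must.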
With this fact in hand we can now write down a formula that gives the AWG gauge number \(G\) as a function of wire diameter \(d\) in inches:
\[G(d) = -39 \log_{92} \frac{d}{0.005} + 36.\]
That’s a fairly bizarre-looking formula, with base-92 logarithms and a bunch of arbitrary constants floating around. On the other hand, at least it’s a genuine mathematical function, with a domain covering all the positive real numbers. It’s also smooth and invertible. That’s more than you can say for some other standards, such as the British Imperial Wire Gauge, which pastes together several piecewise linear segments.
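Here is the formula as a Python function, along with its inverse, which I obtained by solving \(G(d)\) for \(d\). The function names are mine, and diameters are in inches, as in the formula.

```python
import math

def gauge(d):
    """AWG gauge number for a wire of diameter d inches."""
    return -39 * math.log(d / 0.005, 92) + 36

def diameter(g):
    """Inverse of gauge(): diameter in inches for gauge number g."""
    return 0.005 * 92 ** ((36 - g) / 39)

if __name__ == "__main__":
    for g in (-3, 0, 14, 36):
        d = diameter(g)
        print(f"gauge {g:>3}: {d:.4f} in  (round trip: {gauge(d):.3f})")
```

The two anchor points come out as expected (gauge 36 at 0.005 inch, gauge \(-3\) at 0.46 inch), and 14-gauge wire works out to about 0.064 inch. Because the function is defined for every positive diameter, nothing stops you from asking for gauge 14.5, even if no hardware store will sell it to you.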
Who came up with the rule of \(\sqrt[39]{92}\)? As far as I can tell it was Lucian Sharpe, of Brown and Sharpe, a maker of precision instruments and machine tools in Providence, Rhode Island. A history of the company published in 1949 gives this account:
Another activity begun in the [1850s] was the production of accurate gages. The brass business of Connecticut, centered in the Naugatuck Valley, required sheet metal and wire gauges for measuring their products. Mr. Sharpe, with his methodical mind, conceived the idea of producing sizes of wire in a regular progression, choosing a geometric series as best adapted to these needs. Such gages as were in use prior to this time were the product of English manufacture and were very irregular in their sizes.
The first Brown and Sharpe wire gauge was produced in 1857 and later became the basis of an American standard, which is now administered by ASTM.
Wire gauges are not the only numbers defined by a weird-and-wonderful root-taking procedure. The equal-tempered scale of music theory is based on the 12th root of 2. A musical octave represents a doubling of frequency, and the scale divides this interval into 12 semitones. In the equal-tempered version of the scale, any two adjacent semitones differ by a ratio of \(\sqrt[12]{2}\), or about \(1.05946\). It’s worth noting that instruments were being tuned to this scale well before the invention of logarithms. I assume it was done by ear or perhaps by geometry, not by algebra. Around 1600 Simon Stevin did attempt to calculate numerical values for the pitch intervals by decomposing 12th roots into combinations of square and cube roots; his results were not flawless. What would he have done with 39th roots?
Another example of a backward-running logarithmic progression is the magnitude scale for the brightness of stars and other celestial objects. For the astronomers, the magic number is the fifth root of \(100\), or about \(2.511886\); if two stars differ by one unit of magnitude, this is their brightness ratio. A difference of five magnitudes therefore works out to a hundredfold brightness ratio. Brighter bodies have smaller magnitudes. The star Vega defines magnitude \(0\); the sun has magnitude \(-27\); the faintest stars visible without a telescope are at magnitude \(6\) or \(7\).
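The magnitude rule can be stated as a one-line function. This is a small Julia sketch (assuming a 1.x release); the name brightness_ratio is made up for the illustration, and the sun's magnitude is taken as the rounded value of -27 quoted above.

```julia
# Brightness ratio corresponding to a difference of delta_m magnitudes.
# One magnitude is a factor of the fifth root of 100.
brightness_ratio(delta_m) = 100^(delta_m / 5)

one_step = brightness_ratio(1)            # the fifth root of 100
five_steps = brightness_ratio(5)          # a hundredfold ratio
sun_vs_vega = brightness_ratio(0 - (-27)) # Vega at 0, the sun at -27
```

With these rounded inputs the sun comes out some tens of billions of times brighter than Vega.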
The idea of stellar magnitudes is ancient, but the numerical scheme in current use was developed by the British/Indian astronomer N. R. Pogson in 1856. That was just a year before Sharpe came up with his wire gauge scale. Could there be a connection? It would make a nice story if we could find some timely account of Pogson’s work that Sharpe might plausibly have read (maybe in Scientific American, founded 1845), but that’s a pure flight of fancy for now.
And what about those shotguns? Are their gauges also governed by some sort of logarithmic law? No, the numerical similarity of gauges for wires and shotguns really is nothing but coincidence. The shotgun law is not logarithmic but reciprocal. Wikipedia explains:
The gauge of a firearm is a unit of measurement used to express the diameter of the barrel. Gauge is determined from the weight of a solid sphere of lead that will fit the bore of the firearm, and is expressed as the multiplicative inverse of the sphere’s weight as a fraction of a pound, e.g., a one-twelfth pound ball fits a 12-gauge bore. Thus there are twelve 12-gauge balls per pound, etc. The term is related to the measurement of cannon, which were also measured by the weight of their iron round shot; an 8 pounder would fire an 8 lb (3.6 kg) ball.
Addendum 2016-08-08: Leon Harkleroad has brought to my attention his excellent article on “Tuning with Triangles” (College Mathematics Journal, Vol. 39, No. 5 (Nov. 2008), pp. 367–373). He describes a simple geometric procedure that Vincenzo Galilei (father of Galileo) used for fretting stringed instruments. In essence it takes \(18/17 \approx 1.05882\) as an approximation to \(\sqrt[12]{2} \approx 1.05946\).
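To get a feeling for how close the 18/17 rule comes, one can compound it over the twelve semitones of an octave. The Julia sketch below (1.x assumed) treats 18/17 as a per-semitone ratio; the variable names are illustrative, and the cents measure (1200 times the base-2 logarithm of a ratio) is the usual musical unit rather than anything taken from the article.

```julia
galilei_step = 18 / 17          # the fretting approximation
equal_step = 2^(1 / 12)         # the exact equal-tempered semitone

# Twelve Galilei steps fall slightly short of a true octave (a ratio of 2).
galilei_octave = galilei_step^12

# Shortfall in cents; negative means flat.
cents_off = 1200 * log2(galilei_octave / 2)
```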
When I first started playing with computers, back in the Pleistocene, writing a few lines of code and watching the machine carry out my instructions was enough to give me a little thrill. “Look! It can count to 10!” Today, learning a new programming language is my way of reviving that sense of wonder.
Lately I’ve been learning Julia, which describes itself as “a high-level, high-performance dynamic programming language for technical computing.” I’ve been dabbling with Julia for a couple of years, but this spring I completed my first serious project—my first attempt to do something useful with the language. I wrote a bunch of code to explore correlations between consecutive prime numbers mod m, inspired by the recent paper of Lemke Oliver and Soundararajan. The code from that project, wrapped up in a Jupyter notebook, is available on GitHub. A bit-player article published last month presented the results of the computation. Here I want to say a few words about my experience with the language.
I also discussed Julia a year ago, on the occasion of JuliaCon 2015, the annual gathering of the Julia community. Parts of this post were written at JuliaCon 2016, held at MIT June 21–25.
This document is itself derived from a Jupyter notebook, although it has been converted to static HTML—meaning that you can only read it, not run the code examples. (I’ve also done some reformatting to save space and improve appearance.) If you have a working installation of Julia and Jupyter, I suggest you download the fully interactive notebook from GitHub, so that you can edit and run the code. If you haven’t installed Julia, download the notebook anyway. You can upload it to JuliaBox.org and run it online.
A 2014 paper by the founders of the Julia project sets an ambitious goal. To paraphrase: There are languages that make programming quick and easy, and there are languages that make computation fast and efficient. The two sets of languages have long been disjoint. Julia aims to fix that. It offers high-level programming, where algorithms are expressed succinctly and without much fussing over data types and memory allocation. But it also strives to match the performance of lower-level languages such as Fortran and C. Achieving these dual goals requires attention both to the design of the language itself and to the implementation.
The Julia project was initiated by a small group at MIT, including Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah, and has since attracted about 500 contributors. The software is open source, hosted at GitHub.
In my kind of programming, the end product is not the program itself but the answers it computes. As a result, I favor an incremental and interactive style. I want to write and run small chunks of code—typically individual procedures—without having to build a lot of scaffolding to support them. For a long time I worked mainly in Lisp, although in recent years I’ve also written a fair amount of Python and JavaScript. I mention all this because my background inevitably colors my judgment of programming languages and environments. If you’re a software engineer on a large team building large systems, your needs and priorities are surely different from mine.
Enough generalities. Here’s a bit of Julia code, plucked from my recent project:
# return the next prime after x
# define the function...
function next_prime(x)
    x += iseven(x) ? 1 : 2
    while !isprime(x)
        x += 2
    end
    return x
end
# and now invoke it
next_prime(10000000) ⇒ 10000019
Given an integer x, we add 1 if x is even and otherwise add 2, thereby ensuring that the new value of x is odd. Now we repeatedly check the isprime(x) predicate, or rather its negation !isprime(x). As long as x is not a prime number, we continue incrementing by 2; when x finally lands on a prime value (it must do so eventually, according to Euclid), the while loop terminates and the function returns the prime value of x.
In this code snippet, note the absence of punctuation: No semicolons separate the statements, and no curly braces delimit the function body. The syntax owes a lot to MATLAB, most conspicuously the use of the end keyword to mark the end of a block, without any corresponding begin. The resemblance to MATLAB is more than coincidence; some of the key people in the Julia project have a background in numerical analysis and linear algebra, where MATLAB has long been a standard tool. One of those people is Alan Edelman. I remember asking him, 15 years ago, “How do you compute the eigenvalues of a matrix?” He explained: “You start up MATLAB and type eig.” The same incantation works in Julia.
The sparse punctuation and the style of indentation also make Julia code look a little like Python, but in this case the similarity is only superficial. In Python, indentation determines the scope of statements, but in Julia indentation is merely an aid to the human reader; the program would compute the same result even if all the lines were flush left.
The next_prime function above shows a while loop. Here’s a for loop:
# for-loop factorial
function factorial_for(n)
    fact = 1
    for i in 1:n     # i takes each value in 1, 2, ..., n
        fact *= i    # equivalent to fact = fact * i
    end
    return fact
end
factorial_for(10) ⇒ 3628800
The same idea can be expressed more succinctly and idiomatically:
# factorial via range product
function factorial_range(n)
    return prod(1:n)
end
factorial_range(11) ⇒ 39916800
Here 1:n denotes the numeric range \(1, 2, 3, …, n\), and the prod function returns the product of all the numbers in the range. Still too verbose? Using an abbreviated style of function definition, it becomes a true one-liner:
fact_range(n) = prod(1:n)
fact_range(12) ⇒ 479001600
As an immigrant from the land of Lisp, I can’t in good conscience discuss the factorial function without also presenting a recursive version:
# recursive definition of factorial
function factorial_rec(n)
    if n < 2
        return 1
    else
        return n * factorial_rec(n - 1)
    end
end
factorial_rec(13) ⇒ 6227020800
You can save some keystrokes by replacing the if... else... end statement with a ternary operator. The syntax is
predicate ? consequent : alternative.
If predicate evaluates to true, the expression returns the value of consequent; otherwise it returns the value of alternative. With this device, the recursive factorial function looks like this:
# recursive factorial with ternary operator
function factorial_rec_ternary(n)
    n < 2 ? 1 : n * factorial_rec_ternary(n - 1)
end
factorial_rec_ternary(14) ⇒ 87178291200
One sour note on the subject of recursion: The Julia compiler does not perform tail-call conversion, an optimization that gives recursive procedure definitions the same space and time efficiency as iterative structures such as while loops. Consider this pair of daisy-plucking procedures:
function she_loves_me(petals)
    petals == 0 ? true : she_loves_me_not(petals - 1)
end
function she_loves_me_not(petals)
    petals == 0 ? false : she_loves_me(petals - 1)
end
she_loves_me(1001) ⇒ false
These are mutually recursive definitions: Control passes back and forth between them until one or the other terminates when the petals are exhausted. The function works correctly on small values of petals. But let’s try a more challenging exercise:
she_loves_me(1000000)
LoadError: StackOverflowError:
while loading In[8], in expression starting on line 1
in she_loves_me at In[7]:2 (repeats 80000 times)
The stack overflows because Julia is pushing a procedure-call record onto the stack for every invocation of either she_loves_me or she_loves_me_not. These records will never be needed; when the answer (true or false) is finally determined, it can be returned directly to the top level, without having to percolate up through the cascade of stack frames. The technique for eliminating such unneeded stack frames has been well known for more than 30 years. Implementing tail-call optimization has been discussed by the Julia developers but is “not a priority.” For me this is not a catastrophic defect, but it’s a disappointment. It rules out a certain style of reasoning that in some cases is the most direct way of expressing mathematical ideas.
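Without tail-call elimination, the usual workaround is to rewrite the recursion by hand as a loop. Here is a minimal sketch of the daisy-plucking pair in that style (Julia 1.x assumed); the name she_loves_me_iter is invented for the illustration.

```julia
# Iterative rewrite: a loop replaces the chain of mutual tail calls,
# so the stack depth stays constant no matter how many petals there are.
function she_loves_me_iter(petals)
    loves = true            # the answer when no petals remain
    while petals > 0
        loves = !loves      # plucking a petal hands off to the other case
        petals -= 1
    end
    return loves
end
```

This version handles a million petals without complaint, but it also shows what is lost: the two-procedure structure that mirrored the children's rhyme has been flattened into a single flag.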
Given Julia’s heritage in linear algebra, it’s no surprise that there are rich facilities for describing matrices and other array-like objects. One handy tool is the array comprehension, which is written as a pair of square brackets surrounding a description of what the compiler should compute to build the array.
A = [x^2 for x in 1:5] ⇒ [1, 4, 9, 16, 25]
The idea of comprehensions comes from the set-builder notation of mathematics. It was adopted in the programming language SETL in the 1960s, and it has since wormed its way into several other languages, including Haskell and Python. But Julia’s comprehensions are unusual: Whereas Python comprehensions are limited to one-dimensional vectors or lists, Julia comprehensions can specify multidimensional arrays.
B = [x^y for x in 1:3, y in 1:3] ⇒
3x3 Array{Int64,2}:
1 1 1
2 4 8
3 9 27
Multidimensionality comes at a price. A Python comprehension can have a filter clause that selects some of the array elements and excludes others. If Julia had such comprehension filters, you could generate an array of prime numbers with an expression like this: [x for x in 1:100 if isprime(x)]. But adding filters to multidimensional arrays is problematic; you can’t just pluck values out of a matrix and keep the rest of the structure intact. Nevertheless, it appears a solution is in hand. After three years of discussion, a fix has recently been added to the code base. (I have not yet tried it out.)
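For reference, here is what a filtered comprehension looks like in Julia releases later than the one used for this notebook (as far as I know, from version 0.5 on; the sketch assumes 1.x). The helper is_prime_trial is a naive stand-in written for this example so that it does not depend on any particular primality library.

```julia
# Naive trial-division primality test, defined here only to keep
# the example self-contained.
function is_prime_trial(n::Integer)
    n < 2 && return false
    d = 2
    while d * d <= n
        n % d == 0 && return false
        d += 1
    end
    return true
end

# A one-dimensional comprehension with a filter clause.
primes_to_30 = [x for x in 1:30 if is_prime_trial(x)]
```

The result of a filtered comprehension is a flat one-dimensional vector, which sidesteps the question of what shape a filtered matrix ought to have.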
The phrase multiple dispatch sounds like something the 911 operator might do in response to a five-alarm fire, but in fact it is a distinctive core idea in Julia. Understanding what it is and why you might care about it requires a bit of context.
In almost all programming languages, an expression along the lines of x * y multiplies x by y, whether x and y are integers, floating-point numbers, or maybe even complex numbers or matrices. All of these operations qualify as multiplication, but the underlying algorithms (the ways the bits are twiddled to compute the product) are quite different. Thus the * operator is polymorphic: It’s a single symbol that evokes many different actions. One way to handle this situation is to write a single big function, call it mul(x, y), that has a chain of if... then... else statements to select the right multiplication algorithm for each x, y pair. If you want to multiply all possible combinations of \(n\) kinds of numbers, the if statement needs \(n^2\) branches. Maintaining that monolithic mul procedure can become a headache. Suppose you want to add a new type of number, such as quaternions. You would have to patch in new routines throughout the existing mul code.
When object-oriented programming (OOP) swept through the world of computing in the 1980s, it offered an alternative. In an object-oriented setting, each class of numbers has its own set of multiplication procedures, or methods. Instead of one universal mul function, there’s a mul method for integers, another mul for floats, and so on. To multiply x by y, you don’t write mul(x, y) but rather something like x.mul(y). The object-oriented programmer thinks of this protocol as sending a message to the x object, asking it to apply its mul method to the y argument. You can also describe the scheme as a single-dispatch mechanism. The dispatcher is a behind-the-scenes component that keeps track of all methods with the same name, such as mul, and chooses which of those methods to invoke based on the data type of the single argument x. (The x object still needs some internal mechanism to examine the type of y to know which of its own methods to apply.)
Multiple dispatch is a generalization of single dispatch. The dispatcher looks at all the arguments and their types, and chooses the method accordingly. Once that decision is made, no further choices are needed. There might be separate mul methods for combining two integers, for an integer and a float, for a float and a complex, etc. Julia has 138 methods linked to the multiplication operator *, including a few mildly surprising combinations. (Quick, what’s the value of false * 42?)
# Typing "methods(f)" for any function f, or for an operator
# such as "*", produces a list of all the methods and their
# type signatures. The links take you to the source code.
methods(*) ⇒
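To make the mechanism concrete, here is a toy generic function with one method per combination of argument types. It is only a sketch (Julia 1.x assumed), and the function name combine is hypothetical; the real * operator is organized along similar lines but with far more methods.

```julia
# A hypothetical generic function with four methods. The dispatcher picks
# the most specific method that matches the types of BOTH arguments.
combine(x::Integer, y::Integer) = "integer * integer"
combine(x::Integer, y::AbstractFloat) = "integer * float"
combine(x::AbstractFloat, y::Integer) = "float * integer"
combine(x::Number, y::Number) = "generic number * number"

method_count = length(methods(combine))

# Bool is a subtype of Integer, so false takes part in arithmetic
# and behaves as a zero.
bool_product = false * 42
```

A call such as combine(2.0, 3.0) matches none of the specialized signatures and falls through to the generic Number method, which is how a new numeric type can get sensible default behavior without touching the existing methods.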
Before Julia came along, the best-known example of multiple dispatch was the Common Lisp Object System, or CLOS. As a Lisp guy, I’m familiar with CLOS, but I’ve seldom found it useful; it’s too heavy a hammer for most of my little nails. But whereas CLOS is an optional add-on to Lisp, multiple dispatch is deeply embedded in the infrastructure of Julia. You can’t really ignore it.
Multiple dispatch encourages a style of programming in which you write an abundance of short, single-purpose methods that handle specific combinations of argument types. This is surely an improvement over writing one huge function with an internal tangle of spaghetti logic to handle dozens of special cases. The Julia documentation offers this example:
# roll-your-own method dispatch
function norm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return max(svd(A)[2])
    else
        error("norm: invalid argument")
    end
end
# let the compiler do method dispatch
norm(x::Vector) = sqrt(real(dot(x,x)))
norm(A::Matrix) = max(svd(A)[2])
# Note: The x::y syntax declares that variable x will have
# a value of type y.
Splitting the procedure into two methods makes the definition clearer and more concise, and it also promises better performance. There’s no need to crawl through the if... else... chain at runtime; the appropriate method is chosen at compile time.
That example seems pretty compelling. Edelman gives another illuminating demo, defining tropical arithmetic in a few lines of code. (In tropical arithmetic, plus is min and times is plus.)
Unfortunately, my own attempts to structure code in this way have not been so successful. Take the next_prime(x) function shown at the beginning of this notebook. The argument x might be either an ordinary machine integer (Int, which is an alias for Int64 on 64-bit systems) or a BigInt, a numeric type accommodating integers of unbounded size. So I tried writing separate methods next_prime(x::Int) and next_prime(x::BigInt). The result, however, was annoying code duplication: The bodies of those two methods were identical. Furthermore, splitting the two methods yielded no performance gain; the compiler was able to produce efficient code without any type annotations at all.
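One way to avoid the duplication while still documenting the intended argument types is to annotate with the abstract type Integer, which covers both Int and BigInt. The sketch below assumes Julia 1.x; next_prime_generic and the trial-division helper is_prime_trial are names invented here, the latter standing in for the library isprime so that the example is self-contained.

```julia
# Naive primality test that works for any Integer subtype.
function is_prime_trial(n::Integer)
    n < 2 && return false
    d = 2
    while d * d <= n
        n % d == 0 && return false
        d += 1
    end
    return true
end

# A single method body serves Int, BigInt, and every other Integer type.
# The compiler still generates specialized code for each concrete type.
function next_prime_generic(x::Integer)
    x += iseven(x) ? 1 : 2
    while !is_prime_trial(x)
        x += 2
    end
    return x
end

small = next_prime_generic(10000000)   # an Int in, an Int out
large = next_prime_generic(big(10))    # a BigInt in, a BigInt out
```

The annotation adds no speed; its only effects are to reject non-integer arguments and to tell the reader what is expected.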
I remain somewhat befuddled about how best to exploit multiple dispatch. Suppose you are creating a graphics program that can plot either data from an array or a mathematical function. You could define a generic plot procedure with two methods:
function plot(data::Array)
    # ... code for array plotting
end
function plot(fn::Function)
    # ... code for function plotting
end
With these definitions, calls to plot([2, 4, 6, 8]) and plot(sin) would automatically be routed to the appropriate method. However, making the two methods totally independent is not quite right either. They both need to deal with scales, grids, tickmarks, labels, and other graphics accoutrements. I’m still learning how to structure such code. No doubt it will become clearer with experience; in the meantime, advice is welcome.
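One possible arrangement, offered as a sketch rather than a settled answer, is to keep each method thin and have both delegate to a shared helper that owns the scales, grids, and labels. Everything named here (plot_sketch, render_points, the keyword arguments) is hypothetical, and the helper merely reports data ranges instead of drawing anything. Julia 1.x is assumed.

```julia
# Shared helper: in a real program the common graphics machinery would
# live here. This stand-in just summarizes the points it was given.
function render_points(xs, ys)
    return (xrange = extrema(xs), yrange = extrema(ys), npoints = length(xs))
end

# Array method: the x coordinates are simply the array indices.
plot_sketch(data::AbstractArray) =
    render_points(collect(eachindex(data)), data)

# Function method: sample the function on an interval, then hand the
# samples to the same helper.
function plot_sketch(fn::Function; from = 0.0, to = 1.0, npoints = 101)
    xs = range(from; stop = to, length = npoints)
    return render_points(xs, fn.(xs))
end
```

With this layout the dispatch decision only determines how raw input becomes a list of points; everything after that is written once.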
Multiple dispatch is not a general pattern-matching mechanism. It discriminates only on the basis of the type signature of a function’s arguments, not on the values of those arguments. In Haskell—a language that does have pattern matching—you can write a function definition like this:
factorial 0 = 1
factorial n = n * factorial (n - 1)
That won’t work in Julia. (Actually, there may be a baroque way to do it, but it’s not recommended.) (And I’ve just learned there’s a pattern-matching package.) (And another.)
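One candidate for a baroque approach (an assumption on my part about what qualifies) is to lift the value into the type system with Val, so that the dispatcher can see it. The sketch assumes Julia 1.x, and the name factorial_val is invented. It is shown to illustrate the limits of type-based dispatch, not as a recommended style.

```julia
# Val(n) wraps the value n in a type parameter, so each n is its own type.
# The first method matches only Val{0}; the second matches any other Val.
factorial_val(::Val{0}) = 1
factorial_val(::Val{N}) where {N} = N * factorial_val(Val(N - 1))

result = factorial_val(Val(5))
```

Each distinct argument creates a new type and triggers fresh compilation, which is a good part of why this trick is discouraged for ordinary numeric work.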
In CLOS, multiple dispatch is an adjunct to an object-oriented programming system; in Julia, multiple dispatch is an alternative to OOP, and perhaps a repudiation of it. OOP is a noun-centered protocol: Objects are things that own both data and procedures that act on the data. Julia is verb-centered: Procedures (or functions) are the active elements of a program, and they can be applied to any data, not just their own private variables. The Julia documentation notes:
Multiple dispatch is particularly useful for mathematical code, where it makes little sense to artificially deem the operations to “belong” to one argument more than any of the others: does the addition operation in x + y belong to x any more than it does to y? The implementation of a mathematical operator generally depends on the types of all of its arguments. Even beyond mathematical operations, however, multiple dispatch ends up being a powerful and convenient paradigm for structuring and organizing programs.
I agree about the peculiar asymmetry of asking x to add y to itself. And yet that’s only one small aspect of what object-oriented programming is all about. There are many occasions when it’s really handy to bundle up related code and data in an object, if only for the sake of tidiness. If I had been writing my primes program in an object-oriented language, I would have created a class of objects to hold various properties of a sequence of consecutive primes, and then instantiated that class for each sequence I generated. Some of the object fields would have been simple data, such as the length of the sequence. Others would have been procedures for calculating more elaborate statistics. Some of the data and procedures would have been private, accessible only from within the object.
No such facility exists in Julia, but there are other ways of addressing some of the same needs. Julia’s user-defined composite types look very much like objects in some respects; they even use the same dot notation for field access that you see in Java, JavaScript and Python. A type definition looks like this.
type Location
    latitude::Float64
    longitude::Float64
end
Given a variable of this type, such as Paris = Location(48.8566, 2.3522), you can access the two fields as Paris.latitude and Paris.longitude. You can even have a field of type Function inside a composite type, as in this declaration:
type Location
    latitude::Float64
    longitude::Float64
    distance_to_pole::Function
end
Then, if you store an appropriate function in that field, you can invoke it as Paris.distance_to_pole(). However, the function has no special access to the other fields of the type; in particular, it doesn’t know which Location it’s being called from, so it doesn’t work like an OOP method.
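The verb-centered alternative is to define the function outside the type and pass the location in explicitly. Here is a minimal sketch, assuming Julia 1.x, where the type keyword used in this notebook has been replaced by mutable struct (with plain struct for immutable types). The Earth radius constant and the great-circle formula are assumptions made for the illustration.

```julia
# An immutable struct is enough here, since the fields never change.
struct Location
    latitude::Float64    # degrees north
    longitude::Float64   # degrees east
end

const EARTH_RADIUS_KM = 6371.0   # assumed mean radius, for illustration

# The location is an ordinary argument; there is no implicit this or self.
# Distance to the North Pole is measured along a meridian.
distance_to_pole(loc::Location) =
    EARTH_RADIUS_KM * deg2rad(90 - loc.latitude)

paris = Location(48.8566, 2.3522)
paris_to_pole = distance_to_pole(paris)
```

The call reads distance_to_pole(paris) rather than paris.distance_to_pole(), and the same generic function could later gain methods for other types without any change to Location.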
Julia also has modules, which are essentially private namespaces. Only variables that are explicitly marked as exported can be seen from outside a module. A module can encapsulate both code and data, and so it is probably the closest approach to objects as a mechanism of program organization. But there’s still nothing like the this or self keyword of true object-oriented languages.
Is it brave or foolish, at this point in the history of computer science, to turn your back on the object-oriented paradigm and try something else? It’s surely bold. Two or three generations of programmers have grown up on a steady diet of OOP. On the other hand, we have not yet found the One True Path to software wisdom, so there’s surely room for exploring other corners of the ecosystem.
A language for “technical computing” had better have strong numerical abilities. Julia has the full gamut of numeric types, with integers of various fixed sizes (from 8 bits to 128 bits), both signed and unsigned, and a comparable spectrum of floating-point precisions. There are also BigInts and BigFloats that expand to accommodate numbers of any size (up to the bounds of available memory). And you can do arithmetic with complex numbers and exact rationals.
This is a well-stocked toolkit, suitable for many kinds of computations. However, the emphasis is on fast floating-point arithmetic with real or complex values, approximated at machine precision. If you want to work in number theory, combinatorics, or cryptography—fields that require exact integers and rational numbers of unbounded size—the tools are provided, but using them requires some extra contortions.
To my taste, the most elegant implementation of numeric types is found in the Scheme programming language, a dialect of Lisp. The Scheme numeric “tower” begins with integers at its base. The integers are a subset of the rational numbers, which in turn are a subset of the reals, which are a subset of the complex numbers. Alongside this hierarchy, and orthogonal to it, Scheme also distinguishes between exact and inexact numbers. Integers and rationals can be represented exactly in a computer system, but many real and complex values can only be approximated, usually by floating-point numbers. In doing basic arithmetic, the Scheme interpreter or compiler preserves exactness. This is easy when you’re adding, subtracting, or multiplying integers, since the result is also invariably an integer. With division of integers, Scheme returns an integer when possible (e.g., \(15 \div 3 = 5\)) and otherwise an exact rational (\(4 \div 6 = 2/3\)). Some Schemes go a step further and return exact results for operations such as square root, when it’s possible to do so (\(\sqrt{4/9} = 2/3\); but \(\sqrt{5} = 2.236068\); the latter is inexact). These are numbers you can count on; they mostly obey well-known principles of mathematics, such as \(\frac{m}{n} \times \frac{n}{m} = 1\).
Julia has the same tower of numeric types, but it is not so scrupulous about exact and inexact values. Consider this series of results:
15 / 3 ⇒ 5.0
4 / 6 ⇒ 0.6666666666666666
(15 / 17) * (17 / 15) ⇒ 0.9999999999999999
Even when the result of dividing two integers is another integer, Julia converts the quotient to a floating-point value (signaled by the presence of a decimal point). We also get floats instead of exact rationals for noninteger quotients. Because exactness is lost, mathematical identities become unreliable.
I think I know why the Julia designers chose this path. It’s all about type stability. The compiler can produce faster code if the output type of a method is always the same. If division of integers could yield either an integer or a float, the result type would not be known until runtime. Using exact rationals rather than floats would impose a double penalty: Rational arithmetic is much slower than (hardware-assisted) floating point.
Given the priorities of the Julia project, consistent floating-point quotients were probably the right choice. I’ll concede that point, but it still leaves me grumpy. I’m tempted to ask, “Why not just make all numbers floating point, the way JavaScript does?” Then all arithmetic operations would be type stable by default. (This is not a serious proposal. I deplore JavaScript’s all-float policy.)
Julia does offer the tools to build your own procedures for exact arithmetic. Here’s one way to conquer divide:
function xdiv(x::Int, y::Int)
    q = x // y                    # yields rational quotient
    den(q) == 1 ? num(q) : q      # return int if possible
end
xdiv(15, 5) ⇒ 3
xdiv(15, 9) ⇒ 5//3
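In later Julia releases the accessor names are spelled out in full: numerator and denominator take the place of num and den. The following sketch repeats the same idea under that assumption (Julia 1.x) and also checks that the reciprocal identity survives when the arithmetic stays rational.

```julia
# Exact division: an integer when the quotient is whole, else a rational.
function xdiv(x::Integer, y::Integer)
    q = x // y                                      # exact rational quotient
    return denominator(q) == 1 ? numerator(q) : q   # integer if possible
end

whole = xdiv(15, 5)
fraction = xdiv(15, 9)

# With rationals, (m/n) * (n/m) is exactly 1.
identity_holds = (15 // 17) * (17 // 15) == 1
```

Note that xdiv is deliberately not type stable: the two calls above return values of different types, which is exactly the trade-off discussed earlier.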
Another numerical quirk gives me the heebie-jeebies. Let’s look again at one of those factorial functions defined above (it doesn’t matter which one).
# run the factorial function f on integers from 1 to limit
function test_factorial(f::Function, limit::Int)
    for n in 1:limit
        @printf("n = %d, f(n) = %d\n", n, f(n))
    end
end
test_factorial(factorial_range, 25) ⇒
n = 1, f(n) = 1
n = 2, f(n) = 2
n = 3, f(n) = 6
n = 4, f(n) = 24
n = 5, f(n) = 120
n = 6, f(n) = 720
n = 7, f(n) = 5040
n = 8, f(n) = 40320
n = 9, f(n) = 362880
n = 10, f(n) = 3628800
n = 11, f(n) = 39916800
n = 12, f(n) = 479001600
n = 13, f(n) = 6227020800
n = 14, f(n) = 87178291200
n = 15, f(n) = 1307674368000
n = 16, f(n) = 20922789888000
n = 17, f(n) = 355687428096000
n = 18, f(n) = 6402373705728000
n = 19, f(n) = 121645100408832000
n = 20, f(n) = 2432902008176640000
n = 21, f(n) = -4249290049419214848
n = 22, f(n) = -1250660718674968576
n = 23, f(n) = 8128291617894825984
n = 24, f(n) = -7835185981329244160
n = 25, f(n) = 7034535277573963776
Everything is copacetic through \(n = 20\), and then suddenly we enter an alternative universe where multiplying a bunch of positive integers gives a negative result. You can probably guess what’s happening here. The computation is being done with 64-bit, two’s-complement signed integers, with the most-significant bit representing the sign. When the magnitude of the number exceeds \(2^{63} - 1\), the bit pattern is interpreted as a negative value; then, at \(2^{64}\), it crosses zero and becomes positive again. Essentially, we’re doing arithmetic modulo \(2^{64}\), with an offset.
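The claim about modular arithmetic can be checked directly by comparing the wrapped 64-bit product with the exact value computed in arbitrary precision. This is a sketch assuming Julia 1.x on a 64-bit machine; factorial_wrapping is a name invented for the illustration.

```julia
# Multiply 1..n in plain Int64 arithmetic, which wraps silently on overflow.
function factorial_wrapping(n)
    fact = Int64(1)
    for i in 1:n
        fact *= i
    end
    return fact
end

wrapped = factorial_wrapping(21)   # the negative value in the table above
exact = factorial(big(21))         # arbitrary-precision result

# Reduce the exact value modulo 2^64, shifted into the signed range
# from -2^63 to 2^63 - 1. This is the "offset" mentioned in the text.
reduced = mod(exact + big(2)^63, big(2)^64) - big(2)^63

# The simplest wraparound of all: one past the largest Int64.
wraps_to_min = typemax(Int64) + 1 == typemin(Int64)
```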
When a number grows beyond the space allotted for it, something has to give. In some languages the overflowing integer is gracefully converted to a bignum format, so that it can keep growing without constraint. Most Lisps are in this family; so is Python. JavaScript, with its all-floating-point arithmetic, gives approximate answers beyond \(20!\), and beyond \(170!\) all results are deemed equal to Infinity. In several other languages, programs halt with an error message on integer overflow. C is one language that has the same wraparound behavior as Julia. (Actually, the C standard says that signed-integer overflow is “undefined,” which means the compiler can do anything it pleases; but the C compiler I just tested does a wraparound, and I think that’s the common policy.)
The Julia documentation includes a thorough and thoughtful discussion of integer overflow, defending the wraparound strategy by showing that all the alternatives are unacceptable for one reason or another. But even if you go along with that conclusion, you might still feel that wraparound is also unacceptable.
Casual disregard of integer overflow has produced some notably stinky bugs, causing everything from video-game glitches in Pac-Man and Donkey Kong to the failure of the first Ariane 5 rocket launch. Dietz, Li, Regehr, and Adve have surveyed bugs of this kind in C programs, finding them even in a module called SafeInt that was specifically designed to protect against such errors. In my little prime-counting project with Julia, I stumbled over overflow problems several times, even after I understood exactly where the risks lay.
In certain other contexts Julia doesn’t play quite so fast and loose. It does runtime bounds checking on array references, throwing an error if you ask for element five of a four-element vector. But it also provides a macro (@inbounds) that allows self-confident daredevils to turn off this safeguard for the sake of speed. Perhaps someday there will be a similar option for integer overflows.
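Both halves of that bargain fit in a few lines. The sketch below assumes Julia 1.x; sum_inbounds is a name made up for the example.

```julia
v = [10, 20, 30, 40]

# Asking for element five of a four-element vector raises a BoundsError.
caught = try
    v[5]
    false
catch err
    err isa BoundsError
end

# Inside @inbounds the checks may be skipped, so the loop must be
# written to stay within the array on its own.
function sum_inbounds(xs)
    total = zero(eltype(xs))
    @inbounds for i in eachindex(xs)
        total += xs[i]
    end
    return total
end

total = sum_inbounds(v)
```

Iterating with eachindex is what makes the @inbounds annotation defensible here, since the indices come from the array itself.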
Until then, it’s up to us to write in any overflow checks we think might be prudent. The Julia developers themselves have done so in many places, including their built-in factorial function. See the error message below.
test_factorial(factorial, 25) ⇒
n = 1, f(n) = 1
n = 2, f(n) = 2
n = 3, f(n) = 6
n = 4, f(n) = 24
n = 5, f(n) = 120
n = 6, f(n) = 720
n = 7, f(n) = 5040
n = 8, f(n) = 40320
n = 9, f(n) = 362880
n = 10, f(n) = 3628800
n = 11, f(n) = 39916800
n = 12, f(n) = 479001600
n = 13, f(n) = 6227020800
n = 14, f(n) = 87178291200
n = 15, f(n) = 1307674368000
n = 16, f(n) = 20922789888000
n = 17, f(n) = 355687428096000
n = 18, f(n) = 6402373705728000
n = 19, f(n) = 121645100408832000
n = 20, f(n) = 2432902008176640000
LoadError: OverflowError()
while loading In[19], in expression starting on line 1
in factorial_lookup at combinatorics.jl:29
in factorial at combinatorics.jl:37
in test_factorial at In[18]:5
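Writing such a check into your own code need not be laborious. Julia's Base module provides checked arithmetic functions such as Base.checked_mul, which throw an OverflowError instead of wrapping. Here is a sketch of a hand-rolled factorial using it (Julia 1.x on a 64-bit machine assumed; the name factorial_checked is invented, and this is not how the built-in factorial is implemented).

```julia
# Factorial with an explicit overflow check on every multiplication.
function factorial_checked(n::Int)
    fact = 1
    for i in 1:n
        fact = Base.checked_mul(fact, i)   # throws rather than wrapping
    end
    return fact
end

ok = factorial_checked(20)

# 21! does not fit in 64 bits, so this call must raise an error.
threw = try
    factorial_checked(21)
    false
catch err
    err isa OverflowError
end
```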
Julia is gaining traction in scientific computing, in finance, and other fields; it has even been adopted as the specification language for an aircraft collision-avoidance system. I do hope that everyone is being careful.
“Batteries included” is a slogan of the Python community, boasting that everything you need is provided in the box. And indeed Python comes with a large standard library, supplemented by a huge variety of user-contributed modules (the PyPI index lists about 85,000 of them). On the other hand, although batteries of many kinds are included, they are not installed by default. You can’t accomplish much in Python without first importing a few modules. Just to take a square root, you have to import math. If you want complex numbers, they’re in a different module, and rationals are in a third.
In contrast, Julia is refreshingly self-contained. The standard library has a wide range of functions, and they’re all immediately available, without tracking down and loading external files. There are also about a thousand contributed packages, but you can do quite a lot of useful computing without ever glancing at them. In this respect Julia reminds me of Mathematica, which also tries to put all the commonly needed tools at your fingertips.
A Julia sampler: you get sqrt(x) of course. Also cbrt(x) and hypot(x, y) and log(x) and exp(x). There’s a full deck of trig functions, both circular and hyperbolic, with a variant subset that take their arguments in degrees rather than radians. For combinatorialists we have the factorial function mentioned above, as well as binomial, and various kinds of permutations and combinations. Number theorists can test numbers for primality, and factor those that are composite, calculate gcds and lcms, and so on. Farther afield we have beta, gamma, digamma, eta, and zeta functions, Airy functions, and a dozen kinds of Bessel functions.
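A few of these can be tried with no imports at all. The sketch below sticks to functions I am confident are still in Base in recent Julia 1.x releases. Which of the others are loaded automatically depends on the version, so treat the selection as an assumption rather than a catalog.

```julia
println(sqrt(2.0))
println(cbrt(27.0))
println(hypot(3.0, 4.0))              # length of the hypotenuse
println(sind(30))                     # sine with the argument in degrees
println(binomial(5, 2))               # number of 2-element subsets of 5 items
println(gcd(12, 18), " ", lcm(4, 6))
```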
For me Julia is like a well-equipped playground, where I can scamper from the swings to the teeter-totter to the monkey bars. At the recent JuliaCon Stefan Karpinski mentioned a plan to move some of these functions out of the automatically loaded Base module into external libraries. My request: Please don’t overdo it.
Julia works splendidly as a platform for incremental, interactive, exploratory programming. There’s no need to write and compile a program as a monolithic structure, with a main function as entry point. You can write a single procedure, immediately compile it and run it, test it, modify it, and examine the assembly code generated by the compiler. It’s ad hack programming at its best.
To support this style of work and play, Julia comes with a built-in REPL, or read-eval-print loop, that operates from a command-line interface. Type an expression at the prompt, press enter, and the system executes the code (compiling if necessary); the value is printed, followed by a new prompt for the next command:
The REPL is pretty good, but I prefer working in a Jupyter notebook, which runs in a web browser. There’s also a development environment called Juno, but I haven’t yet done much with it.
An incremental and iterative style of development can be a challenge for the compiler. When you process an entire program in one piece, the compiler starts from a blank slate. When you compile individual functions, which need to interact with functions compiled earlier, it’s all too easy for the various parts of the system to get out of sync. Gears may fail to mesh.
Problems of this kind can come up with any language, but multiple dispatch introduces some unique quirks. Suppose you’re writing Julia code in a Jupyter notebook. You write and compile a function f(x), which the system accepts as a “generic function with one method.” Later you realize that f should have been a function of two arguments, and so you go back to edit the definition. The modified source code now reads f(x, y). In most languages, this new definition of f would replace the old one, which would thereafter cease to exist. But in Julia you haven’t replaced anything; instead you now have a “generic function with 2 methods.” The first definition is still present and active in the internal workspace, and so you can call both f(x) and f(x, y). However, the source code for f(x) is nowhere to be found in the notebook. Thus if you end the Jupyter session and later reload the file, any calls to f(x) will produce an error message.
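The situation is easy to reproduce. In this small sketch the two definitions stand in for the "before" and "after" versions of an edited notebook cell; the function bodies are arbitrary placeholders.

```julia
f(x) = 2x          # first definition: one argument
f(x, y) = x + y    # the "edited" definition adds a second method and replaces nothing

println(length(methods(f)))   # how many methods are attached to f
println(f(3))                 # the one-argument method is still callable
println(f(3, 4))
```

Both calls succeed because both methods live in the session, even if only the second definition is still visible in the source.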
A similar but subtler problem with redefinitions has been a subject of active discussion among the Julia developers for almost five years. There’s hope for a fix in the next major release.
If redefining a function during an interactive session can lead to confusion, trying to redefine a custom type runs into an impassable barrier. Types are immutable. Once defined, they can’t be modified in any way. If you create a type T with two fields a and b, then later decide you also need a field c, you’re stuck. The only way to redefine the type is to shut down the session and restart, losing all computations completed so far. How annoying. There’s a workaround: If you put the type definition inside a module, you can simply reload the module.
Sometimes it’s the little things that arouse the most heated controversy.
In counting things, Julia generally starts with 1. A vector with n elements is indexed from 1 through n. This is also the default convention in Fortran, MATLAB, Mathematica, and a few other languages; the C family, Python, and JavaScript count from 0 through n-1. The wisdom of 1-based indexing has been a subject of long and spirited debate on the Julia issues forum, on the developer mailing list, and on StackOverflow.
Programs where the choice of indexing origin makes any difference are probably rare, but as it happens my primes study was one of them. I was looking at primes reduced modulo \(m\), and counting the number of primes in each of the \(m\) possible residue classes. Since the residues range from \(0\) to \(m - 1\), it would have been convenient to store the counts in an array with those indices. However, I did not find it terribly onerous to add \(1\) to each index.
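The add-one adjustment amounts to a single expression. The following is a sketch under the assumption of a recent Julia 1.x release, not the code from my notebook; primes_upto and residue_counts are illustrative names, and the sieve is a bare-bones stand-in for a proper prime generator.

```julia
# Sieve of Eratosthenes: all primes <= n (illustrative helper, assumes n >= 2).
function primes_upto(n::Int)
    is_prime = trues(n)
    is_prime[1] = false
    for i in 2:isqrt(n)
        if is_prime[i]
            for j in i*i:i:n
                is_prime[j] = false
            end
        end
    end
    return findall(is_prime)
end

# Tally primes by residue class mod m; residue r is stored at index r + 1.
function residue_counts(ps, m::Int)
    counts = zeros(Int, m)
    for p in ps
        counts[p % m + 1] += 1
    end
    return counts
end

counts = residue_counts(primes_upto(100), 7)
for r in 0:6
    println("residue $r: ", counts[r + 1])
end
```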
Blessedly, the indexing war may soon be over. Tim Holy, a biologist and intrepid Julia contributor, discovered a way to give users free choice of indexing with only a few lines of code and little cost in efficiency. An experimental version of this new mechanism is in the 0.5 developer release.
After the indexing truce, we can still fight over the storage order for arrays and matrices. Julia again follows the precedent of Fortran and MATLAB, reading arrays in column-major order. Given the matrix
$$\begin{matrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
\end{matrix}$$
Julia will serialize the elements as \(x_{11}, x_{21}, x_{12}, x_{22}, x_{13}, x_{23}\). C and many other recent languages read across the rows rather than down the columns. This is another largely arbitrary convention, although it may be of some practical importance that most graphic formats are row-major.
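A two-line check makes the storage order visible. In this sketch each entry encodes its own position, so that \(x_{ij}\) is written as 10i + j.

```julia
A = [11 12 13;
     21 22 23]     # entry x_ij is written as 10i + j

println(vec(A))    # the elements in storage order
println(A[2])      # linear index 2 is the second element of the first column
```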
Finally there’s the pressing question of what symbol to choose for the string-concatenation operator. Where Python has "abc" + "def" ⇒ "abcdef", Julia does "abc" * "def" ⇒ "abcdef". The reason cited is pretty doctrinaire: In abstract algebras, ‘+’ generally denotes a commutative operation, and string concatenation is not commutative. If you’d like to know more, see the extensive, long-running discussion.
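The multiplicative reading carries over to repetition, which is spelled as a power. A brief sketch:

```julia
s = "abc" * "def"
println(s)
println("abc" * "def" == "def" * "abc")   # concatenation is not commutative
println("ab"^3)                           # repeated concatenation is written as a power
```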
The official online documentation is reasonably complete, consistent, and clearly written, but it gives too few examples. Moreover, the organization is sometimes hard to fathom. For example, there’s a chapter on “Multidimensional Arrays” and another titled “Arrays”; I’m often unsure where to look first.
Some of the contributed packages have quite sparse documentation. I made heavy use of a graphics package called Gadfly, which is a Julia version of the R package called ggplot2. I had a hard time getting started with the supplied documentation, until I realized that Gadfly is close enough to ggplot2 that I could rely on the manual for the latter program.
The simple advice when you’re stumped is: Read the source, Luke. It’s all readily available and easily searched. And almost everything is written in Julia, so you’ll not only find the answers to your questions but also see instructive examples of idiomatic usage.
A few other online resources:
Julia by Example (Samuel Colvin)
Learn X in Y Minutes, Where X = Julia (Leah Hanson and others)
Introducing Julia (a Wikibook)
The Julia Express (Bogumił Kamiński)
Videos of the talks from JuliaCon 2016 are on YouTube.
As far as I know, Julia has no standard or specification document, which is an interesting absence. For a long time, programming languages had rather careful, formal descriptions of syntax and semantics, laying out exactly which expressions were legal and meaningful in a program. Sometimes the standard came first, before anyone tried to write a compiler or interpreter; Algol 60 was the premier example of this discipline. Sometimes the standard came after, as a device to reconcile diverging implementations; this was the situation with Lisp and C. For now, at least, Julia is apparently defined only by its evolving implementation. Where’s the BNF?
The rise and fall of programming languages seems to be governed by a highly nonlinear dynamic. Why is Python so popular? Because it’s so popular! With millions of users, there are more add-on packages, more books and articles, more conferences, more showcase applications, more jobs for Python programmers, more classes where Python is the language of instruction. Success breeds success. It’s a bandwagon effect.
If this rich-get-richer economy were the whole story, the world would have only one viable programming language. But there’s a price to be paid for popularity: The overcrowded bandwagon gets trapped in its own wheel ruts and becomes impossible to steer. Think of those 85,000 Python packages. If an improvement in the core language is incompatible with too much of that existing code, the change can be made only with great effort. For example, persuading Python users to migrate from version 2 to version 3 has taken eight years, and the transition is not complete yet.
Julia is approaching the moment when these contrary imperatives (to gain momentum but to stay maneuverable) both become powerful. At the recent JuliaCon, Stefan Karpinski gave a talk on the path to version 1.0 and beyond. The audience listened with the kind of studious attention that Janet Yellen gets when she announces Fed policy on inflation and interest rates. Toward the end, Karpinski unveiled the plan: to release version 1.0 within a year, by the time of the 2017 JuliaCon. Cheers and applause.
The 1.0 label connotes maturity, or at least adulthood. It’s a signal that the language is out of school and ready for work in the real world. And thus begins the struggle to gain momentum while continuing to innovate. For members of Julia’s developer group, there might be a parallel conflict between making a living and continuing to have fun. I’m fervently hoping all these tensions can be resolved.
Will Julia attract enough interest to grow and thrive? I feel I can best address that question by turning it back on myself. Will I continue to write code in Julia?
I’ve said that I do exploratory programming, but I also do expository programming. Computational experiments are an essential part of my work as a science writer. They help me understand what I’m writing about, and they also aid in the storytelling process, through simulations, computer-generated illustrations, and interactive gadgets. Although only a small subset of my readers ever glance at the code I publish, I want to maximize their number and do all I can to deepen their engagement—enticing them to read the code, run it, modify it, and ideally build on it. At this moment, Julia is not the most obvious choice for reaching a broad audience of computational dabblers, but I think it has considerable promise. In some respects, getting started with Julia is easier than it is with Python, if only because there’s less prior art—less to install and configure. I expect to continue working in other languages as well, but I do want to try some more Julia experiments.
Yet that’s not the end of the story. The distribution of the primes looks random, with irregular gaps and clusters that seem quite haphazard. If there’s a pattern, it’s inscrutable. As a matter of fact, the primes look random enough that you could play dice with them. Make a list of consecutive prime numbers (perhaps starting with 11, 13, 17, 19, . . . ) and reduce them modulo 7. In other words, divide each prime by 7 and keep only the remainder. The result is a sequence of integers drawn from the set {1, 2, 3, 4, 5, 6} that looks much like the outcome of repeatedly rolling a fair die.
$$\begin{align*}
11 \bmod 7 & \rightarrow 4 \qquad 47 \bmod 7 \rightarrow 5 \\
13 \bmod 7 & \rightarrow 6 \qquad 53 \bmod 7 \rightarrow 4 \\
17 \bmod 7 & \rightarrow 3 \qquad 59 \bmod 7 \rightarrow 3 \\
19 \bmod 7 & \rightarrow 5 \qquad 61 \bmod 7 \rightarrow 5 \\
23 \bmod 7 & \rightarrow 2 \qquad 67 \bmod 7 \rightarrow 4 \\
29 \bmod 7 & \rightarrow 1 \qquad 71 \bmod 7 \rightarrow 1 \\
31 \bmod 7 & \rightarrow 3 \qquad 73 \bmod 7 \rightarrow 3 \\
37 \bmod 7 & \rightarrow 2 \qquad 79 \bmod 7 \rightarrow 2 \\
41 \bmod 7 & \rightarrow 6 \qquad 83 \bmod 7 \rightarrow 6 \\
43 \bmod 7 & \rightarrow 1 \qquad 89 \bmod 7 \rightarrow 5 \\
\end{align*}$$
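The reduction in the table above is a one-liner. In this sketch the list of primes is typed in by hand, so that nothing depends on a prime-generating library.

```julia
ps = [11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
      47, 53, 59, 61, 67, 71, 73, 79, 83, 89]

residues = [p % 7 for p in ps]   # remainder on division by 7
println(residues)
```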
Working with a larger sample (the first million primes greater than \(10^7\)), I have tallied up the number of primes with each of the six possible remainders mod 7 (otherwise known as the six possible congruence classes mod 7). I have also simulated a million rolls of a six-sided die. Looking at the results of these two exercises, can you tell which is which?
        1        2        3        4        5        6
  166,787  166,569  166,714  166,573  166,665  166,692
      120      -98       47      -94       -2       25

        1        2        3        4        5        6
  166,768  166,290  166,412  166,638  167,282  166,610
      101     -377     -255      -29      615      -57
In each table the first line counts the number of outcomes, \(x\), in each of the six classes; the second line shows the difference \(x - \bar{x}\), where \(\bar{x}\) is the mean value 1,000,000 / 6 = 166,667. In both cases the numbers seem to be spread pretty evenly, without any obvious biases. The first table represents the prime residues mod 7. They have a somewhat flatter distribution than the simulated die, with smaller departures from the mean; the standard deviations of the two samples are 84 and 346 respectively. On the evidence of these tables it looks like either process could supply the randomness needed for a casual game of dice.
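The two standard deviations can be recomputed from the counts in the tables. The sketch below uses the sample formula, with \(n - 1\) in the denominator, which is my assumption about the convention behind the quoted figures; sample_std is an illustrative name.

```julia
# Sample standard deviation, dividing by n - 1.
function sample_std(xs)
    n = length(xs)
    mean_x = sum(xs) / n
    return sqrt(sum((x - mean_x)^2 for x in xs) / (n - 1))
end

table1 = [166787, 166569, 166714, 166573, 166665, 166692]   # first table
table2 = [166768, 166290, 166412, 166638, 167282, 166610]   # second table

println(round(Int, sample_std(table1)))
println(round(Int, sample_std(table2)))
```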
There’s more to randomness, however, than just ensuring that the results are evenly distributed across the allowed range. Individual events in the series must also be independent of one another. One roll of a die should have no effect on the outcome of the next roll. As a test of independence, we can look at pairs of successive events. How many times is a 1 followed by another 1, by a 2, by a 3, and so on? A 6 × 6 matrix serves to record the counts of the 36 possible pairs. If the process is truly random, all 36 pairs should be equally frequent, apart from small statistical fluctuations. We can turn the matrix into a color-coded “heatmap,” where cells with higher-than-average counts are shown in warm shades of pink and red, while those below the mean are filled with cooler shades of blue. (The quantity plotted is not the actual count \(x\) but a normalized variable \(w = (x_{i,\,j} - \bar{x})\, /\, \bar{x}\), where \(\bar{x}\) is again the mean value—in this case 1,000,000 / 36 = 27,778.) Here is the heatmap for the simulated fair die:
Figure 1.
Not much going on there. Almost all the counts are so close to the mean value that the matrix cells appear as a neutral gray; a few are very pale pink or blue. It’s just what you would expect if consecutive rolls of a die are uncorrelated, and all possible pairs are equally likely.
Now for the corresponding matrix of consecutive primes mod 7:
Figure 2.
Well! I guess we’re not in Randomland anymore; this is where the old gray movie turns into Technicolor. The heatmap has a blue streak along the main diagonal (upper left to lower right), indicating that consecutive pairs of primes that have the same value mod 7 are strongly suppressed. In other words, the pairs \((1, 1), (2, 2), \ldots (6, 6)\) appear less often than they would in a truly random sequence. The superdiagonal (just above the main diagonal) is a lighter blue, meaning that \((i, j)\) pairs with \(j=i+1\) are seen at a little less than average frequency; for example, \((2, 3)\) and \((5, 6)\) have slightly negative values of normalized frequency. On the other hand, the subdiagonal (below the main diagonal) is all pink and red; pairs such as \((3, 2)\) or \((5, 4)\), with \(j=i-1\), occur with higher than average frequency. Away from the diagonal, in the upper right and lower left corners, we see a pastel checkerboard pattern.
If you’d prefer to squint at numbers rather than colored squares, here’s the underlying matrix:
pairs of consecutive primes mod 7
          1      2      3      4      5      6
   1  15656  24376  33891  29964  33984  28916
   2  37360  15506  22004  32645  25095  33959
   3  25307  41107  14823  22747  32721  30009
   4  32936  26183  37129  14681  21852  33791
   5  24984  34207  26231  41154  15560  24529
   6  30543  25190  32636  25382  37453  15488
The departures from uniformity are anything but subtle. The third row, for example, shows that if you have just seen a 3 in the sequence of primes mod 7, the next number is much more likely to be a 2 than another 3. If you were wagering on a game played with prime-number dice, this bias could make a huge difference in the outcome. The prime dice are rigged!
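For anyone who wants to rebuild a matrix of this kind, here is a minimal sketch, assuming a recent Julia 1.x release. It uses the primes between 7 and \(10^6\) rather than the million-prime sample, so the counts will differ from those above; primes_upto and pair_counts are illustrative names, not functions from the notebook.

```julia
# Sieve of Eratosthenes: all primes <= n (illustrative helper, assumes n >= 2).
function primes_upto(n::Int)
    is_prime = trues(n)
    is_prime[1] = false
    for i in 2:isqrt(n)
        if is_prime[i]
            for j in i*i:i:n
                is_prime[j] = false
            end
        end
    end
    return findall(is_prime)
end

# counts[a + 1, b + 1] is the number of times residue a is followed by residue b.
function pair_counts(ps, m::Int)
    counts = zeros(Int, m, m)
    for k in 1:length(ps)-1
        counts[ps[k] % m + 1, ps[k+1] % m + 1] += 1
    end
    return counts
end

ps = filter(p -> p > 7, primes_upto(1_000_000))
M = pair_counts(ps, 7)[2:end, 2:end]   # drop the empty residue-0 row and column
mean_count = sum(M) / 36
w = (M .- mean_count) ./ mean_count    # the normalized variable plotted in the heatmaps

println(M)
println(sum(M[i, i] for i in 1:6) / sum(M))   # share of pairs on the main diagonal
```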
These remarkably strong correlations in pairs of consecutive primes were discovered by Robert J. Lemke Oliver and Kannan Soundararajan of Stanford University, who discuss them in a preprint posted to the arXiv in March. What I find most surprising about the discovery is that no one noticed these patterns long ago. They are certainly conspicuous enough once you know how to look for them.
I suppose we can’t fault Euclid for missing them; ideas about randomness and probability were not well developed in antiquity. But what about Gauss? He was a connoisseur of prime tables, and he compiled his own lists of thousands of primes. In his youth, he wrote, “one of my first projects was to turn my attention to the decreasing frequency of primes, to which end I counted the primes in several chiliads . . . .” Furthermore, Gauss more or less invented the idea of congruence classes and modular arithmetic. But apparently he never suspected there might be anything odd lurking in the congruences of pairs of consecutive primes.
In the 1850s the Russian mathematician Pafnuty Lvovich Chebyshev pointed out a subtle bias in the primes. Reducing the odd primes modulo 4 splits them into two subsets. All primes in the sequence 5, 13, 17, 29, 37, . . . are congruent to 1 mod 4; those in the sequence 3, 7, 11, 19, 23, 31, . . . are congruent to 3 mod 4. Chebyshev observed that primes in the latter category seem to be more abundant. Among the first 10,000 odd primes, for example, there are 4,943 congruent to 1 and 5,057 congruent to 3. However, the effect is tiny compared with the disparities seen in pairs of consecutive primes.
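A tally of this kind takes only a few lines. The sketch below (recent Julia 1.x assumed, primes_upto an illustrative helper) takes the first 10,000 odd primes from a sieve and splits them by residue mod 4; the exact counts it prints depend on precisely which sample is taken.

```julia
# Sieve of Eratosthenes: all primes <= n (illustrative helper, assumes n >= 2).
function primes_upto(n::Int)
    is_prime = trues(n)
    is_prime[1] = false
    for i in 2:isqrt(n)
        if is_prime[i]
            for j in i*i:i:n
                is_prime[j] = false
            end
        end
    end
    return findall(is_prime)
end

odd_primes = primes_upto(110_000)[2:10_001]   # skip 2, keep the next 10,000 primes
n1 = count(p -> p % 4 == 1, odd_primes)
n3 = count(p -> p % 4 == 3, odd_primes)
println("1 mod 4: $n1   3 mod 4: $n3")
```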
In modern times a few authors have reported glimpses of the consecutive-primes phenomenon; Lemke Oliver and Soundararajan mention three such sightings. (See references at the end of this article.) In the 1950s and 60s, Stanislaw Knapowski and Paul Turán investigated various aspects of prime residues mod m; in one paper, published posthumously in 1977, they discuss consecutive primes mod 4, with residues of 1 or 3. They “guess” that consecutive pairs with the same residue and those with different residues “are not equally probable.” In 2002 Chung-Ming Ko looked at sequences of consecutive primes (not just pairs of them) and constructed elaborate fractal patterns based on their varying frequencies. Then in 2011 Avner Ash and colleagues published an extended analysis of “Frequencies of Successive Pairs of Prime Residues,” including some matrices in which the diagonal depression is clearly evident.
Given these precedents, are Lemke Oliver and Soundararajan really the discoverers of the consecutive prime correlations? In my view the answer is yes. Although others may have seen the patterns before, they did not describe them in a way that registered on the consciousness of the mathematical community. As a matter of fact, when Lemke Oliver and Soundararajan announced their findings, the response was surprise verging on incredulity. Erica Klarreich, writing in Quanta, cited the reaction of James Maynard, a number theorist at Oxford:
When Soundararajan first told Maynard what the pair had discovered, “I only half believed him,” Maynard said. “As soon as I went back to my office, I ran a numerical experiment to check this myself.”
Evidently that was a common reaction. Evelyn Lamb, writing in Nature, quotes Soundararajan: “Every single person we’ve told this ends up writing their own computer program to check it for themselves.”
Well, me too! For the past few weeks I’ve been noodling away at lots of code to analyze primes mod m. What follows is an account of my attempts to understand where the patterns come from. My methods are computational and visual more than mathematical; I can’t prove a thing. Lemke Oliver and Soundararajan take a more rigorous and analytical approach; I’ll say a little more about their results at the end of this article.
If you would like to launch your own investigation, you’re welcome to use my code as a starting point. It is written in the Julia programming language, packed up in a Jupyter notebook, and available on GitHub. (Incidentally, this program was my first nontrivial experiment with Julia. I’ll have more to say about my experience with the language in a later post.)
All the examples presented above concern primes taken modulo 7, but there’s nothing special about the number 7 here. I chose it merely because the six possible remainders {1, 2, 3, 4, 5, 6} match the faces of an ordinary cubical die. Other moduli give similar results. Lemke Oliver and Soundararajan do much of their analysis with primes modulo 3, where there are only two congruence classes: A prime greater than 3 must leave a remainder of either 1 or 2 when divided by 3. This is the matrix of pair counts for the first million primes above \(10^7\):
           1       2
   1  218578  281453
   2  281454  218514
Figure 3.
The pattern is rather minimalist but still recognizable: The off-diagonal entries for sequences \((1, 2)\) and \((2, 1)\) are larger than the on-diagonal entries for \((1, 1)\) and \((2, 2)\).
Primes modulo 10 have four congruence classes: 1, 3, 7, 9. Working in decimal notation, we don’t even need to do any arithmetic to see this. When numbers are written in base 10, every prime greater than 5 has a trailing digit of 1, 3, 7, or 9. Here are the frequency counts for the 16 pairs of consecutive final digits:
          1      3      7      9
   1  43811  76342  78170  51644
   3  58922  41148  71824  78049
   7  64070  68542  40971  76444
   9  83164  63911  59063  43924
Figure 4.
The blue stripe along the main diagonal is clearly present, although elsewhere in the matrix the pattern is somewhat muted and muddled.
I have found that the correlations between successive primes show through most clearly when the modulus itself is a prime and also is not too small. Take a look at the heatmaps for consecutive primes mod 13 and mod 17:
Figure 5.
Figure 6.
Or how about mod 31?
Figure 7.
These would make great patterns for a quilt or a tiled floor, no? And there are interesting regularities visible in all of them. Diagonal stripes are prominent not just on the main corner-to-corner diagonal but throughout the matrix. Those stripes also generate a checkerboard motif; along any row or column, cells tend to alternate between red and blue. A subtler feature is an approximate bilateral symmetry across the antidiagonal (which runs from lower left to upper right). If you were to fold the square along this line, the cells brought together would be closely matched in color. (This is a fact noticed by Ash and his co-authors.)
As a focus of further analysis I have settled on looking at consecutive primes mod 19, a modulus large enough to yield clearly differentiated stripes but not so large as to make the matrix unwieldy.
Figure 8.
How to make sense of what we’re seeing? A starting point is the observation that all the primes in our sample are odd numbers, and hence all the intervals between those primes are even numbers. For any given prime \(p\), the next candidates for primehood are \(p+2, p+4, p+6, \ldots\). Could this have something to do with the checkerboard pattern? If the steps between primes must be multiples of 2, that could certainly create correlations between every second cell in a given column or row. (Indeed, the every-other-cell correlations would be starkly obvious—all even-numbered entries would be exactly zero—if the modulus were an even number. It is only by “wrapping around” the edge of the matrix at an odd boundary that any of the even-numbered cells can be populated.)
The diagonal stripes in the matrix suggest strong correlations between all pairs of primes separated by a certain numerical interval. For example, the deepest blue diagonal and the brightest red diagonal are formed by cells separated by six places along the j axis. In the first row are cells 1 and 7, then 2 and 8, 3 and 9, and so on. It occurred to me that this relationship would be easier to perceive if I could “twist” the matrix, so that diagonals became columns. The idea is to apply a cyclic shift to each row; all the values in the row slide to the left, and those that fall off the left edge are reinserted on the right. The first row shifts by zero positions, the next row by one position, and so on. (Is there a name for this transformation? I’m just calling it a twist.)
When I wrote the code to apply this transformation, the result was not quite what I expected:
Figure 9.
What are those zigzags all along the antidiagonal? I guessed that I must have an off-by-one error. Indeed this was the nature of the problem, though the bug lay in the data, not the algorithm. The matrices I’ve displayed in all the figures above are only partial; they suppress empty congruence classes. In particular, the matrix for pairs of primes modulo 19 ignores all primes congruent to 0 mod 19—on the sensible-sounding grounds that there are no such primes. After all, if \(p > 19\) and \(p \equiv 0 \bmod 19\), then \(p\) cannot be prime because it is divisible by 19. Nevertheless, a row and a column for \(p \equiv 0 \bmod 19\) are properly part of the matrix. When they are included, the color-coded tableau looks like this:
Figure 10.
The presence of the zero row and column makes the definition of the twist transformation somewhat tidier: For each row \(i\), apply a leftward cyclic shift of \(i\) places. The resulting twisted matrix is also tidier:
Figure 11.
What do those vertical stripes tell us? In the original matrix, entry \(i, j\) represents the frequency with which \(i \bmod 19\) is followed by \(j \bmod 19\). Here, the color in cell \(i, j\) indicates the frequency with which \(i \bmod 19\) is followed by \((i + j) \bmod 19\). In other words, each column brings together entries with same interval mod 19 between two primes. For example, the leftmost column includes all pairs separated by an interval of length \(0 \bmod 19\), and the bright red column at \(j = 6\) counts all the cases where successive primes are separated by \(6 \bmod 19\).
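Here is one way the twist might be written, as a sketch rather than the notebook's own code. It assumes a square matrix that includes the zero row and column, so that the row at 1-based index i holds residue i - 1 and is shifted left by i - 1 places; twist is an illustrative name.

```julia
# Cyclically shift each row leftward by its residue (row index minus 1).
function twist(M::AbstractMatrix)
    T = similar(M)
    for i in 1:size(M, 1)
        T[i, :] = circshift(M[i, :], -(i - 1))   # a negative shift moves entries left
    end
    return T
end

m = 5
M = [10 * (i - 1) + (j - 1) for i in 1:m, j in 1:m]   # entry encodes (row, column) residues
T = twist(M)
println(T)
println(vec(sum(T, dims=1)))   # column totals: one number per interval mod m
```

In the twisted matrix the entry for row residue i and column j is the original entry for the pair (i, (i + j) mod m), so summing down a column collects every pair with the same interval.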
The color coding gives a qualitative impression of which intervals are seen more or less commonly. For a more precise quantitative measure, we can sum along the columns and display the totals in a bar chart:
Figure 12.
Three observations:
I wanted to understand the origin of these patterns. What makes interval 6 such a magnet for pairs of consecutive primes, and why do almost all the primes shun poor interval 0?
For the popularity of 6, I already had an inkling. In the 1990s Andrew Odlyzko, Michael Rubinstein, and Marek Wolf undertook a computational study of prime “jumping champions”:
An integer D is called a jumping champion if it is the most frequently occurring difference between consecutive primes ≤ x for some x.
Among the smallest primes (x less than about 600), the jumping champion is usually 2, but then 6 takes over and dominates for quite a long stretch of the number line. Somewhere around \(x = 10^{35}\), 6 cedes the championship to 30, which eventually gives way to 210. Odlyzko et al. estimate that the latter transition takes place near \(x = 10^{425}\). The numbers in this sequence of jumping champions—2, 6, 30, 210, . . . —are the primorials; the nth primorial is the product of the first n primes.
Why should primorials be the favored intervals between consecutive primes? If \(p\) is a large enough prime, then \(p + 2\) cannot be divisible by 2, \(p + 6\) cannot be divisible by either 2 or 3, \(p + 30\) cannot be divisible by 2, 3, or 5, and in general \(p + P_{n}\), where \(P_{n}\) is the nth primorial, cannot be divisible by any of the first n primes. Of course \(p + P_{n}\) might still be divisible by some larger prime, or there might be another prime between \(p\) and \(p + P_{n}\), so that the interprime interval is certainly not guaranteed to be a primorial. But these intervals have an edge over other contenders.
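The arithmetic is easy to check. In this sketch the primorials come from a running product of the first few primes, and a single arbitrarily chosen prime, 101, illustrates the divisibility argument.

```julia
first_primes = [2, 3, 5, 7]
primorials = cumprod(first_primes)   # running products of the first n primes
println(primorials)

p = 101   # an arbitrary prime larger than 7
for (n, P) in enumerate(primorials)
    # p + P leaves the same remainder as p on division by each of the first n primes
    println(P, ": ", [(p + P) % q for q in first_primes[1:n]])
end
```

None of the printed remainders is zero, so none of the first n primes divides \(p + P_{n}\).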
We can see this reasoning in action by taking the differences between successive elements in our list of a million eight-digit primes, then plotting their frequencies:
Figure 13.
Again interval 6 is the clear standout, accounting for 13.7 percent of the total; higher multiples of 6 also poke above their immediate neighbors. And note the overall shape of the distribution: a lump at the left (with a peak at 6), followed by a steady decline. The trend looks a little like a Poisson distribution, and indeed this is thought to be the correct description.
The color scheme slices the data set into tranches of 19 values each. The blue tranche, which includes inter-prime intervals of length 0 to 18, accounts for 68 percent of all the intervals present in the sample of a million primes; the gold tranche adds another 23 percent. The remaining 9 percent are spread widely and thinly. Not all of the intervals are shown in the graph; the spectrum extends as far as 210. (A single pair of consecutive primes in the sample has a separation of 210, namely 20,831,323 and 20,831,533.)
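A frequency table of gaps can be assembled as follows. This sketch assumes a recent Julia 1.x release and runs on the primes below 100, just to keep the output short; for Figure 13 the input would be the million-prime sample. The names primes_upto and gap_frequencies are illustrative.

```julia
# Sieve of Eratosthenes: all primes <= n (illustrative helper, assumes n >= 2).
function primes_upto(n::Int)
    is_prime = trues(n)
    is_prime[1] = false
    for i in 2:isqrt(n)
        if is_prime[i]
            for j in i*i:i:n
                is_prime[j] = false
            end
        end
    end
    return findall(is_prime)
end

# Map each gap size to the number of times it occurs between consecutive primes.
function gap_frequencies(ps)
    freq = Dict{Int,Int}()
    for g in diff(ps)
        freq[g] = get(freq, g, 0) + 1
    end
    return freq
end

ps = primes_upto(100)
freq = gap_frequencies(ps)
for g in sort(collect(keys(freq)))
    println("gap $g: ", freq[g])
end
println(count(g -> g < 19, diff(ps)) / (length(ps) - 1))   # share of gaps in the first tranche
```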
Figure 13 seems to reveal a great deal about the patterns of consecutive primes mod 19. I can make the graph even more informative with a simple rearrangement. Slide each 19-element tranche to the left until it aligns with the 0 tranche, stacking up bars that fall in the same column. Thus the second (gold) tranche moves left until bar 19 lines up with bar 0, and the third (rose) tranche brings together bar 38 with bar 0. Physically, this process can be imagined as wrapping the graph around a cylinder of circumference 19; mathematically, it amounts to reducing the inter-prime intervals modulo 19.
Figure 14.
If you ignore the garish colors, Figure 14 is identical to Figure 12: All the bar heights match up. This should not be a surprise. In Figure 12 we reduce the primes modulo 19 and then take the differences between successive reduced primes; in Figure 14 we take the differences between the primes themselves and then reduce those differences modulo 19. The two procedures are equivalent:
\[(p \bmod 19 - q \bmod 19) \bmod 19 = (p - q) \bmod 19.\]
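The identity can be spot-checked numerically. One caution for this sketch: Julia's % operator (rem) takes the sign of its first argument, so the outer reduction should use mod, which always returns a value from 0 to 18 here.

```julia
# Check the identity for every pair of integers in a small range.
same = all(mod(mod(p, 19) - mod(q, 19), 19) == mod(p - q, 19)
           for p in 0:200, q in 0:200)
println(same)

println(rem(-3, 19))   # keeps the sign of the first argument
println(mod(-3, 19))   # always lands in 0:18
```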
Looking at the colors now, the pieces of the puzzle fall into place. Why are primes mod 19 so often separated by an interval of 6? Well, “mod 19” has very little to do with it; 6 itself is by far the most common interval between primes in this sample. The only other nonnegligible contribution to \(\delta \equiv 6 \bmod 19\) comes from the third tranche, specifically a few pairs of primes at a distance of 44.
The predominance of the first tranche also explains the disparity between odd and even intervals. All the intervals in the first tranche are necessarily even; odd intervals (mod 19) begin to appear only with the second tranche (intervals 19 to 37) and for that reason alone they are less well populated. For the eight-digit primes in this sample, more than two-thirds of consecutive pairs are closer than 19 units and thus wind up in the first tranche. (The median spacing between the primes is 12. The mean interval is 16.68, in close accord with the theoretical prediction of 16.72.)
Finally, Figure 14 also has something to say about the rarity of 0 intervals mod 19. No two consecutive primes can fall into the same congruence class mod 19 unless they are separated by a distance of 38 or some multiple of 38. Thus such pairs don’t enter the scene until the beginning of the third tranche, and there can’t be very many of them. The million-prime sample has 8,384 consecutive pairs at a distance of 38—less than 1 percent of the total. This is the main reason that a prime-number die so rarely shows the same face twice in a row. It’s the origin of the blue diagonal streak in all the matrices.
I find it interesting that we can explain so much about the pattern of consecutive primes mod m without delving into any of the deep and distinctive properties of prime numbers. In fact, we can replicate much of the pattern without introducing primes at all.
Two hundred years ago, Gauss and Legendre observed that in the neighborhood of a number \(x\), the fraction of all integers that are prime is about \(1 / \log x\). In 1936 the Swedish mathematician Harald Cramér suggested that we interpret this fraction as a probability. The idea is to go through the integers in sequence, accepting each \(x\) with probability \(1 / \log x\). The numbers in the accepted set will not be prime except by coincidence, but they will have the same large-scale distribution as the primes. Here are the first few entries in a list of a million such “Cramér primes,” where the random selection process started with \(x = 10^7\):
10000007 10000022 10000042 10000065 10000068 10000098 10000110 10000116 10000119 10000128 10000166
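A minimal sketch of the selection process follows. The function name `cramer_primes`, the seed, and the sample size of 10,000 are assumptions made for illustration; the list above came from a different random stream, so the numbers will not match.

```julia
using Random

# Walk through the integers from `start`, accepting each x independently
# with probability 1/log(x), until `n` numbers have been collected.
function cramer_primes(n; start = 10^7, rng = Random.default_rng())
    accepted = Int[]
    x = start
    while length(accepted) < n
        if rand(rng) < 1 / log(x)
            push!(accepted, x)
        end
        x += 1
    end
    return accepted
end

sample = cramer_primes(10_000; rng = MersenneTwister(2016))
println(sample[1:10])
println("mean gap: ", (sample[end] - sample[1]) / (length(sample) - 1))
```

Near \(10^7\) the acceptance probability is about 1/16, so the mean gap should land close to the natural log of the starting point.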
Now suppose we put these numbers through the same machinery we applied to the primes. We’ll reduce each Cramér prime mod 19 and then construct the 19 × 19 matrix of successors:
Figure 15.
The prominent diagonal features look familiar, but they are much simpler than those in the corresponding prime diagrams. For any Cramér prime p mod 19, the most likely successor is p + 1 mod 19, and the least likely is p + 19 mod 19. Between these extremes there’s a smooth gradient in frequency or probability, with just a few small fluctuations that can probably be attributed to statistical noise.
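The successor tally itself takes only a few lines. This is a sketch under one assumption worth flagging: rows index the current number's residue and columns index its successor's, which may or may not match the orientation of the figures. The name `successor_matrix` is illustrative.

```julia
# counts[a + 1, b + 1] = number of times a number congruent to a (mod m) is
# immediately followed in the sequence by a number congruent to b (mod m).
# Residues run 0..m-1, so the +1 shifts them onto Julia's 1-based indices.
function successor_matrix(seq, m)
    counts = zeros(Int, m, m)
    for i in 1:length(seq)-1
        counts[mod(seq[i], m) + 1, mod(seq[i+1], m) + 1] += 1
    end
    return counts
end

# Tiny demonstration with the first six Cramér primes listed earlier.
listed = [10000007, 10000022, 10000042, 10000065, 10000068, 10000098]
M = successor_matrix(listed, 19)
println(sum(M), " pairs tallied")
```

A sequence of n numbers yields n − 1 pairs, so a million-entry sample spreads just under a million counts over the 361 cells.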
One thing that’s missing from this matrix is the checkerboard motif. We can restore some of that structure by generating a new set of numbers I call Cramér semiprimes. They are formed by the same kind of probabilistic sieving of the integers, but this time we consider only the odd numbers as candidates, and adjust the probability to \(2\, / \log x\) to keep the overall density the same:
Figure 16.
That’s more like it! With all even numbers excluded from the sequence, the minimum interval between semicramers is 2, and that is also the likeliest spacing.
With one further modification, we get an even closer imitation of the true prime matrix. In addition to excluding all integers divisible by 2, we knock out those divisible by 3, and adjust the probability of selection accordingly. Call the resulting numbers Cramér demisemiprimes.
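Both restricted variants can be generated by one routine. In this sketch the parameter `wheel` (2 for odd numbers only, 6 for numbers divisible by neither 2 nor 3) and the function names are hypothetical. The rescaled probability \(3 / \log x\) for the second variant is an inference from the stated goal of keeping the overall density unchanged, since only one-third of the integers survive the exclusion.

```julia
using Random

# How many of 1..w are coprime to w (Euler's totient, by brute force).
count_coprime(w) = count(k -> gcd(k, w) == 1, 1:w)

# Cramér-style sampling restricted to integers coprime to `wheel`.
function cramer_variant(n, wheel; start = 10^7, rng = Random.default_rng())
    # Fraction of integers that survive the exclusion:
    # 1/2 for wheel = 2, and 2/6 = 1/3 for wheel = 6.
    surviving = count_coprime(wheel) / wheel
    accepted = Int[]
    x = start
    while length(accepted) < n
        # Dividing by `surviving` turns 1/log(x) into 2/log(x) or 3/log(x).
        if gcd(x, wheel) == 1 && rand(rng) < 1 / (surviving * log(x))
            push!(accepted, x)
        end
        x += 1
    end
    return accepted
end

odd_only = cramer_variant(10_000, 2; rng = MersenneTwister(1))
no_2_or_3 = cramer_variant(10_000, 6; rng = MersenneTwister(2))
println("smallest gaps: ", minimum(diff(odd_only)), " and ", minimum(diff(no_2_or_3)))
```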
Figure 17.
Note that 6 mod 19 is the likeliest interval between Cramér demisemiprimes, just as it is between true primes, and there are the same echoes at intervals of 12 and 18. Indeed, the only conspicuous difference between this matrix and Figure 10 (the corresponding diagram for true primes) is in the column and the row for numbers congruent to 0 mod 19. There can be no such numbers among the primes. If we eliminate them also from the Cramér numbers, the two matrices become hard to distinguish. Here they are side by side:
Figure 18.
If you look closely, there are differences to be found (check out the diagonal extending southeast from row 1, column 15), but overall these modified Cramér numbers are shockingly good fake primes. Even the symmetry about the antidiagonal is visible in both diagrams. And keep in mind that the two sets have only about 19 percent of their values in common; the Cramérs include 189,794 genuine primes.
I have one more twist to add to the tale. All the examples above are based on primes (or prime analogues) of eight decimal digits, or in other words numbers in the vicinity of \(10^7\). Do the same conclusions hold up with larger primes? Consider the tableau created by consecutive pairs of a million 40-digit primes, taken mod 19. The pattern is familiar but faded:
Figure 19.
Going on to primes of 400 digits each, again reduced mod 19, we find the colors have been bleached almost to oblivion:
Figure 20.
The blue streak on the main diagonal is barely discernible, and other features amount to mere random mottling.
Thus it seems that size matters when it comes to pairs of consecutive primes. For a hint as to why, take a look at the tally of differences between successive primes for the 40-digit sample:
Figure 21.
Compared with the distribution of intervals for eight-digit primes (Figure 13), the spectrum is much broader and flatter. In this representation the graph is truncated at interval 240; the long tail actually stretches far to the right, with the largest gap between consecutive primes at 1,328. Also, as predicted by Odlyzko and his colleagues, the most frequent interval between 40-digit primes is not 6 but 30.
Because of the wider distribution of intervals, the first tranche cannot dominate the behavior of the system in the way it does with eight-digit primes. When we stack up the tranches mod 19 (Figure 22, below), the first six or eight tranches all make substantial contributions. The odd-even alternation is still present, but the amplitude of these oscillations is much reduced. The leftmost bar in the graph, representing intervals congruent to 0 mod 19, is stunted but not as severely.
Figure 22.
The flattening of the spectrum becomes even more pronounced in the sample of a million 400-digit primes:
Figure 23.
Now the gaps between primes extend all the way out to 15,080, creating almost 800 tranches mod 19 (though only 13 are shown). And there’s a lot of intriguing, comblike structure in the array. In general, bars at multiples of 6 stand out at almost double the height of their near neighbors, showing the continuing influence of the smallest prime factors 2 and 3. Multiples of 6 that are also multiples of 30 reach even greater heights. Values in the sequence 42, 84, 126, 168, 210, . . . are also enhanced; these numbers are multiples of 42 = 2 × 3 × 7. And notice that 210, which is a multiple of 6 and 30 and 42, is the new champion interval, again supporting an Odlyzko prediction.
Despite all this intricate structure, when the bars are stacked up mod 19, the mixing of the 800 tranches is so thorough that the heights are almost uniform. All that’s left is a tiny bit of even-odd disparity.
Figure 24.
And the chronically unpopular class of intervals congruent to 0 mod 19 has finally caught up with its peers. Most of the height of the bars comes not from the dozen early tranches but from the hundreds of later ones representing intervals between 228 and 15,080 (all lumped together in the teal green area of the graph).
The experiments with large primes suggest a plausible surmise: As the size of the primes goes to infinity, all traces of correlations will fade to gray, and consecutive pairs of primes will be as random as rolls of an ideal die. But is it so? There are several reasons to be skeptical of this hypothesis. First, if we scale the modulus m along with the size of the primes—making it comparable in magnitude to the median gap between primes—the correlations may still show through. For my 40-digit sample, the median gap between primes is 66, so let’s look at the successive-pairs matrix mod 61. (To limit statistical noise, I did this computation with a sample of 10 million 40-digit primes rather than 1 million.)
Figure 25.
The stripes are back! Indeed, in addition to the familiar bright red stripes at intervals of 6, there’s a more diffuse pink-and-blue undulation with a period of 30. I would love to see a matrix for primes of 400 digits, which might well have even more complex features, with interacting waves at periods of 6, 30, and 210. Regrettably, I can’t show you that figure. The median gap between 400-digit primes is about 640, so we’d need to set m equal to a prime in this range, say 641. Filling that 641 × 641 matrix would require about a billion consecutive 400-digit primes, which is more than I’m prepared to calculate.
There are other reasons to doubt that the correlations disappear entirely as the primes grow larger. The comblike structure seen so clearly in Figures 21 and 23 suggests that rules of divisibility by small primes have a major influence on the distribution of large primes mod m—and this influence does not wane when the primes grow larger still. Furthermore, even when m is much smaller than the median inter-prime interval, the blue streak remains faintly visible. Here is the matrix for pairs of consecutive 400-digit primes mod 3:
mod 3        1        2
1       248291   251128
2       251127   249453
Differences between on-diagonal and off-diagonal elements are much smaller than with eight-digit primes (compare Figure 3), but the discrepancies still don’t look like random variation.
To get a clearer picture of how the correlations vary as a function of the size of the primes, I set out to sample the sequence of primes over the entire range from 1-digit numbers to 400-digit numbers. In this project I decided to go Gauss one better: He tabulated primes by the chiliad (a group of 1,000), and I’ve been computing them by the myriad (a group of 10,000). To measure the correlations among primes mod m, I calculated the mean value of the diagonal elements of the matrix and the mean of the off-diagonal elements, then took the off/on ratio. If successive primes were totally uncorrelated, the ratio should converge to 1.
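As a concrete instance of the off/on ratio, here is a minimal sketch applied to the mod 3 matrix for 400-digit primes quoted earlier. The function name `off_on_ratio` is an illustrative choice.

```julia
# Ratio of the mean off-diagonal count to the mean on-diagonal count.
function off_on_ratio(M)
    m = size(M, 1)
    on = [M[i, i] for i in 1:m]
    off = [M[i, j] for i in 1:m for j in 1:m if i != j]
    return (sum(off) / length(off)) / (sum(on) / length(on))
end

# Successor counts for a million 400-digit primes mod 3 (classes 1 and 2).
M3 = [248291 251128;
      251127 249453]
println(off_on_ratio(M3))
```

For this matrix the ratio comes out a little above 1.009: small, but the excess amounts to a few thousand pairs out of a million.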
Figure 26 shows the result for 797 myriads of primes mod 3. The curve is concave upward, with a steep initial decline and then a much flatter segment. Starting at about 100 digits, there are samples with off/on ratios of less than 1, meaning that the diagonal is more densely populated than the off-diagonal regions. But even at 400 digits the majority of the ratios are still above 1. What are we looking at here? Does the curve slowly approach a ratio of 1, or is there a limiting value slightly greater than 1? Unfortunately, computational experiments will not give a definitive answer.
Figure 26.
The paper by Lemke Oliver and Soundararajan brings quite different tools to bear on this problem. Although they do some numerical exploration, their focus is on finding an analytic solution. The goal is a mathematical function or formula whose inputs are four positive integers: m is a modulus, a and b are congruence classes of primes mod m, and x is an upper limit on the size of the primes. The formula should tell us how often a is followed by b among all primes smaller than x. If we were in possession of such a formula, we could color in all the squares in the m × m successor matrix without ever having to compute the actual primes up to x.
Describing the behavior of all primes up to x is far more challenging than taking a sample of primes in the neighborhood of x. And the analytic approach is harder in other ways: It requires ideas rather than just CPU cycles. The reward is also potentially greater. The equation \(A = \pi r^2\) yields an exact truth about all circles, something no finite series of computations (with a finite approximation to \(\pi\)) can give us. There’s the promise not just of rigor but of insight.
Sadly, I’ve not yet been able to gain much insight from reading the analysis of Lemke Oliver and Soundararajan. The blame lies mainly with gaping holes in my knowledge of analytic number theory, but I think it’s also fair to say that the math gets pretty hairy at certain points in this discourse. The equation below constitutes the Main Conjecture of Lemke Oliver and Soundararajan. (I have made a minor change of notation and simplified one aspect of the equation: The original applies to sequences of r consecutive primes, but this version describes pairs only, i.e., \(r = 2\).)
\[\pi(x; m, a, b) = \frac{\mathrm{li}(x)}{\phi(m)^2}\left(1 + c_1(m; a, b)\frac{\log \log x}{\log x} + c_2(m; a, b) \frac{1}{\log x} + O\Big( \frac{1}{(\log x)^{7/4}} \Big) \right)\]
I think I understand enough of what’s going on here to at least offer a glossary. To the left of the equal sign, \(\pi(x; m, a, b)\) denotes a counting function; whereas \(\pi(x)\) counts the primes up to \(x\), \(\pi(x; m, a, b)\) counts the pairs of consecutive primes up to \(x\) in which the first prime falls into congruence class \(a\) mod \(m\) and the second into class \(b\). To the right of the equal sign, the main coefficient \(\mathrm{li}(x) / \phi(m)^2\) is essentially the mean or expected number of pairs if the primes were distributed uniformly at random, with no correlations between successive primes. Here \(\mathrm{li}(x)\) is the logarithmic integral of \(x\), an approximation to \(\pi(x)\), and \(\phi(m)\) is the Euler totient function, which counts the congruence classes mod \(m\) that are open to primes; its square, \(\phi(m)^2\), is the number of elements in the successor matrix.
The leading term inside the large parentheses is simply \(1\), and so it takes on the value of the main coefficient \(\mathrm{li}(x) / \phi(m)^2\); thus the mean number of pairs \((a, b)\) becomes the first approximation to the counting function. The three following terms act as corrections to this first approximation; for large \(x\) they should get progressively smaller, because \(\log \log x / \log x \gt 1 / \log x \gt 1 / (\log x)^{7/4}\) whenever \(x \gt e^e \approx 15\).
What about the coefficients of those three correction terms? The notation \(O(\cdot)\) for the smallest term indicates that we’re only going to worry about the term’s order of magnitude, which will be small for large \(x\). The coefficient \(c_1\) takes the following form in the case of \(r = 2\):
\[c_1(m; a, b) = \frac{1}{2} - \frac{\phi(m)}{2} (\#\{a \equiv b \bmod m\})\]
The expression \(\#\{a \equiv b \bmod m\}\) counts the number of cases where \(a\) and \(b\) lie in the same congruence class mod \(m\). Thus the effect of the term (if I understand correctly) is to reduce the overall count along the matrix diagonal, where \(a \equiv b \bmod m\).
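To see the sign of this effect, the sketch below evaluates \(c_1\) and the first correction term only. It is deliberately incomplete: the \(c_2\) term and the \(O(\cdot)\) term are left out, so the printed factors are not estimates of the true counts. The function names and the choice \(x = 10^8\) are assumptions made for illustration.

```julia
# Euler's totient: how many of 1..m are coprime to m.
totient(m) = count(k -> gcd(k, m) == 1, 1:m)

# c_1 for pairs (r = 2), following the formula above: the indicator is 1
# when a and b are in the same congruence class mod m, and 0 otherwise.
c1(m, a, b) = 1/2 - (totient(m) / 2) * (mod(a - b, m) == 0 ? 1 : 0)

# First correction only: 1 + c_1 * log(log x) / log x.
first_order_factor(m, a, b, x) = 1 + c1(m, a, b) * log(log(x)) / log(x)

x = 1e8
println("mod 3, on diagonal:  ", first_order_factor(3, 1, 1, x))
println("mod 3, off diagonal: ", first_order_factor(3, 1, 2, x))
```

On the diagonal \(c_1\) is negative and off the diagonal it is \(1/2\), which agrees with the reading that the term depletes the diagonal. For larger moduli the on-diagonal value of \(c_1\) grows with \(\phi(m)\), so this single term cannot stand in for the whole formula.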
As for coefficient \(c_2\), Lemke Oliver and Soundararajan remark that “in general, [it] seems complicated.” Indeed it does. And so this is the place where I should encourage those readers who want to know more to go read the original.
The complexity of the mathematical treatment leaves me feeling frustrated, but it’s hardly unusual for an easily stated problem to require a deep and difficult solution. I hang onto the hope that some of the technicalities will be brushed aside and the main ideas will emerge more clearly with further work. In the meantime, it’s still possible to explore a fascinating and long-hidden corner of number theory with the simplest of computational tools and a bit of graphics.
“God may not play dice with the universe, but something strange is going on with the prime numbers”—so said Paul Erdős and/or Mark Kac, though only with a little help from Carl Pomerance. The strangeness seems to be at its strangest when we play dice with the primes.
Addendum 2016-06-14. I noted above that the distribution of primes mod 7 seems flatter, or more nearly uniform, than the result of rolling a fair die. John D. Cook has taken a chi-squared test to the data and shows that the fit to uniform distribution is way too good to be the plausible outcome of a random process. His first post deals with the specific case of primes modulo 7; his second post considers other moduli.
References
Ash, Avner, Laura Beltis, Robert Gross, and Warren Sinnott. 2011. Frequencies of successive pairs of prime residues. Experimental Mathematics 20(4):400–411.
Chebyshev, Pafnuty Lvovich. 1853. Lettre de M. le Professeur Tchébychev à M. Fuss sur un nouveau théorème relatif aux nombres premiers contenus dans les formes 4n + 1 et 4n + 3. Bulletin de la Classe Physico-mathématique de l’Académie Impériale des Sciences de Saint-Pétersbourg 11:208. Google Books
Cramér, Harald. 1936. On the order of magnitude of the difference between consecutive prime numbers. Acta Arithmetica 2:23–46. PDF
Derbyshire, John. 2002. Chebyshev’s bias.
Granville, Andrew. 1995. Harald Cramér and the distribution of prime numbers. Harald Cramér Symposium, Stockholm, 1993. Scandinavian Actuarial Journal 1:12–28. PDF
Granville, Andrew, and Greg Martin. 2004. Prime number races. arXiv
Hamza, Kais, and Fima Klebaner. 2012. On the statistical independence of primes. The Mathematical Scientist 37:97–105.
Klarreich, Erica. 2016. Mathematicians discover prime conspiracy. Quanta.
Knapowski, S., and P. Turán. 1977. On prime numbers ≡ 1 resp. 3 mod 4. In Number Theory and Algebra: Collected Papers Dedicated to Henry B. Mann, Arnold E. Ross, and Olga Taussky-Todd, pp. 157–165. Edited by Hans Zassenhaus. New York: Academic Press.
Ko, Chung-Ming. 2002. Distribution of the units digit of primes. Chaos Solitons Fractals 13(6):1295–1302.
Lamb, Evelyn. 2016. Peculiar pattern found in ‘random’ prime numbers. Nature doi:10.1038/nature.2016.19550.
Lemke Oliver, Robert J., and Kannan Soundararajan. 2016 preprint. Unexpected biases in the distribution of consecutive primes. arXiv
Odlyzko, Andrew, Michael Rubinstein, and Marek Wolf. 1999. Jumping champions. Experimental Mathematics 8(2):107–118.
Rubinstein, Michael, and Peter Sarnak. 1994. Chebyshev’s bias. Experimental Mathematics 3:173–197. Project Euclid
Tao, Terence. Structure and randomness in the prime numbers. PDF
Extrapolating the steep trend line of the past five years predicts a thousandfold increase in capacity by about 2012; in other words, today’s 120-gigabyte drive becomes a 120-terabyte unit.
Extending that same growth curve into 2016 would allow for another four doublings, putting us on the threshold of the petabyte disk drive (i.e., \(10^{15}\) bytes).
None of that has happened. The biggest drives in the consumer marketplace hold 2, 4, or 6 terabytes. A few 8- and 10-terabyte drives were recently introduced, but they are not yet widely available. In any case, 10 terabytes is only 1 percent of a petabyte. We have fallen way behind the growth curve.
The graph below extends an illustration that appeared in my 2002 article, recording growth in the areal density of disk storage, measured in bits per square inch:
The blue line shows historical data up to 2002 (courtesy of Edward Grochowski of the IBM Almaden Research Center). The bright green line represents what might have been, if the 1997–2002 trend had continued. The orange line shows the real status quo: We are three orders of magnitude short of the optimistic extrapolation. The growth rate has returned to the more sedate levels of the 1970s and 80s.
What caused the recent slowdown? I think it makes more sense to ask what caused the sudden surge in the 1990s and early 2000s, since that’s the kink in the long-term trend. The answers lie in the details of disk technology. More sensitive read heads developed in the 90s allowed information to be extracted reliably from smaller magnetic domains. Then there was a change in the geometry of the domains: the magnetic axis was oriented perpendicular to the surface of the disk rather than parallel to it, allowing more domains to be packed into the same surface area. As far as I know, there have been no comparable innovations since then, although a new writing technology is on the horizon. (It uses a laser to heat the domain, making it easier to change the direction of magnetization.)
As the pace of magnetic disk development slackens, an alternative storage medium is coming on strong. Flash memory, a semiconductor technology, has recently surpassed magnetic disk in areal density; Micron Technology reports a laboratory demonstration of 2.7 terabits per square inch. And Samsung has announced a flash-based solid-state drive (SSD) with 15 terabytes of capacity, larger than any mechanical disk drive now on the market. SSDs are still much more expensive than mechanical disks (by a factor of 5 or 10), but they offer higher speed and lower power consumption. They also offer the virtue of total silence, which I find truly golden.
Flash storage has replaced spinning disks in about a quarter of new laptops, as well as in all phones and tablets. It is also increasingly popular in servers (including the machine that hosts bit-player.org). Do disks have a future?
In my sentimental moments, I’ll be sorry to see spinning disks go away. They are such jewel-like marvels of engineering and manufacturing prowess. And they are the last link in a long chain of mechanical contrivances connecting us with the early history of computing—through Turing’s bombe and Babbage’s brass gears all the way back to the Antikythera mechanism two millennia ago. From here on out, I suspect, most computers will have no moving parts.
Maybe in a decade or two the spinning disk will make a comeback, the way vinyl LPs and vacuum tube amplifiers have. “Data that comes off a mechanical disk has a subtle warmth and presence that no solid-state drive can match,” the cognoscenti will tell us.
“You can never be too rich or too thin,” someone said. And a computer can never be too fast. But the demand for data storage is not infinitely elastic. If a file cabinet holds everything in the world you might ever want to keep, with room to spare, there’s not much added utility in having 100 or 1,000 times as much space.
In 2002 I questioned whether ordinary computer users would ever fill a 1-terabyte drive. Specifically, I expressed doubts that my own files would ever reach the million-megabyte mark. Several readers reassured me that data will always expand to fill the space available. I could only respond “We’ll see.” Fourteen years later, I now have the terabyte drive of my dreams, and it holds all the words, pictures, music, video, code, and whatnot I’ve accumulated in a lifetime of obsessive digital hoarding. The drive is about half full. Or half empty. So I guess the outcome is still murky. I can probably fill up the rest of that drive, if I live long enough. But I’m not clamoring for more space.
One factor that has surely slowed demand for data storage is the emergence of cloud computing and streaming services for music and movies. I didn’t see that coming back in 2002. If you choose to keep some of your documents on Amazon or Azure, you obviously reduce the need for local storage. Moreover, offloading data and software to the cloud can also reduce the overall demand for storage, and thus the global market for disks or SSDs. A typical movie might take up 3 gigabytes of disk space. If a million people load a copy of the same movie onto their own disks, that’s 3 petabytes. If instead they stream it from Netflix, then in principle a single copy of the file could serve everyone.
In practice, Netflix does not store just one copy of each movie in some giant central archive. They distribute rack-mounted storage units to hundreds of internet exchange points and internet service providers, bringing the data closer to the viewer; this is a strategy for balancing the cost of storage against the cost of communications bandwidth. The current generation of the Netflix Open Connect Appliance has 36 disk drives of 8 terabytes each, plus 6 SSDs that hold 1 terabyte each, for a total capacity of just under 300 terabytes. (Even larger units are coming soon.) In the Netflix distribution network, files are replicated hundreds or thousands of times, but the total demand for storage space is still far smaller than it would be with millions of copies of every movie.
A recent blog post by Eric Brewer, Google’s vice president for infrastructure, points out:
The rise of cloud-based storage means that most (spinning) hard disks will be deployed primarily as part of large storage services housed in data centers. Such services are already the fastest growing market for disks and will be the majority market in the near future. For example, for YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires more than one petabyte (1M GB) of new storage every day or about 100x the Library of Congress.
Thus Google will not have any trouble filling up petabyte drives. An accompanying white paper argues that as disks become a data center specialty item, they ought to be redesigned for this environment. There’s no compelling reason to stick with the present physical dimensions of 2½ or 3½ inches. Moreover, data-center disks have different engineering priorities and constraints. Google would like to see disks that maximize both storage capacity and input-output bandwidth, while minimizing cost; reliability of individual drives is less critical because data are distributed redundantly across thousands of disks.
The white paper continues:
An obvious question is why are we talking about spinning disks at all, rather than SSDs, which have higher [input-output operations per second] and are the “future” of storage. The root reason is that the cost per GB remains too high, and more importantly that the growth rates in capacity/$ between disks and SSDs are relatively close . . . , so that cost will not change enough in the coming decade.
If the spinning disk is remodeled to suit the needs and the economics of the data center, perhaps flash storage can become better adapted to the laptop and desktop environment. Most SSDs today are plug-compatible replacements for mechanical disk drives. They have the same physical form, they expect the same electrical connections, and they communicate with the host computer via the same protocols. They pretend to have a spinning disk inside, organized into tracks and sectors. The hardware might be used more efficiently if we were to do away with this charade.
Or maybe we’d be better off with a different charade: Instead of dressing up flash memory chips in the disguise of a disk drive, we could have them emulate random access memory. Why, after all, do we still distinguish between “memory” and “storage” in computer systems? Why do we have to open and save files, launch and shut down applications? Why can’t all of our documents and programs just be everpresent and always at the ready?
In the 1950s the distinction between memory and storage was obvious. Memory was the few kilobytes of magnetic cores wired directly to the CPU; storage was the rack full of magnetic tapes lined up along the wall on the far side of the room. Loading a program or a data file meant finding the right reel, mounting it on a drive, and threading the tape through the reader and onto the take-up reel. In the 1970s and 80s the memory/storage distinction began to blur a little. Disk storage made data and programs instantly available, and virtual memory offered the illusion that files larger than physical memory could be loaded all in one go. But it still wasn’t possible to treat an entire disk as if all the data were present in memory. The processor’s address space wasn’t large enough. Early Intel chips, for example, used 20-bit addresses, and therefore could not deal with code or data segments larger than \(2^{20} \approx 10^6\) bytes.
We live in a different world now. A 64-bit processor can potentially address \(2^{64}\) bytes of memory, or 16 exabytes (i.e., 16,000 petabytes). Most existing processor chips are limited to 48-bit addresses, but this still gives direct access to 281 terabytes. Thus it would be technically feasible to map the entire content of even the largest disk drive onto the address space of main memory.
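The address-space arithmetic is easy to reproduce, with one trap: in Julia, `2^64` computed with ordinary 64-bit integers silently wraps around to 0, so the sketch below switches to arbitrary-precision integers. The variable names are illustrative.

```julia
# 2^64 overflows a 64-bit Int (it wraps around to 0), so use BigInt instead.
bytes64 = big(2)^64
bytes48 = big(2)^48

println("64-bit address space: ", bytes64, " bytes, about ",
        round(Float64(bytes64) / 1e18, digits = 1), " × 10^18, or ",
        bytes64 ÷ big(2)^60, " units of 2^60 bytes")
println("48-bit address space: ", bytes48, " bytes, about ",
        round(Float64(bytes48) / 1e12, digits = 1), " × 10^12")
```

The figure of 16 exabytes counts binary units of \(2^{60}\) bytes; in decimal units the same space is roughly \(18.4 \times 10^{18}\) bytes. The 48-bit figure of 281 terabytes is already decimal.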
In current practice, reading from or writing to a location in main memory takes a single machine instruction. Say you have a spreadsheet open; the program can get the value of any cell with a load instruction, or change the value with a store instruction. If the spreadsheet file is stored on disk rather than loaded into memory, the process is quite different, involving not single instructions but calls to input-output routines in the operating system. First you have to open the file and read it as a one-dimensional stream of bytes, then parse that stream to recreate the two-dimensional structure of the spreadsheet; only then can you access the cell you care about. Saving the file reverses these steps: The two-dimensional array is serialized to form a linear stream of bytes, then written back to the disk. Some of this overhead is unavoidable, but the complex conversions between serialized files on disk and more versatile data structures in memory could be eliminated. A modern processor could address every byte of data—whether in memory or storage—as if it were all one flat array. Disk storage would no longer be a separate entity but just another level in the memory hierarchy, turning what we now call main memory into a new form of cache. From the user’s point of view, all programs would be running all the time, and all documents would always be open.
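Operating systems already offer a limited version of this idea in the form of memory-mapped files. The following sketch uses Julia’s standard `Mmap` module; the file name `sheet.bin` and the bare 3 × 4 grid of floating-point cells are hypothetical stand-ins for a real spreadsheet format, which would be far more elaborate.

```julia
using Mmap

# Hypothetical file holding a "spreadsheet" as a plain 3 × 4 grid of Float64.
path = joinpath(mktempdir(), "sheet.bin")
rows, cols = 3, 4

# Conventional route: serialize the whole grid to a byte stream on disk.
open(path, "w") do io
    write(io, zeros(Float64, rows, cols))
end

# Memory-mapped route: the file's bytes appear as an ordinary matrix.
io = open(path, "r+")
sheet = Mmap.mmap(io, Matrix{Float64}, (rows, cols))
sheet[2, 3] = 42.0        # a plain store; no explicit write() call
Mmap.sync!(sheet)         # ask the operating system to flush the change
close(io)

# Reading the file back the conventional way shows the stored value.
check = Array{Float64}(undef, rows, cols)
open(path) do io
    read!(io, check)
end
println(check[2, 3])
```

The mapping works here only because the file layout is identical to the in-memory layout of the matrix. A format that needs parsing would not map so directly, which is exactly the conversion overhead the paragraph above describes.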
Is this notion of merging memory and storage an attractive prospect or a nightmare? I’m not sure. There are some huge potential problems. For safety and sanity we generally want to limit which programs can alter which documents. Those rules are enforced by the file system, and they would have to be re-engineered to work in the memory-mapped environment.
Perhaps more troubling is the cognitive readjustment required by such a change in architecture. Do we really want everything at our fingertips all the time? I find it comforting to think of stored files as static objects, lying dormant on a disk drive, out of harm’s way; open documents, subject to change at any instant, require a higher level of alertness. I’m not sure I’m ready for a more fluid and frenetic world where documents are laid aside but never put away. But I probably said the same thing 30 years ago when I first confronted a machine capable of running multiple programs at once (anyone remember MultiFinder?).
The dichotomy between temporary memory and permanent storage is certainly not something built into the human psyche. I’m reminded of this whenever I help a neophyte computer user. There’s always an incident like this:
“I was writing a letter last night, and this morning I can’t find it. It’s gone.”
“Did you save the file?”
“Save it? From what? It was right there on the screen when I turned the machine off.”
Finally the big questions: Will we ever get our petabyte drives? How long will it take? What sorts of stuff will we keep on them when the day finally comes?
The last time I tried to predict the future of mass storage, extrapolating from recent trends led me far astray. I don’t want to repeat that mistake, but the best I can suggest is a longer-term baseline. Over the past 50 years, the areal density of mass-storage media has increased by seven orders of magnitude, from about \(10^5\) bits per square inch to about \(10^{12}\). That works out to about seven years for a tenfold increase, on average. If that rate is an accurate predictor of future growth, we can expect to go from the present 10 terabytes to 1 petabyte in about 15 years. But I would put big error bars around that number.
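The arithmetic behind that estimate, as a short sketch. It assumes that drive capacity scales in direct proportion to areal density, which the text implies but does not state, and the variable names are illustrative.

```julia
# Long-term baseline from the text: about 10^5 to about 10^12 bits per
# square inch over roughly 50 years, i.e. seven tenfold steps.
years_per_tenfold = 50 / (12 - 5)

# Going from 10 terabytes to 1 petabyte is a factor of 100: two tenfold steps.
# Assumption: capacity grows in step with areal density.
steps = log10(1e15 / 10e12)

println(round(years_per_tenfold, digits = 1), " years per tenfold increase")
println(round(steps * years_per_tenfold, digits = 1), " years from 10 terabytes to 1 petabyte")
```

The result is a little over 14 years, consistent with the rounded figure of about 15.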
I’m even less sure about how those storage units will be used, if in fact they do materialize. In 2002 my skepticism about filling up a terabyte of personal storage was based on the limited bandwidth of the human sensory system. If the documents stored on your disk are ultimately intended for your own consumption, there’s no point in keeping more text than you can possibly read in a lifetime, or more music than you can listen to, or more pictures than you can look at. I’m now willing to concede that a terabyte of information may not be beyond human capacity to absorb. But a petabyte? Surely no one can read a billion books or watch a million hours of movies.
This argument still seems sound to me, in the sense that the conclusion follows if the premise is correct. But I’m no longer so sure about the premise. Just because it’s my computer doesn’t mean that all the information stored there has to be meant for my eyes and ears. Maybe the computer wants to collect some data for its own purposes. Maybe it’s studying my habits or learning to recognize my voice. Maybe it’s gathering statistics from the refrigerator and washing machine. Maybe it’s playing go, or gossiping over some secret channel with the Debian machine across the alley.
We’ll see.